AllenAdministrator
(KiX Supporter)
2020-01-28 03:54 PM
Python/Selenium/HTML

Hey Guys, completely off topic, but I'm hoping someone could point me in the right direction. I've started using Python and Selenium to scrape web pages for bits of information for a larger script. I had been doing this with the GetPage() function, but with IE getting more and more long in the tooth, and with Chrome/Firefox now able to run in headless mode, I started trying to figure out how to get those data points a new way. I've gotten a bunch of them figured out, but I'm struggling with one that is probably easy.
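
Roughly what the headless setup looks like, in case anyone wants to follow along (a minimal sketch; it assumes chromedriver is on the PATH and Chrome is installed):

 Quote:
from selenium import webdriver

# launch Chrome without a visible window (Firefox works the same way via FirefoxOptions)
options = webdriver.ChromeOptions()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)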

Here is the relevant html from the website:
 Quote:
<html class="windows" lang="en" current-version="x.x.x" stable-versions="x.x.x">


I need to get the current-version value. The old way, I would have grabbed the entire html source with GetPage() and then parsed it. That still works, but with Selenium you have direct access to the html structure.
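
(For what it's worth, Selenium can still hand you the raw source to parse the old way; this one-liner assumes browser is the webdriver instance from the examples below:)

 Quote:
html = browser.page_source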

A simple example is:

 Quote:

from selenium import webdriver

browser = webdriver.Chrome()   # or a headless instance, as above
browser.get("https://www.libreoffice.org/download/download/")
x = browser.find_element_by_class_name("dl_version_number").text
print(x)


The other find methods are listed below (https://selenium-python.readthedocs.io/locating-elements.html); a couple of them are sketched in the example after the list:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
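
For example, the same element from the LibreOffice snippet above could be located a couple of other ways (the selectors here are just guesses at that page's markup, so treat them as illustrative):

 Quote:
# CSS selector form (assumed markup)
x = browser.find_element_by_css_selector(".dl_version_number").text
# XPath form (assumed markup)
x = browser.find_element_by_xpath("//*[contains(@class, 'dl_version_number')]").text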

So my question is... What method do I use to read that attribute value?


AllenAdministrator
(KiX Supporter)
2020-01-28 05:12 PM
Re: Python/Selenium/HTML

OMG... finally found it. :)

x=browser.find_element_by_tag_name("html").get_attribute("current-version")
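
Put together (a minimal sketch; the URL is just a placeholder for whatever page serves that html tag):

 Quote:
# browser is the headless Chrome/Firefox instance from earlier
browser.get("https://example.com")   # placeholder URL
version = browser.find_element_by_tag_name("html").get_attribute("current-version")
print(version)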


BradV
(Seasoned Scripter)
2020-01-30 01:49 PM
Re: Python/Selenium/HTML

Just a thought, but have you tried curl for Windows? curl is designed to be used from the command line or a script, so it might be a lot easier to use.

AllenAdministrator
(KiX Supporter)
2020-01-30 06:16 PM
Re: Python/Selenium/HTML

Never used curl, and will certainly look at it. However, now that I have invested so much time learning how to use Selenium, I'm going to commit to writing up some of the things I've learned. I want to add that the biggest reason for me going down the Selenium road was trying to get the html of pages generated with js or ajax. Most of the methods I tried would only return the html from before the js/ajax ran. I read about PhantomJS, which does return the complete html, but then I kept reading that it is no longer maintained. Finally, my research pointed me in the direction of Chrome/Firefox in headless mode. I never wanted anything more than the page html source, but with Selenium, not only can you get the page source, you can also find (parse) almost anything in the html very easily. The hard part was learning HOW to do it. Now that I've done that, I'm incorporating as much of it into KiX as I can. :)
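
As a taste, here is roughly the pattern for grabbing the post-js source (a sketch; the URL and the element id it waits on are placeholders):

 Quote:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)

browser.get("https://example.com")   # placeholder URL
# wait until the js/ajax content has actually rendered (placeholder element id)
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "content"))
)
html = browser.page_source           # the complete, post-js html
browser.quit()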

When I have time, I'll leave more details and examples.