|
|
|||||||
I googled web scraping and it returns a lot of hits using Python, VBScript, Twill?, etc. What I want to do is to pull HTML from webpages so I can parse them to find data I need to validate testing. A former co-worker used COM in a VBScript : Code:
I had a really tough time finding out the properties and methods of the object in use. Can someone tell me if the above code works for what I need or tell me the best way to parse HTML using KiXtart? Thanks! |
||||||||
|
|
|||||||
what do you really want? that code above is not the right way if you want to get the whole of the html of a page. for that, a simple 3 line kixtart script will do just fine. |
||||||||
|
|
|||||||
Quote: Can you show me the three lines of code? |
||||||||
|
|
|||||||
$rc=SetOption('WrapAtEOL','on')
$http=CreateObject("microsoft.xmlhttp")
$http.Open("GET","http://www.kixtart.org/",Not 1)
$http.send
$value=$http.responsebody
$value ? |
||||||||
|
|
|||||||
sorry, it takes four: Code:
thanks les. |
||||||||
|
|
|||||||
Thanks guys. How do you set the proxy configuration? |
||||||||
|
|
|||||||
it was a single registry value... just can't remember which one. |
||||||||
|
|
|||||||
Quote: Oh it can't be done with a property set for xmlhttp object? |
||||||||
|
|
|||||||
no. but you can avoid the registry value by always pulling a different url. that is, add something extra to the end, like "?some=fake&values=here" so the above example becomes: Code:
anyways, the registry setting is the best choice. it's a per user setting and you can always reset it once you don't need it anymore. |
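
The "always a different url" trick described above can be sketched outside KiXtart as well. The following is a minimal Python illustration, not code from this thread: the function name `add_cache_buster` and the parameter name `nocache` are made up for the example. It appends a throwaway query parameter and keeps any query string the URL already carries.

```python
import time
from urllib.parse import urlsplit, urlunsplit


def add_cache_buster(url, token=None):
    """Return url with an extra throwaway query parameter appended.

    'nocache' is an arbitrary name chosen for this sketch; the server is
    expected to ignore it, while caches see a URL they have not stored.
    """
    if token is None:
        # Milliseconds since the epoch: different on practically every call.
        token = str(int(time.time() * 1000))
    parts = urlsplit(url)
    extra = "nocache=" + token
    # Keep an existing query string (e.g. "?Login") and append with "&".
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit(parts._replace(query=query))


if __name__ == "__main__":
    print(add_cache_buster("http://www.kixtart.org/"))
```

Note that this only defeats caching. It does not change which proxy, if any, the request goes through.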
||||||||
|
|
|||||||
Quote: I tried entering in fake values, but it didn't work. I think I may have a special case. The url contains a dll reference. http://[ipaddress]/[navigation]/[dllname]?Login ex: http://10.10.10.1/kix/kix64.dll?Login |
||||||||
|
|
|||||||
think you best go with registry fix. what the fake does is force the update of the page. it does not override the proxy settings. you can disable the proxy for the script execution time and get rid of the issue. |
||||||||
|
|
|||||||
Can you give me an idea of how to go about writing the registry fix? Thanks for helping! |
||||||||
|
|
|||||||
k, had to crawl a huge amount of historical pages, back to some year 2003 but here is what I found: Code:
not sure what the values are for disabling proxy and crap but guess you can google with the syncmode5 keyword. |
||||||||
|
|
|||||||
does xmlhttp object use the proxy settings found in Internet Explorer Internet Settings? either way, i tried a test w/ the registry hack and w/o, but $http.responsebody is still returning null value. |
||||||||
|
|
|||||||
yes, xmlhttp is part of IE. and what registry value did you use? I know, the above setting does not bypass the proxy, it just forces a check of new data every time. think I said, you can go googling for the correct value yourself. |
||||||||
|
|
|||||||
According to Scripting Guy, SyncMode5 can be set to one of four possible values: Every visit to the page = 3, Every time you start Internet Explorer = 2, Automatically = 4, Never = 0. I've tried all values, but none of them work. Here is my code : Code:
|
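
The four SyncMode5 values listed above can be kept in a small lookup so a script does not hard-code magic numbers. The sketch below is Python rather than KiXtart. The key path under HKEY_CURRENT_USER and the REG_DWORD type are assumptions based on where Internet Explorer normally keeps its per-user Internet Settings, and `set_sync_mode` is a made-up helper name. As noted earlier in the thread, this setting controls how often pages are re-checked, not whether a proxy is used.

```python
import sys

# Values as quoted in the thread from the Scripting Guy column.
SYNC_MODE5 = {
    "every_visit": 3,    # Every visit to the page
    "every_start": 2,    # Every time you start Internet Explorer
    "automatically": 4,  # Automatically
    "never": 0,          # Never
}

# Assumed location of the per-user setting (not confirmed in this thread).
KEY_PATH = r"Software\Microsoft\Windows\CurrentVersion\Internet Settings"


def set_sync_mode(mode):
    """Write SyncMode5 for the current user and return the value written."""
    value = SYNC_MODE5[mode]  # KeyError for an unknown mode name
    if sys.platform != "win32":
        raise OSError("the registry is only available on Windows")
    import winreg  # imported here so the module still loads elsewhere
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "SyncMode5", 0, winreg.REG_DWORD, value)
    return value
```

Reading the old value first and writing it back when the script finishes keeps the change temporary.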
||||||||
|
|
|||||||
ok, let me ask, what is this dll? if you have the dll custom made there, why you need to pull out the html it produces? why can't you directly give it the info it wants or why don't you ask it directly? |
||||||||
|
|
|||||||
Quote: Hmmm, good question. Unfortunately I have no idea what's inside the dll. I'm new in the QA team. The testing tool we use can capture the HTML content, but I need KiX or some other third-party tool to do it, so I can run it w/o the need to install the testing tool and parse the HTML content to pull the version spec. Is there any way to programmatically mimic the feature for viewing the source in IE (View > Source)? |
||||||||
|
|
|||||||
the responsebody is exactly it. but thought your problem was the proxy? if you disable the proxy in ie and try your script, does it still fail to get the source? |
||||||||
|
|
|||||||
i currently have a proxy set for internet access. it looks like the website i want to scrape is an intranet that i can access w/o the proxy setup. so this is a different issue? i've tried disabling the proxy and tried all SyncMode5 values, but still i'm getting nothing back. anything special i need to do for intranet sites that use dll? |
||||||||
|
|
|||||||
add some error output: Code:
and what the error is: Code:
|
||||||||
|
|
|||||||
Unfortunately, console returned "exited without errors." |
||||||||
|
|
|||||||
k, download wget and try to pull the same addy with verbose output. what does it give you? |
||||||||
|
|
|||||||
excellent! it returned the html text to a file. what does this do that COM doesn't? |
||||||||
|
|
|||||||
It makes COM calls and API calls too, it's just more purpose built than the exposed elements of the COM object you're using here via KiXtart. |
||||||||
|
|
|||||||
so I'm guessing there is no solution in KiXtart? I'm satisfied with wget, but if there is a solution in KiXtart, I'm more than welcome to hear it. Thanks! |
||||||||
|
|
|||||||
Well we don't work there where you do, so we don't know the inner workings of your network and what all might be causing an issue. There are some things that KiXtart just can't natively do and socket manipulation is one of them. If WGET is doing what you want, then nothing wrong with it until such time as maybe KiXtart does support it. I think KiXforms has some support but how much I'm not sure. |
||||||||
|
|
|||||||
why I suggested wget is that if you pull a page with it in verbose mode, you will see if the server forces a page redirect or otherwise some odd behavior. oh, and xp security fix or your antivirus both can block the script as virus. |
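
What wget's verbose mode shows about redirects can also be recorded in a script. A rough Python sketch follows; `RedirectLogger` and `trace_redirects` are names invented for this example. It subclasses the standard `HTTPRedirectHandler` so that every hop the server forces is noted before it is followed.

```python
import urllib.request


class RedirectLogger(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember each hop."""

    def __init__(self):
        self.hops = []  # list of (status code, from URL, to URL)

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, req.full_url, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def trace_redirects(url, timeout=10):
    """Return (final status, final URL, list of hops) for url."""
    logger = RedirectLogger()
    opener = urllib.request.build_opener(logger)
    with opener.open(url, timeout=timeout) as resp:
        return resp.status, resp.geturl(), logger.hops
```

An empty hop list means the server answered the original URL directly, so a redirect is not the reason for an empty response.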
||||||||
|
|
|||||||
oh, and did you try only one of responsebody and responsetext? or both and neither returned anything? don't remember what the one was with which you can return the response headers. |
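
On the xmlhttp object, responseBody is the raw byte array, responseText is the same data decoded to a string, and getAllResponseHeaders returns the response headers. The Python sketch below shows the same three pieces side by side; `fetch` and `charset_from_content_type` are hypothetical helper names, and falling back to UTF-8 when no charset is declared is an assumption of this example.

```python
import urllib.request


def charset_from_content_type(content_type, default="utf-8"):
    """Pull the charset out of a Content-Type header value, if present."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip('"')
    return default


def fetch(url, timeout=10):
    """Return (status, headers, raw bytes, decoded text) for url."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        status = resp.status
        content_type = resp.headers.get("Content-Type", "")
        headers = dict(resp.headers.items())
        raw = resp.read()  # counterpart of responseBody
    # Counterpart of responseText: the bytes decoded with the declared charset.
    text = raw.decode(charset_from_content_type(content_type), errors="replace")
    return status, headers, raw, text
```

Printing the status and headers first is a quick way to tell "the server sent nothing" apart from "the body could not be displayed".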
||||||||
|
|
|||||||
I used both and it returned nothing back. Will I be notified if XP or anti-virus blocked a script? |
||||||||
|
|
|||||||
usually not. xp and some antiviruses may block stuff silently as being part of their ultimate security feature. |
||||||||
|
|
|||||||
Yep, rarely will a product accurately notify you why you can't remotely access it. It could be many things. On some of my systems with the Symantec VPN installed I can't remotely connect unless the driver is disabled. It just tells me the "Network path can not be found" Most firewalls are designed to not respond either on purpose so that the querying system won't send new or varied queries trying to find something open which is the right thing for them to do. |
||||||||
|
|
|||||||
*sigh* how frustrating. alas, i will use wget to extract the page. thanks for your help guys! |
||||||||
|
|
|||||||
hey, pearly... thought I asked it already, but... does wget verbose mode give you some insight what is happening? is there some redirection or something going on? |
||||||||
|
|
|||||||
This is what it says : Code: --10:13:08-- http://11.111.111.11/testing/xyz.dll?login |
||||||||
|
|
|||||||
how is your commandline for wget? the interesting part on the reply is: "=> `xyz.dll@login.1'" |
||||||||
|
|
|||||||
it's nothing unique. wget --verbose http://11.111.111.11/testing/xyz.dll?login |
||||||||
|
|
|||||||
so, do you get something with: http://11.111.111.11/testing/xyz.dll@login.1 |
||||||||
|
|
|||||||
'xyz.dll@login.1' is saved to the same location as the wget program and in it, is the full source code (of the login page). i tried wget --verbose http://11.111.111.11/testing/xyz.dll@login.1 but it gives me the following : Code:
also i couldn't get past the login screen with this program. i'm not sure how to enter in the login information so i can access the other pages. |
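
A note on the name `xyz.dll@login.1`: it is a local file name, not a URL. Windows builds of wget write `@` where the URL has `?`, because `?` is not allowed in Windows file names, and wget appends `.1`, `.2`, and so on when a file with the same name already exists. That is why requesting the `@` form from the server gives a 404. The naming rule can be imitated roughly as follows; this is a simplified Python sketch with a made-up function name, not wget's actual implementation.

```python
from urllib.parse import urlsplit


def wget_style_name(url, existing=()):
    """Approximate the local file name a Windows wget picks for url.

    existing: names already present in the download directory.
    """
    parts = urlsplit(url)
    name = parts.path.rsplit("/", 1)[-1] or "index.html"
    if parts.query:
        # '?' cannot appear in a Windows file name, so '@' stands in for it.
        name += "@" + parts.query
    candidate, n = name, 0
    while candidate in existing:
        # Do not overwrite an earlier download: add .1, .2, ...
        n += 1
        candidate = f"{name}.{n}"
    return candidate


if __name__ == "__main__":
    print(wget_style_name("http://11.111.111.11/testing/xyz.dll?login",
                          existing={"xyz.dll@login"}))
```

The `.1` suffix therefore just means the same page had been downloaded once before into that folder.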
||||||||
|
|
|||||||
ja, I think I asked in the very beginning of this thread what you really want to achieve. getting the html does not help if you in fact want to pass information. anyway, logically, if that addy didn't work, did you try the obvious: http://11.111.111.11/testing/xyz.dll@login |
||||||||
|
|
|||||||
I get the same results as before: ERROR 404. Well, the login page was the main purpose for this project, but I thought wget could be leveraged for other usage. I guess I'll have to find some other way for secured pages. |
||||||||
|
|
|||||||
hmm... not sure you know still what you really want. or I don't know all the details still. what does it help to get the html of a login page? you get it once, don't you know it after that? |
||||||||
|
|
|||||||
The way Pearly keeps dancing around the truth, I think there is some malfeasance here. It looks to me that he is trying to steal passwords off the logon page. |
||||||||
|
|
|||||||
could be, could be. I'm still trusting and think that he is thinking the problem at hand too much piece by piece and not seeing the bigger picture and the steps needed. |
||||||||
|
|
|||||||
Quote: Hehe, unfortunately this is not the case. As said before, this is for my job as a QA Engineer. We parse HTML all the time, but the methods we use aren't the most efficient and efficacious. The reason for pulling the source from the login page is to retrieve the dll versions posted on the page. Developers display the versions and I need this for my testing validation. With the code I posted in my first post of this thread, I was able to parse HTML Tables once past the login page, but it's only working in VBA/VBS (see my other thread : http://www.kixtart.org/ubbthreads/showflat.php?Cat=0&Number=162816&an=0&page=0#162816) So all in all, I figured out how to pull data from HTML Tables and can access the dll versions. I can live with this for now. I'm not sure if wget or KiX COM is capable of accessing secured pages or possibly hook onto an existing browser that is already past the login page? I hope that answers what I'm trying to do. |
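
Once the login page source is in a file (for example the one wget saved), pulling out the dll versions is a text search. The Python sketch below is only an illustration: the page layout is not shown anywhere in this thread, so the pattern assumes each dll name is followed within a few characters by a dotted version number, and the sample markup in the usage lines is invented.

```python
import re

# Assumed layout: "<name>.dll", up to 40 non-digit characters (tags, labels),
# then a dotted version such as 1.2.3.4. Adjust to the real page.
VERSION_RE = re.compile(
    r"([\w.-]+\.dll)\D{0,40}?(\d+(?:\.\d+){1,3})", re.IGNORECASE
)


def find_dll_versions(html):
    """Return {dll name: version string} for every match in html."""
    return {name: version for name, version in VERSION_RE.findall(html)}


if __name__ == "__main__":
    sample = "<tr><td>xyz.dll</td><td>1.2.3.4</td></tr>"  # hypothetical markup
    print(find_dll_versions(sample))
```

A regular expression is enough for a fixed, known page. For anything with a changing layout, walking the tables is more robust.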
||||||||
|
|
|||||||
yes, and yes. you can use things like internetexplorer.application to get 100% IE compatibility. |
||||||||
|
|
|||||||
I know how to hook onto an existing browser, but then what do you do to access the source? |
||||||||
|
|
|||||||
can't really access the source. you need to access the elements and their data. |
||||||||
|
|
|||||||
hmm ok. I've converted the code in this thread : http://www.kixtart.org/ubbthreads/showflat.php?Cat=0&Number=162816&an=0&page=0#162816 to KiX. Any suggestions on how to easily identify parent/children relationships between objects, collections, and such? |
||||||||
|
|
|||||||
children you get via .items collection. don't remember can you access the parent as simply as by .parent |
||||||||
|
|
|||||||
take this example for instance :
$objWinShell = CreateObject("Shell.Application")
$obj = $objWinShell.Windows.Item(1).Document.All.Tags("TABLE").Item(6).Rows(0).Cells(0).InnerText
How do I know the relationship between all these objects/collections/properties? Is there a property that lists all objects/collections/properties for that object? |
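
The chain in that example reads as: every TABLE in the document, in document order, then a row of that table, then a cell of that row, then the cell's text. The same table, row, cell hierarchy can be rebuilt from saved HTML without a browser. The following Python sketch uses only the standard library; `TableCollector` and `extract_tables` are names made up here, and it works on static HTML text, not on the live IE document.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every <table>, in document order."""

    def __init__(self):
        super().__init__()
        self.tables = []  # tables[i][row][col] -> cell text
        self._open = []   # stack of tables still being parsed

    def _close_cell(self):
        table = self._open[-1]
        if table["cell"] is not None:
            table["rows"][-1].append("".join(table["cell"]).strip())
            table["cell"] = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table = {"rows": [], "cell": None}
            self.tables.append(table["rows"])
            self._open.append(table)
        elif self._open and tag == "tr":
            self._close_cell()
            self._open[-1]["rows"].append([])
        elif self._open and tag in ("td", "th"):
            self._close_cell()
            if not self._open[-1]["rows"]:
                self._open[-1]["rows"].append([])  # cell without a <tr>
            self._open[-1]["cell"] = []

    def handle_endtag(self, tag):
        if not self._open:
            return
        if tag in ("td", "th", "tr"):
            self._close_cell()
        elif tag == "table":
            self._close_cell()
            self._open.pop()

    def handle_data(self, data):
        # Text goes to the innermost open cell (nested tables keep their own).
        if self._open and self._open[-1]["cell"] is not None:
            self._open[-1]["cell"].append(data)


def extract_tables(html):
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables
```

With this, `extract_tables(html)[6][0][0]` plays the role of `Tags("TABLE").Item(6).Rows(0).Cells(0).InnerText`, with one simplification: text inside a nested table is stored with the nested table and not repeated in the outer cell.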
||||||||
|
|
|||||||
DOM questions are probably not in scope in this forum. I will seek my answers on my own. I appreciate all your help Jooel, and NTDOC. |