pearly
(Getting the hang of it)
2006-05-23 08:42 PM
Web scraping in KiX

I googled web scraping and it returned a lot of hits for Python, VBScript, Twill?, etc. What I want to do is pull HTML from web pages so I can parse it to find the data I need to validate my testing.

A former co-worker used COM in a VBScript:

Code:

Set objWindowsShell = CreateObject("Shell.Application")
For varObjectIndex = 0 To objWindowsShell.Windows.Count - 1
    If objWindowsShell.Windows(varObjectIndex).HWND = varHwnd Then
        Set objIe = objWindowsShell.Windows(varObjectIndex)
        Exit For
    End If
Next
If varFrameContext = "" Then
    Set objDocument = objIe.Document
Else
    Set objDocument = objIe.Document.Frames(varFrameContext).Document
End If
Set objTables = objDocument.All.Tags("TABLE")
Set objTable = objTables.Item(varItemId)

If Not objTable.Rows.Length < varRow Then
    If Not objTable.Rows(varRowIndex).Cells.Length < varCol Then
        Select Case varPropertyName
            Case "innerText"
                varHtmlTableCell = Trim(objTable.Rows(varRowIndex).Cells(varColIndex).innerText)
            Case "Image.Name"
                varHtmlTableCell = objTable.Rows(varRowIndex).Cells(varColIndex).Images(0).name
        End Select
    Else
        If IsNumeric(varItemId) Then varItemId = varItemId + 1
        varErrorDetail = """" & "Type=HTMLTable;Index=" & varItemId & """" & ", " & """" & "Row=" & varRow & ";Col=" & varCol & """" & vbCrLf & "Col not found."
        If Not ErrorMessagePersist_True(ErrorMessage, constrProcedureName, mconlngAutomationError, varErrorDetail) Then Exit Function
        Exit Function
    End If
Else
    If IsNumeric(varItemId) Then varItemId = varItemId + 1
    varErrorDetail = """" & "Type=HTMLTable;Index=" & varItemId & """" & ", " & """" & "Row=" & varRow & ";Col=" & varCol & """" & vbCrLf & "Row not found."
    If Not ErrorMessagePersist_True(ErrorMessage, constrProcedureName, mconlngAutomationError, varErrorDetail) Then Exit Function
    Exit Function
End If

Set objArguments = Nothing
Set objWindowsShell = Nothing
Set objIe = Nothing
Set objDocument = Nothing
Set objTables = Nothing



I had a really tough time finding out the properties and methods of the objects in use. Can someone tell me if the above code works for what I need, or tell me the best way to parse HTML using KiXtart? Thanks!


Lonkero Administrator
(KiX Master Guru)
2006-05-23 08:44 PM
Re: Web scraping in KiX

what do you really want?
that code above is not the right way if you want to get the whole of the html of a page.
for that, a simple 3 line kixtart script will do just fine.


pearly
(Getting the hang of it)
2006-05-23 09:01 PM
Re: Web scraping in KiX

Can you show me the three lines of code?


Les
(KiX Master)
2006-05-23 09:09 PM
Re: Web scraping in KiX

$rc=SetOption('WrapAtEOL','on')
$http=CreateObject("microsoft.xmlhttp")
$http.Open("GET","http://www.kixtart.org/",Not 1)
$http.send
$value=$http.responsebody
$value ?


Lonkero Administrator
(KiX Master Guru)
2006-05-23 09:10 PM
Re: Web scraping in KiX

sorry, it takes four:
Code:

$http=CreateObject("microsoft.xmlhttp")
$http.Open("GET","http://www.kixtart.org/",Not 1)
$http.send
$http.responsebody ?



thanks les.


pearly
(Getting the hang of it)
2006-05-23 09:15 PM
Re: Web scraping in KiX

Thanks guys. How do you set the proxy configuration?

Lonkero Administrator
(KiX Master Guru)
2006-05-23 09:21 PM
Re: Web scraping in KiX

it was a single registry value...
just can't remember which one.


pearly
(Getting the hang of it)
2006-05-23 09:24 PM
Re: Web scraping in KiX

Oh, so it can't be done by setting a property on the xmlhttp object?


Lonkero Administrator
(KiX Master Guru)
2006-05-23 09:32 PM
Re: Web scraping in KiX

no.
but you can avoid the registry value by always pulling a different url.
that is, add something extra to the end, like "?some=fake&values=here"
so the above example becomes:
Code:

$http=CreateObject("microsoft.xmlhttp")
$http.Open("GET","http://www.kixtart.org/?some=fake&values=here",Not 1)
$http.send
$http.responsebody ?



anyways, the registry setting is the best choice.
it's a per user setting and you can always reset it once you don't need it anymore.


pearly
(Getting the hang of it)
2006-05-23 10:17 PM
Re: Web scraping in KiX

I tried entering in fake values, but it didn't work. I think I may have a special case. The url contains a dll reference.

http://[ipaddress]/[navigation]/[dllname]?Login

ex: http://10.10.10.1/kix/kix64.dll?Login


Lonkero Administrator
(KiX Master Guru)
2006-05-23 10:40 PM
Re: Web scraping in KiX

think you best go with registry fix.
what the fake does is force the update of the page.
it does not override the proxy settings.
you can disable the proxy for the duration of the script and get rid of the issue.
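
One detail that may matter for the dll URL: it already contains a query string (?Login), so a second "?" would not start a new parameter. Extra values are normally joined with "&" instead. A minimal sketch of that, assuming the server simply ignores parameters it does not know (the AddCacheBuster function and the nocache parameter name are made up for illustration):

```kixtart
; Hypothetical helper: append a throwaway parameter so every request uses a unique URL
Function AddCacheBuster($URL)
    Dim $Sep
    $Sep = "?"
    If InStr($URL, "?")
        $Sep = "&"    ; the URL already has a query string, e.g. kix64.dll?Login
    EndIf
    ; @TICKS changes from call to call, which keeps the URL unique
    $AddCacheBuster = $URL + $Sep + "nocache=" + @TICKS
EndFunction

AddCacheBuster("http://10.10.10.1/kix/kix64.dll?Login") ?
```

As noted above, this only forces a fresh copy of the page. It does nothing about the proxy.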


pearly
(Getting the hang of it)
2006-05-23 10:45 PM
Re: Web scraping in KiX

Can you give me an idea of how to go about writing the registry fix?

Thanks for helping!


Lonkero Administrator
(KiX Master Guru)
2006-05-23 11:19 PM
Re: Web scraping in KiX

k, had to crawl a huge amount of historical pages, back to some year 2003 but here is what I found:
Code:

$xK="HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings"
$cache=ReadValue($xK,"SyncMode5")
$=WriteValue($xK,"SyncMode5","3","reg_dword")



not sure what the values are for disabling proxy and crap but guess you can google with the syncmode5 keyword.


pearly
(Getting the hang of it)
2006-05-24 01:57 AM
Re: Web scraping in KiX

does xmlhttp object use the proxy settings found in Internet Explorer Internet Settings? either way, i tried a test w/ the registry hack and w/o, but $http.responsebody is still returning null value.

Lonkero Administrator
(KiX Master Guru)
2006-05-24 07:13 AM
Re: Web scraping in KiX

yes, xmlhttp is part of IE.
and what registry value did you use?
I know, the above setting does not bypass the proxy, it just forces a check of new data every time.
think I said, you can go googling for the correct value yourself.


pearly
(Getting the hang of it)
2006-05-25 09:24 PM
Re: Web scraping in KiX

According to the Scripting Guy, SyncMode5 can be set to one of four possible values:

Setting                                   Value
Every visit to the page                   3
Every time you start Internet Explorer    2
Automatically                             4
Never                                     0

I've tried all values, but none of them work. Here is my code :

Code:

Break On
GetPage("http://10.10.10.1/kix/kix64.dll?Login") ?
Sleep 10

Function GetPage($URL)
    Dim $HTML, $IECacheKey, $IECacheVal, $nul
    $IECacheKey = "HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings"
    ; remember the current cache setting so it can be restored afterwards
    $IECacheVal = ReadValue($IECacheKey, "SyncMode5")
    $IECacheVal ?
    If $IECacheVal <> 3
        $nul = WriteValue($IECacheKey, "SyncMode5", "3", "REG_DWORD")
    EndIf
    $HTML = CreateObject("microsoft.XMLhttp")
    $HTML.Open("GET", $URL, Not 1) ; Not 1 = 0, i.e. a synchronous request
    $HTML.Send
    ; restore the original setting before the early exit below can skip it
    $nul = WriteValue($IECacheKey, "SyncMode5", $IECacheVal, "REG_DWORD")
    If $HTML.Status = 200
        $GetPage = $HTML.ResponseText ;or ResponseBody
    Else
        $GetPage = "HTTP Status Code: " + $HTML.Status + " (" + $HTML.StatusText + ")"
        Exit 1
    EndIf
EndFunction



Lonkero Administrator
(KiX Master Guru)
2006-05-25 10:05 PM
Re: Web scraping in KiX

ok, let me ask, what is this dll?
if you have the dll custom made there, why you need to pull out the html it produces?
why can't you directly give it the info it wants or why don't you ask it directly?


pearly
(Getting the hang of it)
2006-05-25 10:34 PM
Re: Web scraping in KiX

Hmmm, good question. Unfortunately I have no idea what's inside the dll. I'm new in the QA team. The testing tool we use can capture the HTML content, but I need KiX or some other third-party tool to do it, so I can run it w/o the need to install the testing tool and parse the HTML content to pull the version spec.

Is there any way to programmatically mimic the feature for viewing the source in IE (View > Source)?


Lonkero Administrator
(KiX Master Guru)
2006-05-26 10:03 AM
Re: Web scraping in KiX

the responsebody is exactly it.
but thought your problem was the proxy?
if you disable the proxy in ie and try your script, does it still fail to get the source?


pearly
(Getting the hang of it)
2006-05-26 11:33 PM
Re: Web scraping in KiX

i currently have a proxy set for internet access. it looks like the website i want to scrape is an intranet that i can access w/o the proxy setup. so this is a different issue?

i've tried disabling the proxy and tried all SyncMode5 values, but still i'm getting nothing back.

anything special i need to do for intranet sites that use dll?


Lonkero Administrator
(KiX Master Guru)
2006-05-27 01:23 AM
Re: Web scraping in KiX

add some error output:
Code:

$HTML = CreateObject("microsoft.XMLhttp")
if @error exit @error endif
$HTML.Open("GET", $URL, Not 1)
if @error exit @error endif
$HTML.Send
if @error exit @error endif
If $HTML.Status = 200
$GetPage = $HTML.Responsebody
if @error exit @error endif




and what the error is:
Code:

GetPage("http://10.10.10.1/kix/kix64.dll?Login") ? ?
if @error
    "error occurred." ?
    "error: " @error ?
    "descr: " @serror ?
else
    "exited without errors." ?
endif



pearly
(Getting the hang of it)
2006-05-27 02:16 AM
Re: Web scraping in KiX

Unfortunately, console returned "exited without errors."



Lonkero Administrator
(KiX Master Guru)
2006-05-27 02:31 AM
Re: Web scraping in KiX

k, download wget and try to pull the same addy with verbose output.
what does it give you?


pearly
(Getting the hang of it)
2006-05-27 03:47 AM
Re: Web scraping in KiX

excellent! it returned the html text to a file.

what does this do that COM doesn't?


NTDOC Administrator
(KiX Master)
2006-05-27 03:53 AM
Re: Web scraping in KiX

It makes COM calls and API calls too, it's just more purpose built than the exposed elements of the COM object you're using here via KiXtart.

pearly
(Getting the hang of it)
2006-05-27 04:03 AM
Re: Web scraping in KiX

so I'm guessing there is no solution in KiXtart? I'm satisfied with wget, but if there is a solution in KiXtart, I'd be more than happy to hear it. Thanks!

NTDOC Administrator
(KiX Master)
2006-05-27 04:12 AM
Re: Web scraping in KiX

Well we don't work there where you do, so we don't know the inner workings of your network and what all might be causing an issue.

There are some things that KiXtart just can't natively do and socket manipulation is one of them.

If WGET is doing what you want, then nothing wrong with it until such time as maybe KiXtart does support it.

I think KiXforms has some support but how much I'm not sure.


Lonkero Administrator
(KiX Master Guru)
2006-05-27 04:49 PM
Re: Web scraping in KiX

why I suggested wget is that if you pull a page with it in verbose mode, you will see if the server forces a page redirect or otherwise some odd behavior.

oh, and an xp security fix or your antivirus can each block the script as a virus.


Lonkero Administrator
(KiX Master Guru)
2006-05-27 06:01 PM
Re: Web scraping in KiX

oh, and did you try only one of responsebody and responsetext?
or both and neither returned anything?

don't remember what the one was with which you can return the response headers.
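
The XMLHTTP method that returns all response headers at once is getAllResponseHeaders (getResponseHeader returns a single named one). A minimal sketch that prints the status, the headers and the body length, which should at least show whether the server answered at all. The URL is the example address from earlier in the thread:

```kixtart
$http = CreateObject("Microsoft.XMLHTTP")
If @ERROR
    "CreateObject failed: " + @SERROR ?
    Exit @ERROR
EndIf
$http.Open("GET", "http://10.10.10.1/kix/kix64.dll?Login", Not 1)   ; Not 1 = synchronous
$http.Send
"Status : " + $http.Status + " " + $http.StatusText ?
"Headers:" ?
$http.getAllResponseHeaders ?
"Body   : " + Len($http.ResponseText) + " characters" ?
```

A 200 status with a body length of 0 would point at the server or something in between, rather than at the script.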


pearly
(Getting the hang of it)
2006-05-30 11:26 PM
Re: Web scraping in KiX

I tried both and neither returned anything. Will I be notified if XP or anti-virus blocked a script?

Lonkero Administrator
(KiX Master Guru)
2006-05-31 07:56 AM
Re: Web scraping in KiX

usually not.
xp and some antiviruses may block stuff silently as being part of their ultimate security feature.


NTDOC Administrator
(KiX Master)
2006-05-31 11:03 AM
Re: Web scraping in KiX

Yep, rarely will a product accurately notify you why you can't remotely access it.

It could be many things. On some of my systems with the Symantec VPN installed I can't remotely connect unless the driver is disabled. It just tells me the "Network path can not be found"

Most firewalls are also designed not to respond, on purpose, so that the querying system won't send new or varied queries trying to find something open, which is the right thing for them to do.


pearly
(Getting the hang of it)
2006-05-31 06:55 PM
Re: Web scraping in KiX

*sigh* how frustrating. alas, i will use wget to extract the page. thanks for your help guys!
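
For reference, a minimal sketch of driving wget from KiXtart and scanning the saved page. It assumes wget.exe is on the PATH and that the lines of interest contain the word "version". That marker is a guess, since the thread does not show the page itself:

```kixtart
$Url  = "http://10.10.10.1/kix/kix64.dll?Login"
$File = "%TEMP%\login.html"

; let wget fetch the page into a file; Shell waits until wget has finished
Shell '%COMSPEC% /c wget --quiet -O "' + $File + '" "' + $Url + '"'
If @ERROR
    "wget failed with exit code " + @ERROR ?
    Exit @ERROR
EndIf

; read the saved page line by line and print the lines that mention the marker
If Open(1, $File) = 0
    $Line = ReadLine(1)
    While @ERROR = 0
        If InStr($Line, "version")
            $Line ?
        EndIf
        $Line = ReadLine(1)
    Loop
    $nul = Close(1)
EndIf
```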

Lonkero Administrator
(KiX Master Guru)
2006-06-01 06:04 PM
Re: Web scraping in KiX

hey, pearly...
thought I asked it already, but...

does wget verbose mode give you some insight what is happening?
is there some redirection or something going on?


pearly
(Getting the hang of it)
2006-06-01 07:15 PM
Re: Web scraping in KiX

This is what it says :

Code:
--10:13:08--  http://11.111.111.11/testing/xyz.dll?login
=> `xyz.dll@login.1'
Connecting to 11.111.111.11:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified

[ <=>

10:13:08 (58.24 KB/s) - `xyz.dll@login.1' saved [27169]



Lonkero Administrator
(KiX Master Guru)
2006-06-02 05:40 PM
Re: Web scraping in KiX

what does your command line for wget look like?
the interesting part on the reply is:
"=> `xyz.dll@login.1'"


pearly
(Getting the hang of it)
2006-06-02 06:20 PM
Re: Web scraping in KiX

it's nothing unique.

wget --verbose http://11.111.111.11/testing/xyz.dll?login


Lonkero Administrator
(KiX Master Guru)
2006-06-02 07:33 PM
Re: Web scraping in KiX

so, do you get something with:
http://11.111.111.11/testing/xyz.dll@login.1


pearly
(Getting the hang of it)
2006-06-02 07:45 PM
Re: Web scraping in KiX

'xyz.dll@login.1' is saved to the same location as the wget program and in it, is the full source code (of the login page). i tried

wget --verbose http://11.111.111.11/testing/xyz.dll@login.1

but it gives me the following :

Code:

--10:37:50-- http://11.111.111.11/testing/xyz.dll@login.1
=> `xyz.dll@login.1.1'
Connecting to 11.111.111.11:80... connected.
HTTP request sent, awaiting response... 404 Object Not Found
10:37:50 ERROR 404: Object Not Found.



also i couldn't get past the login screen with this program. i'm not sure how to enter in the login information so i can access the other pages.
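
If the login is an ordinary HTML form that sets a session cookie (an assumption, since nothing here says how xyz.dll authenticates), wget can post the form fields and keep the cookie for later requests. A sketch driven from KiXtart; the field names user and pass, their values, and the second URL are placeholders to be replaced with whatever the real form uses:

```kixtart
$Login = "http://11.111.111.11/testing/xyz.dll?login"
$Page  = "http://11.111.111.11/testing/xyz.dll?login"   ; placeholder: replace with a page behind the login

; step 1: post the form fields (hypothetical names) and save the session cookie
Shell '%COMSPEC% /c wget --save-cookies cookies.txt --keep-session-cookies --post-data "user=NAME&pass=SECRET" -O login.html "' + $Login + '"'

; step 2: send the saved cookie along when requesting a later page
Shell '%COMSPEC% /c wget --load-cookies cookies.txt -O page.html "' + $Page + '"'
```

If the site uses HTTP authentication instead of a form, wget's --http-user and --http-password options would be the ones to try.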


Lonkero Administrator
(KiX Master Guru)
2006-06-02 11:05 PM
Re: Web scraping in KiX

ja, I think I asked at the very beginning of this thread what you really want to achieve.
getting the html does not help if you in fact want to pass information.

anyway, logically, if that addy didn't work, did you try the obvious:
http://11.111.111.11/testing/xyz.dll@login


pearly
(Getting the hang of it)
2006-06-03 12:49 AM
Re: Web scraping in KiX

I get the same result as before: ERROR 404.

Well, the login page was the main purpose of this project, but I thought wget could be leveraged for other uses. I guess I'll have to find some other way for secured pages.


Lonkero Administrator
(KiX Master Guru)
2006-06-03 12:13 PM
Re: Web scraping in KiX

hmm...
not sure you know still what you really want. or I don't know all the details still.

what does it help to get the html of a login page?
you get it once, don't you know it after that?


Les
(KiX Master)
2006-06-03 03:31 PM
Re: Web scraping in KiX

The way Pearly keeps dancing around the truth, I think there is some malfeasance here. It looks to me that he is trying to steal passwords off the logon page.

Lonkero Administrator
(KiX Master Guru)
2006-06-03 07:32 PM
Re: Web scraping in KiX

could be, could be.
I'm still trusting and think that he is looking at the problem at hand too much piece by piece and not seeing the bigger picture and the steps needed.


pearly
(Getting the hang of it)
2006-06-05 07:08 PM
Re: Web scraping in KiX

Quote:

The way Pearly keeps dancing around the truth, I think there is some malfeasance here. It looks to me that he is trying to steal passwords off the logon page.




Hehe, unfortunately this is not the case. As said before, this is for my job as a QA Engineer. We parse HTML all the time, but the methods we use aren't the most efficient and efficacious. The reason for pulling the source from the login page is to retrieve the dll versions posted on the page. Developers display the versions and I need this for my testing validation. With the code I posted in my first post of this thread, I was able to parse HTML Tables once past the login page, but it's only working in VBA/VBS (see my other thread : http://www.kixtart.org/ubbthreads/showflat.php?Cat=0&Number=162816&an=0&page=0#162816)

So all in all, I figured out how to pull data from HTML Tables and can access the dll versions. I can live with this for now. I'm not sure whether wget or KiX COM is capable of accessing secured pages, or possibly of hooking onto an existing browser that is already past the login page?

I hope that answers what I'm trying to do.


Lonkero Administrator
(KiX Master Guru)
2006-06-05 07:27 PM
Re: Web scraping in KiX

yes, and yes.
you can use things like internetexplorer.application to get 100% IE compatibility.


pearly
(Getting the hang of it)
2006-06-05 07:32 PM
Re: Web scraping in KiX

I know how to hook onto an existing browser, but then what do you do to access the source?

Lonkero Administrator
(KiX Master Guru)
2006-06-05 08:55 PM
Re: Web scraping in KiX

can't really access the source.
you need to access the elements and their data.
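
What IE exposes for an already open page is the parsed document rather than the original bytes. The closest thing to View > Source is probably the outerHTML of the document element, which is the markup as IE has rebuilt it and may differ from what the server sent. A minimal sketch, assuming at least one IE window is open on an http page:

```kixtart
$Shell = CreateObject("Shell.Application")
For Each $Win In $Shell.Windows
    ; Explorer folder windows are in the same collection, so only take windows showing an http(s) page
    If InStr($Win.LocationURL, "http") = 1
        $Doc = $Win.Document
        "Title: " + $Doc.Title ?
        $Doc.documentElement.outerHTML ?
    EndIf
Next
```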


pearly
(Getting the hang of it)
2006-06-05 09:58 PM
Re: Web scraping in KiX

hmm ok. I've converted the code in this thread : http://www.kixtart.org/ubbthreads/showflat.php?Cat=0&Number=162816&an=0&page=0#162816 to KiX. Any suggestions on how to easily identify parent/children relationships between objects, collections, and such?

Lonkero Administrator
(KiX Master Guru)
2006-06-05 10:08 PM
Re: Web scraping in KiX

children you get via .items collection.
don't remember whether you can access the parent as simply as by .parent
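
In the IE DOM the relevant properties are, as far as I know, children (a collection of the direct child elements), parentElement (the containing element) and tagName (the element type). A minimal sketch that prints them for the first table of a page. It assumes window 0 in Shell.Windows is an IE window whose page has at least one table, which will not always be true:

```kixtart
$Shell = CreateObject("Shell.Application")
$Doc   = $Shell.Windows.Item(0).Document      ; assumption: window 0 is the IE window of interest
$Table = $Doc.All.Tags("TABLE").Item(0)       ; first table on the page

"table sits inside: " + $Table.parentElement.tagName ?
"direct children  : " + $Table.children.length ?
For Each $Child In $Table.children
    "  " + $Child.tagName ?
Next
```

Repeating the same three properties on each child is one way to map out an unfamiliar page step by step.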


pearly
(Getting the hang of it)
2006-06-05 10:35 PM
Re: Web scraping in KiX

take this example for instance :

$objWinShell = CreateObject("Shell.Application")

$obj = $objWinShell.Windows.Item(1).Document.All.Tags("TABLE").Item(6).Rows(0).Cells(0).InnerText

How do I know the relationship between all these objects/collections/properties? Is there a property that lists all objects/collections/properties for that object?


pearly
(Getting the hang of it)
2006-06-06 12:39 AM
Re: Web scraping in KiX

DOM questions are probably not in scope for this forum. I will seek my answers on my own. I appreciate all your help, Jooel and NTDOC.