PerlMonks  

I need to automate/scrape data from IE

by CarlosT (Initiate)
on Dec 07, 2011 at 00:37 UTC (#942141=perlquestion)
CarlosT has asked for the wisdom of the Perl Monks concerning the following question:

I've got a task that is just screaming for automation. Every week, I have to get a number for each of 36 entities for some metrics I do, and that basically consists of counting the 'Y's in a certain column in a table on a company web page. Each entity requires picking a value in a dropdown, refreshing the page, and counting 'Y's. It's a slow, cumbersome, tedious, and error-prone process. What I'd love is to point Perl at the site and get back the numbers quickly and cleanly.

Here's what I do know (I don't know what matters):
  • The site uses Kerberos for authentication
  • The site uses SSL
  • The page only works reliably in Internet Explorer
I have no previous experience with web automation, so I'm flying fairly blind. I tried using LWP, but couldn't connect because of SSL issues. I then gave up on Perl for a while and tried using Greasemonkey, but that was when I discovered that the page didn't actually work with Firefox. So most recently I've been trying to use Win32-IEAutomation, but haven't been able to get that off the ground either. This is what I currently have:
    #!/usr/local/bin/perl
    use Win32::IEAutomation;

    # Create new instance of IE
    my $ie = Win32::IEAutomation->new( visible => 1, maximize => 1 );

    my $url = 'https://internal.site.of.doom/';
    $ie->gotoURL($url);

That gets me a blank IE window and an error message reading "Could not start AutoItX3 Control through OLE".

Anyone have any ideas?

Thanks,

Carlos

Re: I need to automate/scrape data from IE
by grantm (Parson) on Dec 07, 2011 at 01:31 UTC

    If the page only works with IE then there's a chance that it uses ActiveX - the core of the HTML page would be an <object> tag with a bunch of ugly parameters. If that is what you're getting then one or more of the parameters might be URLs that you could try accessing directly. But if it does use ActiveX and you can't access the data URLs directly then you're pretty much screwed.

    Is this for your TPS reports?
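
One way to run the `<object>` check described above is to save the page source from IE (View Source) and list any parameters that look like URLs. This is only a rough sketch: the function name `param_urls` is made up for illustration, and the regex assumes quoted attributes with `name` appearing before `value`. A real HTML parser would be more robust.

```perl
use strict;
use warnings;

# Return [name, value] pairs for every <param> tag whose value looks like
# a URL or a server-relative path. Regex matching is a simplification that
# assumes quoted attributes, with name written before value.
sub param_urls {
    my ($html) = @_;
    my @found;
    while ($html =~ /<param\b[^>]*?\bname\s*=\s*(["'])(.*?)\1[^>]*?\bvalue\s*=\s*(["'])(.*?)\3/gis) {
        my ($name, $value) = ($2, $4);
        push @found, [$name, $value] if $value =~ m{^(?:https?:)?/};
    }
    return @found;
}

# Usage: perl param_urls.pl saved_page.html
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!";
    my $html = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    printf "%s => %s\n", @$_ for param_urls($html);
}
```

If this prints nothing, the page may not be embedding its data through `<param>` URLs at all, in which case driving IE itself is probably the remaining option.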

      I have launched IE with a URL using IEAutomation. Now I need to navigate to a text box. I am using the getTextBox method but get an error saying that no text box is present with the specified option name (I can also see that focus stays in the cmd prompt and doesn't move to IE). Does anyone have any idea about this?
Re: I need to automate/scrape data from IE
by Anonymous Monk on Dec 07, 2011 at 02:35 UTC
Re: I need to automate/scrape data from IE
by JavaFan (Canon) on Dec 07, 2011 at 07:20 UTC
    I tried using LWP, but couldn't connect because of SSL issues.
    Can you be a bit more specific? LWP ought to be able to process https requests.

    Instead of screen scraping, you could also try to find out where the page gets its data from, and just go straight to the source.

      That would be my first choice as well, but I don't have access to that.
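
To get the "more specific" answer asked for above, a minimal sketch like the following prints what LWP actually reports for the URL. It assumes LWP::UserAgent and LWP::Protocol::https are installed. The helper name `diagnose` is made up for illustration. The Kerberos hint relies on the server answering 401 with a `WWW-Authenticate: Negotiate` header, which may or may not be how this site behaves.

```perl
use strict;
use warnings;

# Turn the interesting parts of an LWP response into a one-line hint.
# LWP marks responses it generated itself (connection or TLS failures)
# with the header "Client-Warning: Internal response".
sub diagnose {
    my ($status_line, $www_auth, $client_warning) = @_;
    return 'LWP never reached the server (TLS or network problem): ' . $status_line
        if defined $client_warning && $client_warning eq 'Internal response';
    return 'server wants Kerberos/SPNEGO (Negotiate) authentication'
        if $status_line =~ /^401\b/ && defined $www_auth && $www_auth =~ /\bNegotiate\b/i;
    return 'request went through: ' . $status_line;
}

# Usage: perl diagnose.pl https://internal.site.of.doom/
if (@ARGV) {
    require LWP::UserAgent;
    my $ua  = LWP::UserAgent->new( timeout => 30 );
    my $res = $ua->get( $ARGV[0] );
    print diagnose(
        $res->status_line,
        scalar $res->header('WWW-Authenticate'),
        scalar $res->header('Client-Warning'),
    ), "\n";
}
```

If the first hint appears, the status line usually names the TLS problem (for example a certificate that cannot be verified). If the second appears, the SSL layer is fine and the remaining obstacle is authentication; the CPAN module LWP::Authen::Negotiate may be worth a look in that case.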
Re: I need to automate/scrape data from IE
by hawtin (Prior) on Dec 07, 2011 at 09:00 UTC

    The message you are getting back suggests that just using OLE won't work; however, it is worth trying the simplest approach (just to prove that it won't do it).

        use strict;
        use Win32::OLE;

        my $ie = Win32::OLE->new( 'InternetExplorer.Application' )
            or die "error starting IE";
        $ie->{visible} = 1;
        $ie->navigate( 'https://internal.site.of.doom/' );
        sleep(4);

        if(!defined $ie->Document()) {
            print STDERR "Nope that failed as well\n";
        } else {
            print "We have something back!\n";
        }
      This code worked. It opened a browser window to the correct url. Can I do what I need to do just by using OLE?
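
Since plain Win32::OLE can open the page, the rest of the task can probably be driven the same way through IE's document object. The following is a sketch under several assumptions, not a tested solution. The element ids `entity` and `results`, the column index 3, and the idea that the page reloads when the dropdown's `onchange` event fires are all hypothetical and need to be replaced with whatever the real page uses. The entity values are passed on the command line.

```perl
use strict;
use warnings;

# Count rows whose cell in column $col (0-based) is exactly 'Y',
# ignoring surrounding whitespace.
sub count_y {
    my ($rows, $col) = @_;
    return scalar grep { defined $_->[$col] && $_->[$col] =~ /^\s*Y\s*$/ } @$rows;
}

# Wait until IE reports the page is fully loaded (READYSTATE_COMPLETE is 4).
sub wait_for_page {
    my ($ie) = @_;
    sleep 1 while $ie->{Busy} || $ie->{ReadyState} != 4;
}

# Copy an HTML table's cell text into a Perl array of rows.
sub table_rows {
    my ($table) = @_;
    my @rows;
    my $trs = $table->{rows};
    for my $i ( 0 .. $trs->{length} - 1 ) {
        my $cells = $trs->item($i)->{cells};
        push @rows, [ map { $cells->item($_)->{innerText} } 0 .. $cells->{length} - 1 ];
    }
    return \@rows;
}

sub run {
    my @entities = @_;
    require Win32::OLE;
    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die "error starting IE: " . Win32::OLE->LastError();
    $ie->{Visible} = 1;
    $ie->Navigate('https://internal.site.of.doom/');
    wait_for_page($ie);

    for my $entity (@entities) {
        my $select = $ie->{Document}->getElementById('entity');   # hypothetical id
        $select->{value} = $entity;
        $select->FireEvent('onchange');   # assumes the page reloads on change
        sleep 1;                          # give IE a moment to start reloading
        wait_for_page($ie);

        my $table = $ie->{Document}->getElementById('results');   # hypothetical id
        printf "%s: %d\n", $entity, count_y( table_rows($table), 3 );  # hypothetical column
    }
    $ie->Quit;
}

# Usage (Windows only): perl count_y.pl ENTITY1 ENTITY2 ...
run(@ARGV) if $^O eq 'MSWin32' && @ARGV;
```

Because IE runs under the logged-in Windows account, the Kerberos login and SSL are handled by the browser, which is the main reason to prefer this route over LWP here. If the dropdown needs a form submit or a button click instead of `onchange`, only the two lines that set the value and fire the event should need to change.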
Re: I need to automate/scrape data from IE
by Corion (Pope) on Dec 07, 2011 at 09:21 UTC
Re: I need to automate/scrape data from IE
by patcat88 (Deacon) on Dec 07, 2011 at 10:22 UTC
    Wireshark then LWP? I know SSL is a pain. There are ways to make SSL systems use "your key"/your cert instead of a random key to talk to the server, and then you can decrypt the captured traffic.
