PerlMonks  

On being an IE browser, revisited

by rkg (Hermit)
on Feb 28, 2004 at 21:38 UTC ( [id://332534]=perlquestion )

rkg has asked for the wisdom of the Perl Monks concerning the following question:

Earlier I discussed the pros and cons of OLE + IE to do web things, vs. Mechanize or LWP.

Screen scraping is horrible, and clearly not the Right Way to access web resources. But sometimes screen scraping is necessary to get things done.

For serious screen scraping (yuck), I'm more convinced OLE + IE is the way to go. Often the pages I need to access involve Javascript, pop-up confirmation menus, file selection windows, and file download windows.
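For reference, a minimal sketch of the basic OLE + IE approach, assuming a Win32 Perl with Win32::OLE installed. The URL and the helper names (`ie_is_ready`, `fetch_text_with_ie`) are made up for illustration, and this covers only plain page loads, not the pop-up or file dialogs mentioned above.

```perl
use strict;
use warnings;

# READYSTATE_COMPLETE in IE's automation model.
use constant READYSTATE_COMPLETE => 4;

# True once the browser object reports that the page has finished loading.
# Win32::OLE exposes properties through hash syntax, so any hash-like
# object with Busy and ReadyState keys works here.
sub ie_is_ready {
    my ($ie) = @_;
    return !$ie->{Busy} && $ie->{ReadyState} == READYSTATE_COMPLETE;
}

# Load a page in IE and return the text of its body.
sub fetch_text_with_ie {
    my ($url) = @_;
    require Win32::OLE;    # only available on Win32
    # 'Quit' is called on the IE object when $ie goes out of scope.
    my $ie = Win32::OLE->new( 'InternetExplorer.Application', 'Quit' )
        or die 'Cannot start IE: ' . Win32::OLE->LastError();
    $ie->{Visible} = 1;
    $ie->Navigate($url);
    sleep 1 until ie_is_ready($ie);
    return $ie->{Document}{body}{innerText};
}

# Hypothetical URL, for illustration only; skipped on non-Windows systems.
print fetch_text_with_ie('http://www.example.com/') if $^O eq 'MSWin32';
```

Polling `Busy` and `ReadyState` is the simplest way to wait for a page; an event-driven version is possible but longer.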

To my understanding, LWP and its derivatives can't handle this sort of complexity. And to my understanding, samie doesn't yet handle popup dialog screens, upload screens, or file downloads.

In my opinion, there's a need for a solid Win32 app to drive IE in a serious way, to allow access to complex web environments involving non-vanilla pages. I asked a few months ago, but I'll toss out the question again:

Does anyone know of a robust full-featured module for driving IE?

Thanks for suggestions --

rkg

Replies are listed 'Best First'.
Re: On being an IE browser, revisited
by PodMaster (Abbot) on Feb 29, 2004 at 16:12 UTC
    Yes, it's hard, but you can "scrape" any website with Mechanize (LWP) as long as you're smart enough (besides knowing the Perl APIs, you have to know how to learn what the browser is doing and how it's talking to the server, then emulate it; this involves knowing everything from HTTP to CGI to HTML to JavaScript).
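    As a small illustration of "emulate what the browser does": a JavaScript confirm() dialog never reaches the server. The server only sees whatever request the browser sends once the user clicks OK, so a script can build that request directly. The sketch below assumes HTTP::Request::Common (part of the LWP family) is installed; the URL, the field names, and the helper name `confirmed_delete_request` are all hypothetical.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build the form POST a browser would send after the user clicks OK
# in a confirmation dialog. URL and field names are hypothetical.
sub confirmed_delete_request {
    my ($id) = @_;
    return POST(
        'http://www.example.com/delete',
        [ id => $id, confirmed => 'yes' ],
    );
}

# Print the request instead of sending it, so nothing touches the network.
my $req = confirmed_delete_request(42);
print $req->as_string;
```

    Since WWW::Mechanize is a subclass of LWP::UserAgent, a request built this way could be sent with `$mech->request($req)`. Working out which fields the real page sends is the hard part.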

    In my opinion, there's a need for a solid Win32 app to drive IE in a serious way...
    I kind of think Mozilla makes a stronger candidate to be the browser used "to allow access to complex web environments involving non-vanilla pages" because besides being open source and well documented, it's also cross platform. Mozilla already makes it easy to scrape websites with the livehttpheaders extension, and I'm sure some people have already automated Mozilla (so it appears). Hey, maybe you could write a plugin like livehttpheaders to talk with a perl program?

    Currently a bunch of wxWidgets (formerly wxWindows) programmers have created wxMozilla, a component for embedding Mozilla into any wxWindows application. A whole bunch of wxPerl programmers are interested in seeing this ported to wxPerl; it just might happen one day :)

    Which brings me to what already exists, which is Wx::ActiveX::IE - ActiveX interface for Internet Explorer. Sound useful? I'd say yes. Slap together a little wxPerl program (perhaps using wxGlade), add some event handlers and you're scraping :)

    I asked a few months ago, but I'll toss out the question again
    If you can't find it on CPAN/Sourceforge/Freshmeat, or find it using GOOGLE, chances are it doesn't exist. I'll bet unless someone really interested (like you) takes up the task (hint hint), a few months from now it still won't exist.

    Which brings me back to WWW::Mechanize. Take a look at "(Javascript::SpiderMonkey) Re: Passing other variables to start handler in HTML::Parser" and "Re: Perl/Tk and Javascript". All JavaScript::SpiderMonkey needs is a Document Object Model. As soon as somebody writes one (hint), I'm sure somebody will marry it to WWW::Mechanize somehow :)

    I hope that's enough ideas :)

    update: I just ran across the PerlXPCOM idea, but there doesn't appear to be any code, darn :(

    Perl & Mozilla PerlXPCOM Perl scripting <script language="perl">print "Yay!" if /\w+/; </script>

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

Re: On being an IE browser, revisited
by grinder (Bishop) on Feb 28, 2004 at 22:08 UTC
    Often the pages I need to access involve Javascript, pop-up confirmation menus, file selection windows, and file download windows.

    Have you looked at WWW::Mechanize? If you need to understand the finer points of Javascript/frame/cookie/et cetera interaction, have you looked at HTTP::Proxy?
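    A rough sketch of the HTTP::Proxy idea: run a small logging proxy, point the browser at it, and watch which requests the page's Javascript and forms actually generate. This assumes HTTP::Proxy and HTTP::Proxy::HeaderFilter::simple are installed. The port number, the RUN_PROXY environment switch, and the helper names are arbitrary choices for illustration.

```perl
use strict;
use warnings;

# One log line per request: method and URL.
sub describe_request {
    my ($request) = @_;
    return join ' ', $request->method, $request->uri;
}

# Start a proxy that logs every request it forwards.
sub run_logging_proxy {
    my ($port) = @_;
    require HTTP::Proxy;
    require HTTP::Proxy::HeaderFilter::simple;

    my $proxy = HTTP::Proxy->new( port => $port );
    $proxy->push_filter(
        request => HTTP::Proxy::HeaderFilter::simple->new(
            sub {
                # Header filters receive the filter, the headers,
                # and the full HTTP message.
                my ( $self, $headers, $message ) = @_;
                print STDERR describe_request($message), "\n";
            }
        ),
    );
    $proxy->start;    # blocks while serving requests
}

# Port 3128 is arbitrary; set RUN_PROXY=1 to actually start the proxy.
run_logging_proxy(3128) if $ENV{RUN_PROXY};
```

    With the browser configured to use localhost:3128 as its HTTP proxy, each click shows up as a log line that a Mechanize script can then reproduce.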

    What do you really want to do today, drive IE, or download web resources? Aren't you getting the means and the end confused?

      Yes, I've worked with Mechanize quite a bit.

      I like it for interacting with nice vanilla web pages. I don't know how to use Mechanize to handle websites which pop up additional confirmation dialog screens. I'd prefer Mechanize to OLE + IE, but I've been unable to drive complex websites with Mechanize. (Could be the problem is me, not Mechanize -- I'm well aware of that!)

      What I really want is to use automated, intelligent Perl scripts to interact with complex HTTP and HTTPS web apps designed for IE or NN. I'd rather not use OLE + IE -- using IE is a hack; IE has a big memory and processor footprint; IE is Win32-only; IE object docs are cryptic; etc. But (again maybe due to my lack of skill with LWP and Mechanize), OLE seems the most feasible option....

        I was going to suggest one of the GuiTest modules, but that doesn't cover the screenscraping requirement. You'd have to code in capturing to a file. In general, I don't like anything that can't be done cleanly with a command line as well as a GUI, so perhaps it's time to get that pure-Perl javascript support into Mechanize (just kidding! just kidding!).
