PerlMonks  

On being a browser

by rkg (Hermit)
on Sep 29, 2003 at 11:08 UTC ( #294924=perlmeditation )

Preface

Of course, the best way for one machine to talk to another machine over the web is through some machine-sensible protocol: XML, SOAP, whatever. That being said, there are times when this option isn't available, forcing you to use HTTP(S) to write an app that mimics a browser.

This meditation describes my recent experiments writing such an app. Your Mileage May Vary.

And of course do make sure such automated apps conform with any Terms of Use of the site you're using.

LWP and WWW::Mechanize vs. OLE

There are many posts around the web from folks asking "How do I use Perl to mimic a browser?", and folks always answer, "use LWP" or "use WWW::Mechanize". Those are astoundingly great modules for many circumstances, but they also have limitations.

I'd suggest the strengths of LWP and WWW::Mechanize are:
  • No Details Are Hidden: you can work with the request in all of its glory at various levels of detail
  • RTFM: decent (not great) documentation
  • Folks Know Them: one can obtain reasonably good support and advice from PM and Google searches
  • Solid Code: the modules are well written
  • OO: Nice structure allows easy overloading and extension
I would suggest the weaknesses of LWP and WWW::Mechanize are:
  • Non-Intuitive Interface: the human wants to use the metaphor of how a human browses the web -- fill out that box, click this button, click that link. LWP and WWW::Mechanize make the coder think in terms of forms (which fields live in which forms, the true names (vs. the labels) of fields and buttons, etc.). A different metaphor, less WYSIWYG.
  • Checkboxes and Pulldowns: Setting check boxes and pull-down menus with multiple values is not simple.
  • No Browser To Watch: While debugging, you have to set up your own mechanism to save pages, to see why your code fails
  • Hard For Beginners: one must absorb a good deal of documentation (LWP; LWP::UserAgent; HTTP::Request; HTML::Form, etc) to get a reasonably complex app working
  • Speed: WWW::Mechanize seems slow to me, compared to IE
  • HTTPS: Requires futzing with Crypt::SSLeay, and sometimes causes problems
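
To make the "Checkboxes and Pulldowns" point concrete, here is a minimal sketch of setting a checkbox, a single-select menu, and a multiple-select menu with HTML::Form, the module WWW::Mechanize builds on. The form markup, the field names, and the example.com URL are all invented for illustration; a real page will differ, and you still have to read the page source to learn the true field names. The sketch parses a local string, so it never touches the network.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical form markup standing in for a page you fetched.
my $html = <<'HTML';
<form action="http://www.example.com/signup" method="post">
  <input type="text" name="user">
  <input type="checkbox" name="newsletter" value="yes">
  <select name="country">
    <option value="us">United States</option>
    <option value="nl">Netherlands</option>
  </select>
  <select name="topics" multiple>
    <option value="perl">Perl</option>
    <option value="lwp">LWP</option>
    <option value="ole">OLE</option>
  </select>
  <input type="submit" name="go" value="Sign up">
</form>
HTML

# In scalar context parse() returns the first form on the page.
my $form = HTML::Form->parse($html, 'http://www.example.com/');

$form->value(user => 'rkg');               # plain text field
$form->find_input('newsletter')->check;    # turn the checkbox on
$form->value(country => 'nl');             # single-select pulldown
$form->param(topics => 'perl', 'ole');     # several values in a multi-select

# click() builds the HTTP::Request a browser would send for that button.
my $request = $form->click('go');
print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

Current WWW::Mechanize releases also offer tick() and select() for the same jobs; either way the coder still thinks in terms of forms and field names, not labels on the screen.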
I have been experimenting with an app that interfaces with a website: it needs to log in, redirect to a secure site, examine the status of some pages, post multi-page forms full of hidden cookies and javascript, and repeat a handful of times.

After some struggles with LWP and WWW::Mechanize, I finally decided to try OLE.

I thought "surely OLE will break, or be slower, or be harder to implement."

I was pleasantly surprised: for my needs on this project, OLE was easier. Again, Your Mileage May Vary.

I used http://samie.sourceforge.net/ to get me started.

I'd suggest the strengths of OLE for IE (through SAMIE) are:

  • Intuitive Interface: Fill out a box, click a button, follow a link. Less need to deep-dive into the page source.
  • A Browser To Watch: Set $IE->{visible} = 1, add some time delays, and find problems by watching.
  • Speed: OLE ran quite speedily for me
  • HTTPS & Cookies: Seamless -- IE handles it
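
For readers who have not seen IE driven through OLE, the "Browser To Watch" idea can be sketched with plain Win32::OLE, without SAMIE's wrapper routines. This is only a sketch under assumptions: the sub name browse_visibly and the example.com URL are made up, the delays are arbitrary, and it does nothing at all unless it runs on Windows with Internet Explorer installed.

```perl
use strict;
use warnings;

# Open a URL in a visible IE window, wait for it to load, report the title.
# Returns undef on platforms where Win32::OLE is not available.
sub browse_visibly {
    my ($url) = @_;
    return unless $^O eq 'MSWin32';

    require Win32::OLE;
    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die "Cannot start IE: ", Win32::OLE->LastError, "\n";

    $ie->{Visible} = 1;          # show the window so you can watch
    $ie->Navigate($url);
    sleep 1 while $ie->{Busy};   # crude wait for the page to finish loading

    my $title = $ie->{Document}{title};
    sleep 2;                     # time delay so a human can see the page
    $ie->Quit;
    return $title;
}

my $title = browse_visibly('http://www.example.com/');
print defined $title ? "Saw: $title\n" : "Skipped: needs Windows and IE\n";
```

Cookies, redirects, and HTTPS never appear in the code because IE handles them itself, which is both the convenience and the "All Details Are Hidden" weakness.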
And the weaknesses of SAMIE:
  • Redmond: Requires Win* and IE. Enough said.
  • Overkill: I read a post somewhere noting "instantiating IE to fetch a webpage is like driving your Hummer 30 feet to the end of your driveway to pick up the newspaper."
  • Solidity. I have no data (yet) to support this concern, but I suspect IE/OLE/SAMIE will crump if banged on too quickly or too hard or too many times.
  • All Details Are Hidden: you are running a browser -- everything under the hood (cookies, redirects, etc) is invisible
  • Docs: weak documentation
  • Few Folks Know It: less support from the community. Many google searches for "OLE IE object model" or "OLE IE API" lead to posts that just carp, "Jeepers -- isn't it hard to find docs for OLE and IE?". Docs on the MS site are hard to find or outdated.
  • Code: SAMIE has a few bugs, I think. The code logic is deeply nested and it appears certain branches might not have been thoroughly tested.
  • Procedural: Subroutines and deeply nested "if"s... I prefer clear OO myself.

Summary

Perl is about using the right tool for the job.

For quick page fetches, I'd use LWP. For simple web apps, I'd use WWW::Mechanize. For testing redirectors or lower-level code, I'd use LWP (so as to be able to see exactly what is going on). For interfacing with a complex multipage secure form quickly on a Win* platform, I'd now suggest considering OLE.

rkg

I found the following links of some help:

update (broquaint): shortened width-bursting URLs

Re: On being a browser
by liz (Monsignor) on Sep 29, 2003 at 11:38 UTC
    All Details Are Hidden: you are running a browser -- everything under the hood (cookies, redirects, etc) is invisible...

    I'd worry about this for these reasons:

    • Towards the future: when Microsoft decides it's time to update your browser, will it continue to work? Of course this applies to new versions of WWW::Mechanize as well, but at least that has a test-suite.
    • MSIE is notorious for its security flaws. I would be hesitant about running this on a server platform.
    • Who knows what information about the system MSIE sends to Microsoft. And what it can send "on demand".

    Liz

Re: On being a browser
by Corion (Pope) on Sep 29, 2003 at 11:53 UTC

    I had the same idea, and my approach led me to create WWW::Mechanize::Shell, together with HTML::Display, which allow semi-natural browsing via the command line and still have a browser available to render the resulting pages. It's not perfect and suffers from the same JavaScript pitfalls as the IE solution does, but it works nicely enough to easily get a skeleton script for automating http/web tasks.

    perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
    $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
    ($c = $d->accept())->get_request(); $c->send_response( new #in the
    HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web
Re: On being a browser
by atcroft (Monsignor) on Sep 29, 2003 at 16:41 UTC

    Another resource of possible interest is the O'Reilly book Web Client Programming with Perl, which (as I understand it) is no longer available in dead-tree format, but is available online as part of the O'Reilly Open Books Project. Perhaps a little dated, but still useful (at a minimum) for getting a basic grasp on what is occurring as a web client, as well as for several introductory examples of using the LWP module.

      That book has been, thankfully, pretty much fully replaced by Sean Burke's excellent Perl & LWP book.

      There's also the new Spidering Hacks which covers Mech quite well along with many, many, practical scraping bits.

      Sure they're not free, but they're damn cheap if you use Safari Online Books (they also have a free 2 week trial available).

Re: On being a browser
by tomhukins (Curate) on Sep 29, 2003 at 17:58 UTC
    I like your approach, especially for specifically testing interoperability with Internet Explorer. For a simpler, pure Perl system, you might want to investigate HTTP::Recorder which sits as an HTTP proxy between your browser and your server and records HTTP requests for WWW::Mechanize to run automatically. (Disclaimer: I haven't used this module and it suffers from sparse documentation at the moment).
Re: On being a browser
by rkg (Hermit) on Sep 30, 2003 at 01:43 UTC
Re: On being a browser
by DapperDan (Pilgrim) on Sep 30, 2003 at 10:37 UTC
    I found your post interesting because I hadn't heard of some of the modules you mention.

    Of course, the best way for one machine to talk to another machine over the web is through some machine-sensible protocol: XML, SOAP, whatever.

    That statement seems quite ambiguous to me. Are you talking about application protocols (e.g. HTTP, SMTP) or formats used in those protocols (HTML, RFC822)?

    Either way, you may be interested in designing your systems RESTfully. I know I am.

Re: On being a browser
by t'mo (Pilgrim) on Sep 30, 2003 at 17:45 UTC
Re: On being a browser
by Jenda (Abbot) on Oct 01, 2003 at 21:15 UTC

    I'll definitely have a look at your Samie. We do have (in production for more than a year) a similar VB solution, but I'd of course prefer a Perl one.

    What we have is a COM object that wraps the browser (actually up to four browser windows) and provides methods similar to Samie's. Then there is a service that loads the data and commands from a database, instantiates the COM object and drives it through the pages. At times it's pretty slow, especially if the site I need to work with is slow. Apart from running several instances of the service I don't see a way to work on several sites at once.

    Did you try to run several Samies in several threads? How do you handle popup windows?

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

    Edit by castaway: Closed small tag in signature

Mechanize docs
by petdance (Parson) on Oct 04, 2003 at 01:59 UTC
    What can I do to improve the Mechanize docs? I'm open to suggestions. I'm thinking of putting together a tutorial. There are also three hacks/examples in Web Spidering Hacks.

    xoxo,
    Andy

      To make the module more friendly to beginners, the documentation might offer examples of potentially tricky situations (for beginners)...
      • examples of how to use checkboxes & radio buttons
      • examples of how to use pull-down single-select menus
      • examples of how to use multiple-select menus
      • examples of https and https-with-post
      • examples of using WWW::Mechanize::Shell to speed things up
      • examples of wise WWW::Mechanize use (check return codes on every page, check page returned is the page expected, etc)
      • tips for WWW::Mechanize debugging
      As I said in OP, I am a big fan of W::M and W::M::S. They are great modules. However, better docs could allow more folks to benefit from their power and convenience.
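
The "wise WWW::Mechanize use" item above (check the return code, check that the page is the one expected, save a copy for debugging) could look something like the sketch below. The helper name fetch_checked, the file names, and the page content are all hypothetical; the page is written to a temporary file and fetched through a file: URL so the sketch needs no network, but an http(s) URL would be used the same way.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use URI::file;
use WWW::Mechanize;

# Fetch a page, save a copy to disk, and die unless it is the page we expect.
sub fetch_checked {
    my ($mech, $url, $expected_title, $save_as) = @_;

    my $response = $mech->get($url);
    die "GET $url failed: ", $response->status_line, "\n"
        unless $mech->success;

    # Our stand-in for "a browser to watch": keep every page for later study.
    $mech->save_content($save_as);

    my $title = $mech->title;
    $title = '' unless defined $title;
    die "Expected '$expected_title' but got '$title'\n"
        unless $title =~ /\Q$expected_title\E/;
    return $title;
}

# Stand-in page on local disk, so nothing here depends on a real site.
my $dir  = tempdir(CLEANUP => 1);
my $page = "$dir/login.html";
open my $fh, '>', $page or die "Cannot write $page: $!";
print {$fh} "<html><head><title>Login page</title></head><body>hi</body></html>";
close $fh;

# autocheck off, because the helper does its own checking.
my $mech  = WWW::Mechanize->new(autocheck => 0);
my $title = fetch_checked($mech, URI::file->new($page)->as_string,
                          'Login', "$dir/step1.html");
print "Fetched: $title\n";
```

Numbering the saved files per step (step1.html, step2.html, ...) gives a crude trail to read when a multi-page form goes wrong.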

      rkg

        Yeah, that would be great (https information). In fact, if such a thing exists, could someone drop me an email at lewtone@myfastmail.com
Re: On being a browser
by henrywasserman (Initiate) on Oct 21, 2003 at 04:29 UTC
    As far as the overkill argument is concerned, let me magnify the argument: no, you should not use a front end loader to get sugar from the sugar bowl. Of course use a spoon. But if you are trying to dig outdoors then by all means use the front end loader; just make sure you're not digging near any gas lines.

    samie should be used when you want to test how the front end of an app server is behaving for the end user. Remember them? The user, that is. Sometimes there can be millions of them, but only if the browser is behaving rather intuitively.

    samie should be used to make sure that the other side of the server is behaving itself properly. The client side. And what better application to test with than Internet Explorer itself, since that is what about 90% of the people are using right now? (Granted, this will probably change over time, but there is no time like the present: 10/21/2003.)

    People who have tried the expensive browser automation tools know exactly what I'm talking about. I'm talking to the winrunner, robotj, rational robot, visual test, MS Test, SilkTest, qa-partner gurus out there. samie was written for this small group of qa test engineers....

    Now solidity. samie has solidity. It's as rock solid as perl itself, since it's written in 100% active perl. And because it's written in perl it's as fast as perl. It operates about 10 times faster than its $5000 winrunner competition. IE doesn't slump along like it does under the weight of the other expensive automation tools (Silktest, winrunner or rational robot) waiting for some goofy language interpreter to figure out what should happen next.

    With the elegant design of perl extensions, you have a direct c reference to the dhtml. This is the exact same engine in real time that is running javascript in the active session of internet explorer.

    samie is simply as fast as javascript because it is using the very same html DOM that Internet Explorer is using during the browser's session.

    More people are getting to know samie every day, there have already been over 640 downloads.

    samie is so short and simple the only bugs it contains are the ones you write for it yourself.

    4 have been found in the last 1 1/2 years and none in the last year.

    If you do find any though, write them up and I'll be glad to fix them.

    I have used samie for two different companies and it has stood up quite well for both. It doesn't leak memory, and it runs blindingly fast, because it's all perl.

    Download it and enjoy - it's free and always will be.

    I've been blowing my own horn for quite a while now.

    But what do I got to lose?

    Let me know what you think.

    - Henry
      Check out samie's slingshot demo movie. http://samie.sourceforge.net/slingshotmovie.html
      Hello, I am not really a programmer, but can write perl/tcl/C++, and so on.
      My task is to automate logging on to a few HTTPS websites (cookies, redirects and all), and to scrape the screens, putting the data (lots of it) into an Excel spreadsheet; then to automate the spreadsheet manipulation of the data, and put the resultant data back on a site by mimicking a human user's interaction with the site.

      I have been researching the tools to do that, and am finally completely confused:
      I tried perl LWP/Mechanize, and got stuck after one of the pages I need to talk to said my browser does not support javascript and cascading frames.
      I then thought that a better way to do the job would be to automate IE itself since I know it is supported, plus I can see what's going on.

      I have now found a bunch of tools similar to samie, and my question really is: what is the right tool for the job? Do I not know enough about perl to make it work with JavaScript? Why not use VB to automate IE (a lot of people seem to be doing that)? And can perl let me pull up Excel and run macros in it like VB would?

      I know this is a lot of questions (reflecting my state of confusion) but any comments are much appreciated, thanks in advance...
