
What Tools Do You Use With WWW::Mechanize

by Limbic~Region (Chancellor)
on Oct 03, 2011 at 14:13 UTC
Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Normally, if a site doesn't work with WWW::Mechanize due to JavaScript/Ajax, I reach for WWW::Selenium. This has worked quite well for me in the past (see Using WWW::Selenium To Test Or Automate An Ajax Website). While I am aware of WWW::Scripter, Win32::IE::Mechanize and WWW::Mechanize::Firefox, I have always just reached for one of the two tools that I have invested the most effort and energy in.

Lately, I have been freelancing and a number of my clients are on Linux but still not very computer savvy. Having to start up an xterm and export a DISPLAY would likely be perceived as clunky software and not win me any repeat business. I considered Running Selenium Headless but that posed its own problems for delivering code to a client. What I needed was a way to make WWW::Mechanize work. I reached for the Firebug add-on for Firefox.

The first project had two hurdles. The first was that clicking on "links" caused the page content to change while the visible URL stayed the same. This was resolved by examining the GET requests Firefox was making behind the scenes. The second hurdle was that selecting a particular item in a select drop-down sent a JSON request behind the scenes. After much gnashing of teeth, I discovered this little gem:

    sub post_json {
        my ($mech, $json, $url) = @_;
        my $req = HTTP::Request->new(POST => $url);
        $req->content_type('application/json');
        $req->content($json);
        return $mech->request($req);
    }
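For anyone wanting to see what such a request looks like before sending it, here is a minimal sketch that builds the same kind of JSON POST from a Perl data structure. The helper name build_json_request, the endpoint URL and the payload keys are all made up for illustration; the real values would come from whatever Firebug shows the browser sending.

```perl
use strict;
use warnings;
use HTTP::Request;
use JSON::PP;

# Build (but do not send) a JSON POST request from a Perl data structure.
# build_json_request is a hypothetical helper, not part of any module.
sub build_json_request {
    my ($url, $data) = @_;
    # canonical() sorts hash keys so the encoded body is reproducible
    my $json = JSON::PP->new->utf8->canonical->encode($data);
    my $req  = HTTP::Request->new(POST => $url);
    $req->content_type('application/json');
    $req->content($json);
    return $req;
}

my $req = build_json_request(
    'http://example.com/ajax/select',        # hypothetical endpoint
    { item_id => 42, action => 'select' },   # hypothetical payload
);

# Dump the request to compare it with what the browser sent
print $req->as_string;

# To actually send it with an existing WWW::Mechanize object:
#   my $response = $mech->request($req);
```

Printing the request first makes it easy to diff against the browser's request line by line before involving the server at all.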

The next project I used Firebug on really had me baffled. The site didn't appear to require JavaScript at all, and yet I was getting completely different results from WWW::Mechanize than from Firefox. I made sure I was using $mech->agent_alias('Windows Mozilla'); but to no avail. After comparing the headers sent by the two, I played a hunch and did the following:

    for my $key (keys %ff_header) {
        $mech->delete_header($key);
        $mech->add_header($key => $ff_header{$key});
    }
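The snippet above assumes %ff_header already exists. One way it might be filled, sketched here under assumptions, is to paste the request headers copied from Firebug into a heredoc and parse them. The parse_headers helper and the sample header values are hypothetical, and dropping Host, Content-Length and Cookie is my own guess at what is better left to LWP and the cookie jar.

```perl
use strict;
use warnings;

# Parse raw "Name: value" lines (as copied from a browser's network panel)
# into a hash. parse_headers is a hypothetical helper name.
sub parse_headers {
    my ($raw) = @_;
    my %header;
    for my $line (split /\r?\n/, $raw) {
        # Skip anything that does not look like "Name: value"
        next unless $line =~ /^([^:\s]+):\s*(.*?)\s*$/;
        $header{$1} = $2;
    }
    return %header;
}

# Sample values only; use whatever the browser really sent
my %ff_header = parse_headers(<<'END');
Host: example.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Connection: keep-alive
END

# Assumption: these are better managed by LWP and the cookie jar
delete @ff_header{qw(Host Content-Length Cookie)};

printf "%s => %s\n", $_, $ff_header{$_} for sort keys %ff_header;
```

Adding the headers back one at a time is also a cheap way to find out which one the server actually cares about.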

It magically started working as expected.

I feel like this is probably old news to most of you and that there are shiny new tools I should be learning. What are they? Do you use certain ones for certain tasks but not others? I realize that some sites will be nearly impossible to automate with WWW::Mechanize without a JavaScript engine and I am fine with that. I am just looking to increase the number of projects I can complete with just mech.

Cheers - L~R

Replies are listed 'Best First'.
Re: What Tools Do You Use With WWW::Mechanize
by cavac (Deacon) on Oct 03, 2011 at 17:12 UTC
    One problem i encountered with pure WWW::Mechanize was Content-Encoding (compression). At least the version i have installed here announces to the server that it accepts gzip-compressed content but doesn't actually decode it.

    Try to use WWW::Mechanize::GZip instead and see if the header problem clears up.

    Don't use '#ff0000':
    use Acme::AutoColor; my $redcolor = RED();
    All colors subject to change without notice.

        I encountered the problem while implementing content compression in Maplat. It happened while handling forms and downloading dynamically created XML files.

        The client was running on Windows with ActivePerl. There *may* have been a misconfiguration or missing package. But i have since reinstalled the Windows development server, so i'll have to recreate the circumstances and see if i can come up with a small test system.

        It may have been a bug in Maplat as well. But then again it only happened with WWW::Mechanize.

        Do you have Compress::Zlib installed? That is - from what i can tell with only a glance at the source - the main difference between WWW::Mechanize and WWW::Mechanize::GZip. The former uses Compress::Zlib when available while the latter requires it to be installed.
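A quick check along these lines might help rule the module in or out on a given machine. It is only a sketch: it tests whether Compress::Zlib loads and whether a gzip round trip works locally, and says nothing about how WWW::Mechanize itself handles the response. The sample payload is made up.

```perl
use strict;
use warnings;

# Try to load Compress::Zlib without dying if it is missing
my $has_zlib = eval { require Compress::Zlib; 1 } ? 1 : 0;
print "Compress::Zlib available: ", ($has_zlib ? "yes" : "no"), "\n";

# Made-up payload standing in for a dynamically generated XML file
my $original = "<xml>" . ("data " x 100) . "</xml>";
my $restored;

if ($has_zlib) {
    # Compress and decompress in memory to confirm the round trip works
    my $gzipped = Compress::Zlib::memGzip($original);
    $restored   = Compress::Zlib::memGunzip($gzipped);
    printf "original %d bytes, gzipped %d bytes, round trip %s\n",
        length $original, length $gzipped,
        (defined $restored && $restored eq $original ? "ok" : "FAILED");
}
```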

        That said, i'll try to recreate the problem but it might take some time. I'm currently in the middle of preparing a big server upgrade (nice monks are allowed to upgrade to Postgres 9.x while naughty ones have to keep using flat files).

        Don't use '#ff0000':
        use Acme::AutoColor; my $redcolor = RED();
        All colors subject to change without notice.
Re: What Tools Do You Use With WWW::Mechanize
by OfficeLinebacker (Chaplain) on Oct 03, 2011 at 22:43 UTC
    I'll tell you a technique I probably *should* be using with Mech, which is to either locally cache content or introduce some kind of random delays between loading a page and clicking on a link in it because sometimes servers don't like bots accessing their sites. Also, I copy the UA string verbatim from my browser which works, rather than trying to figure out exactly what parts of it are what I need.
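The local caching idea could be sketched roughly as follows. Everything here is an assumption for illustration: cached_get is a hypothetical helper, the cache lives in a temporary directory keyed by an MD5 of the URL, and the fetcher is passed in as a code reference so the sketch runs without touching the network. With a real Mech object the fetcher would wrap $mech->get.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;
use File::Temp qw(tempdir);

# Throwaway cache directory; a real script would use a persistent path
my $cache_dir = tempdir(CLEANUP => 1);

# Return cached content for $url, calling $fetch->($url) only on a miss.
# cached_get is a hypothetical helper, not part of WWW::Mechanize.
sub cached_get {
    my ($url, $fetch) = @_;
    my $file = File::Spec->catfile($cache_dir, md5_hex($url));

    if (-e $file) {
        open my $in, '<:raw', $file or die "Cannot read $file: $!";
        local $/;                      # slurp the whole file
        return scalar <$in>;
    }

    my $content = $fetch->($url);
    open my $out, '>:raw', $file or die "Cannot write $file: $!";
    print {$out} $content;
    close $out or die "Cannot close $file: $!";
    return $content;
}

# Stand-in fetcher that counts how often the "server" is hit
my $calls      = 0;
my $fake_fetch = sub { $calls++; return "<html>page for $_[0]</html>" };

cached_get('http://example.com/a', $fake_fetch) for 1 .. 3;
print "fetch called $calls time(s)\n";
```

Only the first call reaches the fetcher, so repeated test runs against the same pages do not hammer the server.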

    I like computer programming because it's like Legos for the mind.
      Greetings, esteemed monks!

      To reply to my own concern, I came up with this for generating wait times between link clicking/back() calls in a Mech script. What do you think?

      #!/usr/bin/perl --
      use strict;
      use warnings;

      my $i1 = int(rand(5)+1);
      my $i2 = int(rand(2));
      my $i = 0;
      while ($i<10){
          print "$i1: $i2\n";
          my $interval = $i1 + ($i2*$i1);
          print "waiting for $interval seconds...\n";
          sleep($interval);
          $i1 = int(rand(5)+1);
          $i2 = int(rand(2));
          $i++;
      }

      Sample output:

      1: 0
      waiting for 1 seconds...
      4: 0
      waiting for 4 seconds...
      1: 1
      waiting for 2 seconds...
      4: 1
      waiting for 8 seconds...
      4: 0
      waiting for 4 seconds...
      4: 1
      waiting for 8 seconds...
      4: 0
      waiting for 4 seconds...
      5: 0
      waiting for 5 seconds...
      1: 1
      waiting for 2 seconds...
      1: 1
      waiting for 2 seconds...

      I like computer programming because it's like Legos for the mind.
        Have you seen WWW::Mechanize::Sleepy? Personally, I use something along the lines of:
        use constant OK => 200;   # HTTP 200; the original snippet left OK undefined

        # Sleep a random interval between $duration and 2 * $duration - 1 unit
        sub rest {
            my ($duration) = @_;
            sleep $duration;
            sleep rand($duration);
        }

        sub fetch_page {
            my ($mech, $action, $target, $max, $duration) = @_;
            for (1 .. $max) {
                rest($duration);
                # $action is a method name such as 'get' or 'follow_link'
                eval { $mech->$action($target); };
                return if ! $@ && $mech->status == OK;
            }
            # The original referenced an undeclared $url here
            die "Failed to fetch '$target' after '$max' attempts\n";
        }

        Of course, if you want to allow for HTTP redirects then you will need to change status == OK to include acceptable HTTP codes. Additionally, if you use Time::HiRes to override sleep, you can easily sleep for fractions of a second. In truth, I typically use milliseconds.
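As a rough sketch of both points, assuming nothing beyond core Perl: Time::HiRes can replace sleep with a version that accepts fractional seconds, and the status check can be widened with a lookup hash. The helper names jittered_delay and is_acceptable and the particular list of codes are my own choices for illustration, not anything WWW::Mechanize provides.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);   # sleep now accepts fractional seconds

# Return a random delay in the range [$base, 2 * $base) seconds.
# jittered_delay is a hypothetical helper name.
sub jittered_delay {
    my ($base) = @_;
    return $base + rand($base);
}

# Status codes treated as success; this list is an assumption
my %acceptable = map { $_ => 1 } (200, 301, 302);

# is_acceptable is a hypothetical helper name
sub is_acceptable {
    my ($status) = @_;
    return exists $acceptable{$status} ? 1 : 0;
}

my $delay = jittered_delay(0.25);   # 250 to 500 milliseconds
sleep $delay;
printf "slept %.3f seconds\n", $delay;

for my $status (200, 302, 404) {
    printf "%d is %s\n", $status,
        (is_acceptable($status) ? "acceptable" : "not acceptable");
}
```

In fetch_page, the comparison against OK would then become a call like is_acceptable($mech->status).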

        Cheers - L~R

Node Type: perlquestion [id://929348]
Approved by Corion
Front-paged by planetscape