Robust Handling of Broken Links in Mechanize?

by pat_mc (Pilgrim)
on Nov 20, 2009 at 09:58 UTC
pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks -

I am currently writing a simple script based on the WWW::Mechanize module to download a bunch of PDFs from a website. The problem I am encountering is that the site seems to contain a link which points to a document that is unavailable on the server. As a result, the GET operation fails with the following error:

Error GETing <URL>: Not Found at extract.pl line 86

I have tried to build robust handling of such errors into the script, like this:
sub download() {
    my $doc  = shift @_;
    my $mech = WWW::Mechanize->new();

    return unless defined( $mech->get( $doc ) );

    my $link = $mech->find_link( url_regex => qr/\.pdf/ );
    return unless defined( $link );

    $link = $link->url_abs;
    return unless ( $mech->get( $link ) );    # This is the GET operation which fails.

    my $name = $1 if $link =~ /.+\/(.+\.pdf)/;
    $mech->save_content( $name );
}
Unfortunately, the script still aborts as soon as the broken link is encountered.

Can you please advise how I can modify my code in order for the script to continue downloading even when broken links are hit upon?

Thanks in advance and best regards -

Pat

Re: Robust Handling of Broken Links in Mechanize?
by Wolfgang (Novice) on Nov 20, 2009 at 11:25 UTC
    The WWW::Mechanize documentation on CPAN explains the 'onerror' option. Try using it; it may help ;-) (a sketch of that approach follows below). Since you did not set any options, Mechanize falls back to the defaults its author found most helpful. Wolfgang
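    A minimal sketch of that approach (untested; $start_url is a placeholder for the page that lists the PDFs, and the explicit $mech->success checks are an assumption about how the failures should be handled, since with onerror => undef the get() calls no longer die):

    use strict;
    use warnings;
    use WWW::Mechanize;

    # Placeholder URL for the page that lists the PDFs.
    my $start_url = 'http://www.example.com/reports/';

    # onerror => undef stops Mechanize from croaking on HTTP errors,
    # so every response has to be checked explicitly.
    my $mech = WWW::Mechanize->new( onerror => undef );

    $mech->get( $start_url );
    die "Cannot fetch $start_url (HTTP ", $mech->status, ")\n" unless $mech->success;

    my $link = $mech->find_link( url_regex => qr/\.pdf/ );
    die "No PDF link found on $start_url\n" unless defined $link;

    my $pdf_url = $link->url_abs;
    $mech->get( $pdf_url );

    if ( $mech->success ) {
        my ($name) = $pdf_url =~ m{([^/]+\.pdf)$}i;
        $mech->save_content( $name ) if defined $name;
    }
    else {
        # A broken link is reported and skipped instead of aborting the run.
        warn "Skipping broken link $pdf_url (HTTP ", $mech->status, ")\n";
    }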
      Wolfgang -

      This is great stuff ... it looks like this fixes the problem:
      sub download() {
          my $doc  = shift @_;
          my $mech = WWW::Mechanize->new( onerror => undef );

          return unless defined( $mech->get( $doc ) );

          my $link = $mech->find_link( url_regex => qr/\.pdf/ );
          return unless defined( $link );

          $link = $link->url_abs;
          return unless ( $mech->get( $link ) );    # This is the GET operation which fails.

          my $name = $1 if $link =~ /.+\/(.+\.pdf)/;
          $mech->save_content( $name );
      }
      Thanks for your help! It made my day.

      Cheers -

      Pat
Re: Robust Handling of Broken Links in Mechanize?
by vishi83 (Pilgrim) on Nov 20, 2009 at 12:09 UTC
    Hi,

    I had this issue when I had to write a similar script: Mechanize aborts when it hits a broken link. You could probably achieve your task using LWP::Simple and HTML::SimpleLinkExtor instead; a sketch follows below.
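    For example, something along these lines (a rough sketch, untested; $page_url stands in for the page that lists the PDFs, and URI is used to turn relative links into absolute ones):

    use strict;
    use warnings;
    use LWP::Simple qw(get getstore);
    use HTTP::Status qw(is_success);
    use HTML::SimpleLinkExtor;
    use URI;

    # Placeholder URL for the page that lists the PDFs.
    my $page_url = 'http://www.example.com/reports/';

    my $html = get( $page_url );
    die "Could not fetch $page_url\n" unless defined $html;

    # Pull all links out of the page and keep only the PDFs.
    my $extor = HTML::SimpleLinkExtor->new();
    $extor->parse( $html );

    for my $link ( grep { /\.pdf$/i } $extor->links ) {
        my $abs = URI->new_abs( $link, $page_url )->as_string;
        my ($name) = $abs =~ m{([^/]+\.pdf)$}i;
        next unless defined $name;

        # getstore() returns the HTTP status code instead of dying,
        # so a broken link is reported and skipped rather than aborting.
        my $status = getstore( $abs, $name );
        warn "Skipping $abs (HTTP $status)\n" unless is_success( $status );
    }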


    Thanks.



    A perl Script without 'strict' is like a House without Roof; Both are not Safe;
