Robust Handling of Broken Links in Mechanize?

by pat_mc (Pilgrim)
on Nov 20, 2009 at 09:58 UTC
pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks -

I am currently writing a simple script based on the Mechanize module to download a bunch of PDFs from a website. The problem I am encountering is that the site seems to contain a link which points to a document that is unavailable on the server. As a result, the GET method returns the following error:

Error GETing <URL>: Not Found at extract.pl line 86

I have tried to incorporate robust handling of such errors like this:
sub download() {
    my $doc  = shift @_;
    my $mech = WWW::Mechanize->new();
    return unless defined( $mech->get( $doc ) );

    my $link = $mech->find_link( url_regex => qr/\.pdf/ );
    return unless defined( $link );
    $link = $link->url_abs;

    return unless ( $mech->get( $link ) );    # This is the GET operation which fails.

    my $name = $1 if $link =~ /.+\/(.+\.pdf)/;
    $mech->save_content( $name );
}
Unfortunately, the script still aborts as soon as the broken link is encountered.

Can you please advise how I can modify my code so that the script continues downloading even when it encounters a broken link?

Thanks in advance and best regards -

Pat

Re: Robust Handling of Broken Links in Mechanize?
by Wolfgang (Novice) on Nov 20, 2009 at 11:25 UTC
    The WWW::Mechanize documentation on CPAN explains the 'onerror' option. Try using it, it may help ;-) Since you did not set any options, Mechanize falls back to the defaults its author found most helpful. Wolfgang
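    To illustrate, here is a minimal, untested sketch (the URL is just a placeholder): with onerror => undef, Mechanize no longer dies on HTTP errors, so the script checks the outcome itself via success() and status().

        use WWW::Mechanize;

        # onerror => undef stops Mechanize from dying on HTTP errors;
        # the caller then has to check the outcome explicitly.
        my $mech = WWW::Mechanize->new( onerror => undef );

        $mech->get( 'http://example.com/missing.pdf' );    # placeholder URL
        unless ( $mech->success ) {
            warn "Skipping broken link: ", $mech->status, "\n";
            # carry on with the next document instead of aborting
        }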
      Wolfgang -

      This is great stuff ... it looks like this fixes the problem:
      sub download() {
          my $doc  = shift @_;
          my $mech = WWW::Mechanize->new( onerror => undef );
          return unless defined( $mech->get( $doc ) );

          my $link = $mech->find_link( url_regex => qr/\.pdf/ );
          return unless defined( $link );
          $link = $link->url_abs;

          return unless ( $mech->get( $link ) );    # This is the GET operation which fails.

          my $name = $1 if $link =~ /.+\/(.+\.pdf)/;
          $mech->save_content( $name );
      }
      Thanks for your help! It made my day.

      Cheers -

      Pat
Re: Robust Handling of Broken Links in Mechanize?
by vishi83 (Pilgrim) on Nov 20, 2009 at 12:09 UTC
    Hi,

    I had this issue when I had to write a similar script. Mechanize aborts when there is a broken link. You can probably achieve your task using LWP::Simple and HTML::SimpleLinkExtor instead.
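    For example, a rough, untested sketch of that approach (the page URL is a placeholder, and the PDF filter is an assumption based on the original question; LWP::Simple's getstore() reports the HTTP status, so broken links can simply be skipped):

        use strict;
        use warnings;
        use LWP::Simple qw(get getstore is_success);
        use HTML::SimpleLinkExtor;

        my $page = 'http://example.com/docs/';               # placeholder URL
        my $html = get( $page );
        die "Could not fetch $page\n" unless defined $html;

        # pass the base URL so relative links come back absolute
        my $extor = HTML::SimpleLinkExtor->new( $page );
        $extor->parse( $html );

        for my $url ( grep { /\.pdf$/i } $extor->links ) {
            my ($name) = $url =~ m{([^/]+\.pdf)$}i;
            my $status = getstore( $url, $name );
            warn "Skipping broken link $url ($status)\n" unless is_success( $status );
        }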


    Thanks.



    A perl Script without 'strict' is like a House without Roof; Both are not Safe;
