http://www.perlmonks.org?node_id=808376

pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks -

I am currently writing a simple script based on the WWW::Mechanize module to download a bunch of PDFs from a website. The problem I am encountering is that the site seems to contain a link which points to a document that is unavailable on the server. As a result, the GET request fails with the following error:

Error GETing <URL>: Not Found at extract.pl line 86

I have tried to incorporate robust handling of such error messages like this:
sub download() {
    my $doc = shift @_;
    my $mech = WWW::Mechanize -> new();
    return unless defined( $mech -> get( $doc ) );
    my $link = $mech -> find_link( url_regex => qr/\.pdf/ );
    return unless defined( $link );
    $link = $link -> url_abs;
    return unless ( $mech -> get ( $link ) ); # This is the GET operation which fails.
    my $name = $1 if $link =~/.+\/(.+\.pdf)/;
    $mech -> save_content( $name );
}
Unfortunately, the script still aborts as soon as the broken link is encountered.

Can you please advise how I can modify my code so that the script continues downloading even when it encounters a broken link?

Thanks in advance and best regards -

Pat

Re: Robust Handling of Broken Links in Mechanize?
by Wolfgang (Novice) on Nov 20, 2009 at 11:25 UTC
    The WWW::Mechanize documentation on CPAN explains the 'onerror' option. Try using it, it may help ;-) Since you did not set any options, Mechanize falls back to the defaults its author found most helpful. Wolfgang
      Wolfgang -

      This is great stuff ... it looks like this fixes the problem:
      sub download() {
          my $doc = shift @_;
          my $mech = WWW::Mechanize -> new( onerror => undef );
          return unless defined( $mech -> get( $doc ) );
          my $link = $mech -> find_link( url_regex => qr/\.pdf/ );
          return unless defined( $link );
          $link = $link -> url_abs;
          return unless ( $mech -> get ( $link ) ); # This is the GET operation which fails.
          my $name = $1 if $link =~/.+\/(.+\.pdf)/;
          $mech -> save_content( $name );
      }
      Thanks for your help! It made my day.

      Cheers -

      Pat
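
One caveat with silencing errors this way: get() hands back an HTTP::Response object whether or not the request succeeded, so a test on its return value never triggers and the body of a 404 page could end up saved under the PDF's name. Below is a minimal sketch of one way to skip broken links explicitly, using the documented autocheck option together with success() and status(). The helper pdf_name and the warning text are hypothetical, added only for illustration.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: return the last path segment of a URL if it
# names a .pdf file, otherwise undef.
sub pdf_name {
    my ($url) = @_;
    return $url =~ m{/([^/?#]+\.pdf)(?:[?#].*)?\z}i ? $1 : undef;
}

# Fetch the page at $doc, follow its first PDF link and save the file.
# Returns the saved file name, or nothing if any step fails.
sub download {
    my ($doc) = @_;

    # autocheck => 0: failed requests do not die, so each one is checked by hand.
    my $mech = WWW::Mechanize->new( autocheck => 0 );

    $mech->get($doc);
    return unless $mech->success;

    my $link = $mech->find_link( url_regex => qr/\.pdf/ ) or return;
    my $url  = $link->url_abs;

    $mech->get($url);
    unless ( $mech->success ) {
        # Broken link: report the HTTP status and let the caller move on.
        warn "Skipping $url: ", $mech->status, "\n";
        return;
    }

    my $name = pdf_name("$url") or return;
    $mech->save_content($name);
    return $name;
}
```

Calling download($_) for each page URL in a loop then carries on past any link that returns an error status, and nothing is written to disk for those links.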
Re: Robust Handling of Broken Links in Mechanize?
by vishi83 (Pilgrim) on Nov 20, 2009 at 12:09 UTC
    Hi,

    I had this issue when I had to write a similar script. By default, Mechanize aborts when it hits a broken link. You can probably achieve your task using LWP::Simple and HTML::SimpleLinkExtor instead.
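
A rough sketch of what that alternative might look like, assuming the PDFs are plain <a href> links on the page. The sub names pdf_links and download_all are hypothetical. Relative links are resolved with URI->new_abs rather than relying on the extractor to do it.

```perl
use strict;
use warnings;
use LWP::Simple qw(get getstore is_success);
use HTML::SimpleLinkExtor;
use URI;

# Hypothetical helper: return the absolute URLs of all <a> links in
# $html that end in .pdf, resolved against $base.
sub pdf_links {
    my ( $html, $base ) = @_;
    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse($html);
    return grep { /\.pdf\z/i }
           map  { URI->new_abs( $_, $base )->as_string }
           $extor->a;
}

# Download every PDF linked from $page, skipping links that fail.
sub download_all {
    my ($page) = @_;

    # LWP::Simple's get() returns undef on failure instead of dying.
    my $html = get($page);
    unless ( defined $html ) {
        warn "Could not fetch $page\n";
        return;
    }

    for my $url ( pdf_links( $html, $page ) ) {
        my ($name) = $url =~ m{([^/]+)\z};

        # getstore() returns the HTTP status code of the request.
        my $rc = getstore( $url, $name );
        warn "Skipping $url: HTTP $rc\n" unless is_success($rc);
    }
    return;
}
```

Because get() and getstore() report failure through their return values, a broken link produces only a warning and the loop moves on to the next PDF.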


    Thanks.



    A perl Script without 'strict' is like a House without Roof; Both are not Safe;