PerlMonks  

Check links.. if they do exist, then print link to file...but how do i check the existance of a link?(validity?)

by dark314 (Sexton)
on Jul 27, 2006 at 18:32 UTC ( [id://564201]=perlquestion )

dark314 has asked for the wisdom of the Perl Monks concerning the following question:

Here is my goal. Note that my code prints to F, a file. Unfortunately this file is very large, with a lot of links that don't exist in the first place. I want to check each link and write it to the file only if it exists, so that I waste less time viewing the HTML file when it is finished. Any ideas? THANK YOU.


NOTE: $lead and $num are there only because the number in the jpg file name is always 3 characters, so don't worry about that.

#!/usr/bin/perl
$num = 0;
open F, ">datafile" or die "Can't open datafile: $!";
for ($i = 1; $i <= 1000; $i++) {
    if ($num < 10) {
        $lead = "00";
    } elsif ($num < 100 && $num >= 10) {
        $lead = "0";
    } else {
        $lead = '';
    }
    print F qq(<img src="http://somewebsite/65081) . $lead . $num . q(.jpg">) . "\n";
    $num++;
}
close F;

Replies are listed 'Best First'.
Re: Check links.. if they do exist, then print link to file...but how do i check the existance of a link?(validity?)
by davido (Cardinal) on Jul 27, 2006 at 19:32 UTC

    I would probably use LWP::Simple to verify that the link exists.

    use strict;
    use warnings;
    use LWP::Simple;

    my $data_out_file = $ARGV[0] || 'links.txt';
    open my $out_handle, '>', $data_out_file or die $!;

    for my $index ( 0000 .. 1000 ) {
        my $serial = sprintf "%04s", $index;
        my $site = qq(http://somewebsite/65081$serial.jpg);
        sleep 1;
        next unless head( $site );
        my $link = qq(<img src = ") . $site . qq(">);
        print $out_handle $link, "\n";
    }
    close $out_handle or die $!;

    The head() function grabs the document header, which should tell you if it exists or not. In scalar context it simply returns true if successful.

    Update: Added sleep 1; to throttle how quickly you'll be hitting the site. It now will take your script 1000 seconds to execute (almost 17 minutes) but you will avoid overloading the remote server. ...always best to play nice.


    Dave
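
A minimal sketch of the head() check davido describes, pulled out into a helper. LWP::Simple's head() returns a true value in scalar context when the request succeeds. The names link_exists and existing_links, and the optional $checker argument, are assumptions made for this example and are not part of LWP::Simple; the checker is injectable only so the logic can be tried without touching the network.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $url answers a HEAD request, else 0.
# $checker defaults to LWP::Simple::head; pass a stub to test offline.
sub link_exists {
    my ($url, $checker) = @_;
    $checker ||= do { require LWP::Simple; \&LWP::Simple::head };
    return $checker->($url) ? 1 : 0;    # scalar context: true on success
}

# Hypothetical helper: keep only the URLs that pass the check.
sub existing_links {
    my ($urls, $checker) = @_;
    return grep { link_exists($_, $checker) } @$urls;
}
```

With a list of candidate URLs in @candidates, something like print qq(<img src="$_">\n) for existing_links(\@candidates); would write only the links that respond. The sleep 1 throttle from the reply above is left out here and would belong inside the loop when hitting a real server.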

Re: Check links.. if they do exist, then print link to file...but how do i check the existance of a link?(validity?)
by Albannach (Monsignor) on Jul 27, 2006 at 19:20 UTC
    For starters, your $lead fiddling would be best replaced by a simple sprintf or perhaps use Perl's automagical incrementing:
    $num = '0000'; for(1..1000) { print ++$num,"\n"; }
    will do what you want with much less fuss.

    To your main question, in order to discover whether these JPGs actually exist, why not just download them since you want to look at them anyway, and for that you should find LWP::Simple's getstore() handy. If you really don't want to download them, then just use the head() function to see what the server will offer for each filename, and act accordingly.

    One more thing, you should really make a habit of using the 3-arg version of open.

    Update: Yes of course, silly me!

    Update 2: In response to rodion's excellent question, I did a little reading and my interpretation is that, at least for HTTP 1.1, a HEAD should return exactly what a simple GET returns, minus the message body content. That implies (to the optimist in me at least) that the HEAD should not succeed if the GET would not. In fact, in section 9.4 (page 53 of RFC 2616), it notes that HEAD is often used for testing link validity. It would seem that for compliant servers at least, using HEAD should work. Corrections or clarifications are welcome!

    --
    I'd like to be able to assign to an luser

      $num = '0000'; for(1..1000) { print ++$num,"\n"; }
      Or even
      for my $num ('0000' .. '1000') { ... }
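
To illustrate the zero-padding alternatives suggested in this subthread, here is a small sketch that builds the suffixes three ways: sprintf, the magical string increment, and a string range. It uses three digits, as in the original question, and the placeholder URL from the question; the sub names are made up for this example.

```perl
use strict;
use warnings;

# Zero-pad a number to three digits with sprintf.
sub padded_sprintf {
    my ($n) = @_;
    return sprintf '%03d', $n;
}

# Magical increment: ++ on a string of digits keeps the leading zeros.
sub padded_increment {
    my ($count) = @_;
    my $num = '000';
    my @out;
    for (1 .. $count) {
        push @out, $num;
        $num++;
    }
    return @out;
}

# String range: relies on the same magical increment.
sub padded_range {
    return ('000' .. '999');
}

# Print the first three candidate URLs.
print "http://somewebsite/65081$_.jpg\n" for padded_increment(3);
```

The magical increment only applies while the variable has been used purely as a string, so the counter should not be used in arithmetic along the way.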
Re: Check links.. if they do exist, then print link to file...but how do i check the existance of a link?(validity?)
by rodion (Chaplain) on Jul 27, 2006 at 19:29 UTC
    The CPAN module LWP::UserAgent provides a method "$ua->max_size( $bytes )" which allows you to get just a little bit of content, enough to make sure the link works.

    If you set this before you issue a "$ua->get( $url )" request, you should be able to check for a "Client-Aborted" header in the response, as per the documentation.

    Setting a maximum size allows you to make sure that you could get the document, without the program having to hang around waiting for the whole transfer. If you have a successful transfer, or a "Client-Aborted" header, you know that the link works and you can quickly move on to checking the next one.

    Addendum: I just saw Albannach's suggestion for getting the header alone. Does anyone reading this know if there are cases (worth checking for) where you can get the header and can't get the content? If no one knows of any that are relevant to the OP, then just checking the header should be faster than getting a limited amount of content.
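
A hedged sketch of rodion's max_size idea. It assumes, as the LWP::UserAgent documentation describes, that a response cut short by the size limit carries a "Client-Aborted" header, and that the HTTP status of such a response is unchanged, so the status is checked first. The names describe_response and probe_link, the 512-byte default, and the 10-second timeout are assumptions for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a response object.
# Works with anything that offers is_success() and header().
sub describe_response {
    my ($res) = @_;
    return 'missing' unless $res->is_success;
    my $aborted = $res->header('Client-Aborted');
    return defined $aborted ? 'ok (truncated)' : 'ok (complete)';
}

# Hypothetical helper: fetch at most $bytes of $url and classify the result.
sub probe_link {
    my ($url, $bytes) = @_;
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new( timeout => 10 );
    $ua->max_size( $bytes || 512 );    # stop reading after this many bytes
    return describe_response( $ua->get($url) );
}
```

Calling probe_link($url) on each candidate and keeping those that do not come back as 'missing' follows the approach above; a HEAD request, as discussed elsewhere in the thread, avoids transferring any body at all.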

      I remember downloading from a picture web-site a number of pictures with almost sequentially numbered files. Every time you hit a "missing" number you got a "This picture does not exist" page, so checking for some available content would not have worked. Perhaps only checking for a header could have warned me about the "missing" numbers.
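
One way the header might reveal such a placeholder page, sketched under the assumption that the "does not exist" page is served as HTML rather than as an image: in list context, LWP::Simple's head() returns the content type as its first element, which can be compared against image/*. The names is_image_type and image_link_exists are hypothetical, and a site that serves a placeholder image would still slip through.

```perl
use strict;
use warnings;

# True if a Content-Type value names an image, e.g. "image/jpeg".
sub is_image_type {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $content_type =~ m{^image/}i ? 1 : 0;
}

# Hypothetical helper: HEAD the URL and accept it only if the
# server reports an image content type.
sub image_link_exists {
    my ($url) = @_;
    require LWP::Simple;
    my ($content_type) = LWP::Simple::head($url);    # list context
    return is_image_type($content_type);
}
```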

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
