Checking for an existing URL

by kidd (Curate)
on Sep 29, 2002 at 14:02 UTC ( [id://201541] )

kidd has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks: I'm looking for some ideas on the best way to do this.

I'm making a script that takes a URL as input and checks for its existence. The current code I came up with is this:

#!/usr/bin/perl -w
use strict;
use CGI::Carp qw(fatalsToBrowser);
use CGI;
use LWP::UserAgent;
use HTTP::Request;

my $q   = CGI->new;
my $url = $q->param('url');

# Send the CGI header before any other output.
print $q->header('text/plain');

my $ua       = LWP::UserAgent->new;
my $request  = HTTP::Request->new(GET => $url);
my $response = $ua->request($request);

my $string = $response->content;
$string =~ s/\n//g;

if ($string =~ /404 Not Found/ or $string eq '') {
    print "$url - Doesn't exist\n";
}
else {
    print "$url - Does exist\n";
}

What I do is fetch the URL and then check either for the string "404 Not Found" in $string or for $string being empty.

The problem with this approach is that a page might contain the text "404 Not Found" as part of its content even though it actually exists.

So I thought that maybe there is some kind of header that tells you what kind of response you got... maybe a 302 or a 501, but I have no idea how to get at it.

Thanks

Replies are listed 'Best First'.
Re: Checking for an existing URL
by rob_au (Abbot) on Sep 29, 2002 at 14:11 UTC
    This is a lot easier than you think ...

    my $ua       = LWP::UserAgent->new;
    my $request  = HTTP::Request->new('GET' => $url);
    my $response = $ua->request($request);

    if ($response->is_error) {
        ...
    }
    else {
        ...
    }

    Alternatively, you could employ the is_success method for testing for successful retrieval of the passed URL. Furthermore, the actual numeric response code received can be returned with the code method. For further details on HTTP::Response methods, see the HTTP::Response man page.
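
    A minimal, self-contained sketch along those lines (the URL and the printed strings below are just placeholders for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $url = 'http://www.example.com/';    # placeholder URL
    my $ua  = LWP::UserAgent->new;

    my $response = $ua->request( HTTP::Request->new( GET => $url ) );

    # is_success covers the 2xx codes; code() returns the numeric status.
    if ( $response->is_success ) {
        printf "%s - Does exist (%d)\n", $url, $response->code;
    }
    else {
        printf "%s - Doesn't exist or is unreachable (%s)\n",
            $url, $response->status_line;
    }

    is_error is the complement for 4xx/5xx responses; 1xx and 3xx codes fall outside both, which is why checking code() directly can also be useful.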

     

    perl -e 'print+unpack("N",pack("B32","00000000000000000000000111000011")),"\n"'

      Or if you are just after 404s:

      ... if ($response->code()==404) { ...

      ...which is valid because, while the body of a 404 response (or indeed the text message in the status line) could be anything (e.g. it could be localised), the first three non-space characters of the status line must be the numeric HTTP response code.

      Unfortunately, there is a flaw in any such approach. If you want to probe for the existence of a file (or listening script), you may for example have DNS problems, a bad (or unusable) URI scheme part, a faulty proxy or redirector, problems connecting to the IP, random server problems, not to mention the possibility of a CGI script that sends a 404 response on purpose (necessary for most properly-operating error handler scripts).

      And the other side of it is that you can get 'false' positives from, e.g., Apache handlers, ErrorDocument pages and the like, including badly-behaved error handlers.

      In short, there's no easy way to do this reliably. The best option, IMO, would be to use $response->is_success(), as mentioned (implicitly) in the message preceding this, to mark 'valid' URLs and consider anything else to be undefined.
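
      A rough sketch of that 'successful or undefined' idea, wrapped in a hypothetical helper (url_exists is not a name from this thread):

      use strict;
      use warnings;
      use LWP::UserAgent;

      # Returns 1 only on a successful response; everything else (DNS errors,
      # timeouts, 4xx/5xx, bad schemes) is treated as "undefined" (undef).
      sub url_exists {
          my ($url) = @_;
          my $ua = LWP::UserAgent->new( timeout => 10 );
          my $response = $ua->get($url);
          return $response->is_success ? 1 : undef;
      }

      print url_exists('http://www.example.com/') ? "exists\n" : "unknown\n";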

Re: Checking for an existing URL
by Jenda (Abbot) on Sep 29, 2002 at 14:57 UTC

    I think you should try HEAD instead of GET. If all you want to know is whether the URL is fine, why would you download the whole document? :-)

    Another thing. $response->code() returns the HTTP status code. So you do not have to search for anything in the contents. Besides ... it's quite possible that the HTML with the "URL not found" message will NOT contain any 404 at all.

    Jenda

      I think you should try HEAD instead of GET. If all you want to know is whether the URL is fine, why would you download the whole document? :-)
      Because some servers have been shown to return an error for HEAD, but a real document for GET. So you have to do both. Try HEAD first, but if it fails, retry with GET.
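
      A minimal sketch of that HEAD-first, GET-fallback approach (placeholder URL, not code from the thread):

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $url = 'http://www.example.com/';    # placeholder URL
      my $ua  = LWP::UserAgent->new( timeout => 10 );

      # Try the cheap HEAD request first; only fall back to a full GET if
      # the server mishandles HEAD.
      my $response = $ua->head($url);
      $response    = $ua->get($url) unless $response->is_success;

      print $response->is_success
          ? "$url - Does exist\n"
          : "$url - Doesn't exist or is unreachable\n";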

      -- Randal L. Schwartz, Perl hacker

      It is very rare, but merlyn speaks the truth: some web servers are broken. There is still no reason to download the whole document. Enjoy the fruits of those who RTFM :) LWP head replacement

      poetry ;)

      update: I don't mean it's rare that merlyn speaks the truth; I mean it is rare that a web server is broken in such a manner that a HEAD request would fail like so ;)

      ____________________________________________________
      ** The Third rule of perl club is a statement of fact: pod is sexy.
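
      The 'LWP head replacement' node linked above is not reproduced here. One way to get a similar effect, namely issuing a GET but abandoning the transfer before the body is downloaded, is to pass a content callback that dies immediately; this is an assumption about the technique, not the linked node's code:

      use strict;
      use warnings;
      use LWP::UserAgent;

      my $url = 'http://www.example.com/';    # placeholder URL
      my $ua  = LWP::UserAgent->new( timeout => 10 );

      # The :content_cb callback runs on the first chunk of the body; dying
      # there aborts the download while keeping the status line and headers.
      my $response = $ua->get( $url, ':content_cb' => sub { die "enough\n" } );

      print $response->status_line, "\n";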
