PerlMonks  

Checking "incomplete" URLs

by nop (Hermit)
on Feb 18, 2002 at 22:37 UTC (#146275)
nop has asked for the wisdom of the Perl Monks concerning the following question:

I have a simple program to check the validity of links before submitting them to search engines (code below).

This code declares that "incomplete" URLs, like http://www.mysite.com/mydir, are no good, when in fact they work just fine in a browser (and when submitted to search engines). The "problem" with these URLs, I think, is that the full path all the way down to the file isn't explicitly specified.

My question is how do I get LWP useragent to act like a browser and find the default page in a directory?

thanks

nop

package MyUA;
use base qw(LWP::UserAgent);
use strict;
use CGI qw/:standard/;

sub redirect_ok {1};

sub new {
    my $class = shift;
    my $self = new LWP::UserAgent;
    bless($self, $class);
    return $self;
}

sub validURL {
    my ($self, $url) = @_;
    my $req = new HTTP::Request POST => $url;
    my $res = $self->request($req);
    my $content = $res->content;
    return 0 unless $res->is_success;
    return 0 if $content =~ /the page you have requested cannot be found/i;
    return 1;
}

1;

Re: Checking "incomplete" URLs
by rob_au (Abbot) on Feb 18, 2002 at 23:54 UTC
    This is fairly straightforward to fix. Try changing your validURL subroutine to read thus:

    sub validURL {
        my ($self, $url) = @_;
        my $req = new HTTP::Request HEAD => $url;
        my $res = $self->request($req);
        my $content = $res->content;
        return 0 unless $res->is_success;
        return 0 if $content =~ /the page you have requested cannot be found/i;
        return 1;
    }

    Note that I have changed the request method from POST to HEAD. The POST method will not be allowed for most URLs (thereby generating your false-negative results), and while this could be changed to a GET request, the HEAD request method will be more successful for all "valid" URLs, irrespective of the preferred request method.

     

    perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'

Re: Checking "incomplete" URLs
by BlueLines (Hermit) on Feb 19, 2002 at 02:53 UTC

    My question is how do I get LWP useragent to act like a browser and find the default page in a directory?

    It has nothing to do with your browser, and everything to do with your web server. I tested your example on a site I had control of (running apache). Here's what happened:
    [jon@valium jon]$ telnet divisionbyzero.com 80
    Trying 168.103.109.84...
    Connected to divisionbyzero.com.
    Escape character is '^]'.
    GET /decss HTTP/1.0

    HTTP/1.1 301 Moved Permanently
    Date: Tue, 19 Feb 2002 02:47:50 GMT
    Server: Apache/1.3.22 (Unix) (Red-Hat/Linux) mod_ssl/2.8.5 OpenSSL/0.9.6b mod_perl/1.24_01
    Location: http://www.divisionbyzero.com/decss/
    Connection: close
    Content-Type: text/html; charset=iso-8859-1

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML><HEAD>
    <TITLE>301 Moved Permanently</TITLE>
    </HEAD><BODY>
    <H1>Moved Permanently</H1>
    The document has moved <A HREF="http://www.divisionbyzero.com/decss/">here</A>.<P>
    <HR>
    <ADDRESS>Apache/1.3.22 Server at www.divisionbyzero.com Port 80</ADDRESS>
    </BODY></HTML>
    Connection closed by foreign host.
    The web server sent me a 301 since /decss wasn't an actual file but a directory. My web browser followed that redirect automatically, which is what browsers are supposed to do when the HTTP method used is GET or HEAD. I suspect your troubles are caused by the POST method: a client is explicitly forbidden from following a redirect of a POST without confirmation from the user.
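
    To see why an unfollowed redirect reads as a failure in the original check, it can help to build the same kind of 301 response by hand. This is a minimal offline sketch using HTTP::Response from libwww-perl; the URL is copied from the telnet session above, and nothing is actually fetched.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built stand-in for the 301 that Apache sent for /decss (no network involved).
my $res = HTTP::Response->new(
    301, 'Moved Permanently',
    [ Location => 'http://www.divisionbyzero.com/decss/' ],
);

# A 3xx response is not a "success", so `return 0 unless $res->is_success`
# rejects the URL whenever the user agent declines to follow the redirect.
my $success  = $res->is_success  ? 1 : 0;
my $redirect = $res->is_redirect ? 1 : 0;
my $target   = $res->header('Location');

print "is_success:  $success\n";
print "is_redirect: $redirect\n";
print "Location:    $target\n";
```

    With GET or HEAD, LWP::UserAgent follows the Location header itself, and the check then sees the final response instead of the 301.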

    BlueLines

    Disclaimer: This post may contain inaccurate information, be habit forming, cause atomic warfare between peaceful countries, speed up male pattern baldness, interfere with your cable reception, exile you from certain third world countries, ruin your marriage, and generally spoil your day. No batteries included, no strings attached, your mileage may vary.
      Hurrah! GET (vs. POST) solved it -- Many thanks, BlueLines! ++
      sub validURL {
          my ($self, $url) = @_;
          my $req = new HTTP::Request GET => $url;
          my $res = $self->request($req);
          my $content = $res->content;
          return 0 if $content =~ /the page you have requested cannot be found/i;
          return 0 unless $content =~ /\S/i;
          return 1;
      }
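
      One thing to watch in this version: the is_success test is gone, so an error page whose wording differs from the one pattern would still count as valid. A sketch of a variant that keeps both checks follows. response_ok is a hypothetical helper name, and it is split out from the fetch so the decision can be tried on hand-built responses without touching the network.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: decide whether a response (after any redirects have
# been followed) looks like a real page.
sub response_ok {
    my ($res) = @_;
    return 0 unless $res->is_success;    # rejects 3xx/4xx/5xx
    my $content = $res->content;
    return 0 unless $content =~ /\S/;    # empty or whitespace-only body
    return 0 if $content =~ /the page you have requested cannot be found/i;
    return 1;
}

# Offline checks against hand-built responses.
my $good     = HTTP::Response->new(200, 'OK', [], '<html>hello</html>');
my $missing  = HTTP::Response->new(404, 'Not Found', [], 'no such page');
my $soft_404 = HTTP::Response->new(200, 'OK', [],
    'The page you have requested cannot be found.');

print join(' ', map { response_ok($_) } $good, $missing, $soft_404), "\n";
```

      Inside validURL this could be called as return response_ok($self->request($req)); with a GET request.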
      By default, LWP::UserAgent automatically follows redirects for any request except a POST. The redirect_ok() method controls this behavior:
      $ua->redirect_ok
          This method is called by request() before it tries to do any redirects. It should return a true value if a redirect is allowed to be performed. Subclasses might want to override this. The default implementation will return FALSE for POST request and TRUE for all others.
      Recently I had to write a script which posted a form on a remote site, and then checked the text of the resulting page to make sure the post succeeded. Unfortunately, there was a redirect to that page.

      First I tried making a subclass with a new redirect_ok() that always returned 1. Unfortunately, LWP::UserAgent used a POST request for the redirect; the remote server returned a 405 error. I ended up writing a redirect_ok() which replaced the POST request object in @_ with a new one that did a GET instead. Ugly, but it worked!

        You could upgrade to the latest libwww and just use the requests_redirectable method from LWP::UserAgent:

        $ua->requests_redirectable( );            # to read
        $ua->requests_redirectable( \@requests ); # to set
            This reads or sets the object's list of request names that "$ua->redirect_ok(...)" will allow redirection for. By default, this is "['GET', 'HEAD']", as per RFC 2068. To change to include 'POST', consider:

            push @{ $ua->requests_redirectable }, 'POST';
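
        As a sketch of how that could stand in for a redirect_ok override: libwww versions that provide this method also accept requests_redirectable as a constructor option. The snippet below only builds the user agent and inspects its setting; it makes no request.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Allow POST to be redirected in addition to the default GET and HEAD.
my $ua = LWP::UserAgent->new(
    requests_redirectable => [ 'GET', 'HEAD', 'POST' ],
);

my @allowed = @{ $ua->requests_redirectable };
print "redirectable: @allowed\n";
```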

        --
        Ilya Martynov (http://martynov.org/)

Re: Checking "incomplete" URLs
by Anonymous Monk on Feb 19, 2002 at 04:30 UTC
    Finding the default page is done by the server, not the browser. The client sends whatever it wants, and the server decides what to send back.
Re: Checking "incomplete" URLs
by erikharrison (Deacon) on Feb 20, 2002 at 05:03 UTC

    When you request a directory on a webserver, the server gets to decide what to send you: usually index.html, but not necessarily. This is straight out of lwpcook.pod:

    "If you just want to check if a document is present (i.e. the URL is valid) try to run code that looks like this:

    use LWP::Simple;

    if (head($url)) {
        # ok document exists
    }

    . . ."

    . . . which is the "canonical" way to make sure a url is valid.
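
    Wrapped as a function, that recipe might look like the sketch below. url_exists is a hypothetical name; head() comes from LWP::Simple and returns a false value when the request does not succeed. Some servers answer HEAD differently from GET, so treat this as a quick existence check, not a guarantee.

```perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Hypothetical wrapper around the lwpcook recipe: true if a HEAD request
# for the URL succeeds, false otherwise.
sub url_exists {
    my ($url) = @_;
    return head($url) ? 1 : 0;
}

# Example call with a URL given on the command line, if any.
if (@ARGV) {
    print url_exists($ARGV[0]) ? "ok\n" : "not ok\n";
}
```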

    Cheers,
    Erik
