PerlMonks  

Using URI::URL to determine resources within a site

by sutch (Curate)
on Aug 04, 2004 at 03:39 UTC (#379893=perlquestion)
sutch has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to know my options for determining whether a URL is part of a website, using URI::URL or any other Perl module.

For example, if the "site" that I'm interested in is http://www.comp.leeds.ac.uk/Perl/, and my script encounters the URL http://www.comp.leeds.ac.uk/Perl/filehandling.html, is there some method that returns a true value signifying that the URL is part of the site? A regular expression seems to be the way to go, but I'd guess that someone has already done this and published a module that handles the tricky stuff, such as case and whatever else I haven't yet thought about.

Re: Using URI::URL to determine resources within a site
by mifflin (Curate) on Aug 04, 2004 at 05:24 UTC
    Is this what you are looking for?
    When the following is run...
        use URI;
        $u1 = URI->new('http://www.comp.leeds.ac.uk/Perl/');
        $u2 = URI->new('http://www.comp.leeds.ac.uk/Perl/filehandling.html');
        print "u1.host=",$u1->host(),"\n";
        print "u2.host=",$u2->host(),"\n";
        if ($u1->host() eq $u2->host()) {
            print "URL's host are the same\n";
        } else {
            print "oops\n";
        }
    It produces the following output...
        u1.host=www.comp.leeds.ac.uk
        u2.host=www.comp.leeds.ac.uk
        URL's host are the same
    It doesn't use URI::URL, but I'm not sure that you need to.

    Update:
    The URI perldocs have a section on parsing URIs with a regex. Here's an excerpt from that section...

        PARSING URIs WITH REGEXP

        As an alternative to this module, the following (official) regular
        expression can be used to decode a URI:

          my($scheme, $authority, $path, $query, $fragment) =
          $uri =~ m|(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?|;

        The URI::Split module provides the function uri_split() as a readable
        alternative.
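
For comparison, here is a minimal sketch of the uri_split() alternative mentioned in that excerpt. The query string and fragment on the example URL are made up for illustration and do not come from the original question.

```perl
use strict;
use warnings;
use URI::Split qw(uri_split);

# Example URL from the question, with an illustrative (assumed) query and fragment.
my $uri = 'http://www.comp.leeds.ac.uk/Perl/filehandling.html?x=1#top';

# uri_split returns the five components in this order.
my ($scheme, $authority, $path, $query, $fragment) = uri_split($uri);

print "scheme=$scheme\n";
print "authority=$authority\n";
print "path=$path\n";
print "query=$query\n";
print "fragment=$fragment\n";
```

Components that are absent from the input come back undefined, so real code should check for that before printing or comparing them.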


    Update:
    To get around the problems of case you could use the URI canonical method like...
        use URI;
        $u1 = URI->new('http://www.comp.Leeds.ac.uk/Perl/')->canonical();
        $u2 = URI->new('http://www.comp.leeds.ac.uk/Perl/filehandling.html')->canonical();
        print "u1.host=",$u1->host(),"\n";
        print "u2.host=",$u2->host(),"\n";
        if ($u1->host() eq $u2->host()) {
            print "URL's host are the same\n";
        } else {
            print "oops\n";
        }
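
A host comparison alone would also accept URLs elsewhere on the same server, for instance a hypothetical /Python/ directory. The sketch below tightens the check under one possible definition of "part of the site": same scheme, same host and port, and a path at or below the base path. The name is_within_site is invented for this example; it is not something the URI module provides.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of URI): returns 1 if $url lies at or
# below $base, comparing scheme, host:port and path prefix.
sub is_within_site {
    my ($base, $url) = @_;
    my $base_uri = URI->new($base)->canonical;
    my $url_uri  = URI->new($url)->canonical;

    # Only URLs with a scheme and a host (http, https, ftp, ...) can match.
    for my $uri ($base_uri, $url_uri) {
        return 0 unless defined($uri->scheme)
                     && $uri->can('host')
                     && defined($uri->host);
    }

    return 0 unless $base_uri->scheme    eq $url_uri->scheme;
    return 0 unless $base_uri->host_port eq $url_uri->host_port;

    my $base_path = $base_uri->path;
    return 1 if $url_uri->path eq $base_path;

    # Compare on a directory boundary so /Perl does not match /Perlish/.
    $base_path .= '/' unless $base_path =~ m{/\z};
    return index($url_uri->path, $base_path) == 0 ? 1 : 0;
}

my $site = 'http://www.comp.leeds.ac.uk/Perl/';
for my $url ('http://www.comp.Leeds.ac.uk/Perl/filehandling.html',
             'http://www.comp.leeds.ac.uk/Python/') {
    printf "%-55s %s\n", $url,
        is_within_site($site, $url) ? 'in site' : 'not in site';
}
```

The path comparison here is deliberately case-sensitive, because canonical() lowercases the scheme and host but leaves the case of the path alone.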
Re: Using URI::URL to determine resources within a site
by Gilimanjaro (Hermit) on Aug 04, 2004 at 11:00 UTC

    A regular expression can do this, and can handle the case for you:

    $match = $url =~ /^\Q$base\E/i;

    That's case insensitive, and \Q...\E stops the dots and other metacharacters in $base from being treated as regex syntax.

    But there really is no general solution: host names are case-insensitive, but whether case matters in the path depends on the webserver implementation, so you can't be sure whether you should check for it.

    Other tricky stuff can be a double slash after the host part, or anywhere in the URL. Most webservers, when mapping requests directly to files on the filesystem, will find a file despite superfluous slashes, but if everything under a certain location is passed to, for instance, a custom mod_perl handler, it's completely up to the handler to decide whether the request is OK...

    It sort of depends on your exact definition of 'site' and thus 'part of the site' I suppose...
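
As a rough illustration of the prefix-match idea together with the double-slash caveat, the sketch below collapses repeated slashes in the path before matching. Both helper names (collapse_slashes and url_has_prefix) are invented for this example, and treating "//" as equivalent to "/" is an assumption that only holds for servers that behave that way.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of slashes in the path only,
# leaving the "//" after the scheme and any query or fragment untouched.
sub collapse_slashes {
    my ($url) = @_;
    my ($prefix, $rest) = $url =~ m{\A([^:/?#]+://[^/?#]*)(.*)\z}s
        or return $url;    # no scheme://authority part: leave as-is
    my ($path, $tail) = $rest =~ m{\A([^?#]*)(.*)\z}s;
    $path =~ s{/{2,}}{/}g;
    return $prefix . $path . $tail;
}

# Hypothetical helper: case-insensitive prefix match after normalising slashes.
# \Q...\E quotes regex metacharacters (such as the dots) in $base.
sub url_has_prefix {
    my ($base, $url) = @_;
    $base = collapse_slashes($base);
    $url  = collapse_slashes($url);
    return $url =~ /\A\Q$base\E/i ? 1 : 0;
}

print url_has_prefix(
    'http://www.comp.leeds.ac.uk/Perl/',
    'http://www.comp.leeds.ac.uk//Perl/filehandling.html',
) ? "match\n" : "no match\n";
```

Note that the /i flag makes the whole comparison case-insensitive, path included, which is exactly the server-dependent behaviour discussed above; drop the flag and lowercase only the host if the server distinguishes case in paths.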
