PerlMonks  

Normalizing URLs

by Anonymous Monk
on Jul 21, 2005 at 14:29 UTC (#476849)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Is there a Perl module or similar that lets me normalize a URL? I want to do what is described here http://www.xml.com/pub/a/2004/08/18/pilgrim.html without reinventing the wheel. CPAN does not seem to have anything. Cheers.

Re: Normalizing URLs
by ikegami (Pope) on Jul 21, 2005 at 14:43 UTC

    Keep in mind the point of the article: It's impossible to do URL normalizing well. They've even missed two items that need normalizing (but they might be in the linked spec):

    1) The order of the arguments in a GET:
    .../script.cgi?a=b&c=d vs
    .../script.cgi?c=d&a=b

    2) The domain name:
    example.com vs
    example.com. vs
    EXAMPLE.COM

    Oh and IP addresses too:
    10.0.0.1 vs
    0x0A000001 vs
    167772161
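The host variants listed above can be folded together with a little core Perl. This is only a minimal sketch: `normalize_host` is a made-up name, and it handles just the forms shown here (case, a trailing dot, and a host written as a single hex or decimal number). Dotted addresses with hex or octal parts are deliberately left alone.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of any CPAN module.
sub normalize_host {
    my ($host) = @_;

    $host = lc $host;        # EXAMPLE.COM  -> example.com
    $host =~ s/\.\z//;       # example.com. -> example.com

    # A host given as one number: 0x0a000001 or 167772161.
    my $n;
    if ( $host =~ /\A0x[0-9a-f]{1,8}\z/ ) {
        $n = hex $host;
    }
    elsif ( $host =~ /\A(?:0|[1-9][0-9]{0,9})\z/ ) {
        $n = $host + 0;
    }

    # Turn a 32-bit value into dotted-quad form.
    if ( defined $n && $n <= 0xFFFFFFFF ) {
        return join '.', unpack 'C4', pack 'N', $n;
    }

    return $host;
}

print normalize_host($_), "\n"
    for qw(EXAMPLE.COM. 0x0A000001 167772161 10.0.0.1);
```

The three spellings of the address all come out as 10.0.0.1, and the three spellings of the domain all come out as example.com.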

      Some background - I am developing a scraper and it needs to know if it has scraped the page already - hence the normalization. It needn't be perfect, just good enough for all but the most arcane cases. It only needs to work with HTTP.

        What about using the "last_modified" method in LWP? Keep track of it locally. When you access the page again, check the time it was modified and skip it if that time is not newer than what you've saved.

        This idea is from "Spidering Hacks" (hack #16).
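The bookkeeping for that approach might look roughly like this. It is a sketch under assumptions: `needs_refetch` and `%last_seen` are invented names, the timestamp is passed in as epoch seconds so the example runs without a network, and nothing is saved to disk. In a real scraper the value would come from something like `$ua->head($url)->last_modified` with LWP::UserAgent.

```perl
use strict;
use warnings;

my %last_seen;    # URL => Last-Modified time (epoch seconds) at last fetch

# Hypothetical helper: true if the page should be fetched (again).
sub needs_refetch {
    my ( $url, $last_modified ) = @_;

    # No Last-Modified header: we cannot tell, so fetch to be safe.
    return 1 unless defined $last_modified;

    my $prev = $last_seen{$url};
    return 0 if defined $prev && $last_modified <= $prev;

    $last_seen{$url} = $last_modified;
    return 1;
}

print needs_refetch( "http://example.com/", 1000 ) ? "fetch\n" : "skip\n";
print needs_refetch( "http://example.com/", 1000 ) ? "fetch\n" : "skip\n";
print needs_refetch( "http://example.com/", 2000 ) ? "fetch\n" : "skip\n";
```

Note that the hash is keyed on the URL string, so two spellings of the same URL are still tracked separately unless they are normalized first.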

Re: Normalizing URLs
by derby (Abbot) on Jul 21, 2005 at 15:00 UTC

    I haven't tried it but wouldn't URI and its canonical and eq methods work for you?

    Update: Looks like URI will not normalize query params. Something like this should work (note, I did not check all cases - feel free to fix!)

    #!/usr/bin/perl -w
    use strict;
    use URI;

    my $u1 = URI->new("http://www.perl.com/cgi-bin/script.cgi?a=b&c=d");
    my $u2 = URI->new("http://www.perl.com/cgi-bin/script.cgi?c=d&a=b");

    my $u1c = $u1->canonical;
    my $u2c = $u2->canonical;

    if ( urlsEqual( $u1c, $u2c ) ) {
        print "equal\n";
    } else {
        print "not equal\n";
    }

    # Compare two URI objects, ignoring the order of the query parameters.
    sub urlsEqual {
        my ( $u1, $u2 ) = @_;

        # First try URI eq
        return 1 if $u1->eq($u2);

        # nope ... adjust query (on copies, so the caller's
        # objects are left untouched)
        ( $u1, $u2 ) = ( $u1->clone, $u2->clone );
        for my $u ( $u1, $u2 ) {
            my $q = $u->query;
            next unless defined $q;
            $u->query( join( '&', sort( split( /[&;]/, $q ) ) ) );
        }

        return $u1->eq($u2);
    }

    -derby

      From what I saw, URI

      • Lowercases the scheme.
      • Lowercases the domain name. (1)
      • Removes the port if it's the default. (2)
      • Removes port fields consisting of just ':'. (3)
      • Adds trailing '/' if no path or query is specified. (6, partial)

      • Doesn't do (4), (5), (7) and (8), but easy to do.
      • Doesn't do (9) and (10), but might not be possible.
      • Doesn't set the path to '/' if no path is specified and a query is specified. (6, partial)
      • Doesn't normalize IP addresses into dotted form.
      • Doesn't remove the trailing '.' from domain names, if any.
      • Doesn't touch the query.
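Two of the gaps in the list above (no '/' path when a query is present, and the trailing '.' on the domain name) can be patched on top of `canonical`. The following is only a sketch: `normalize_http_url` is a hypothetical name, it assumes the URI module from CPAN is installed, and it leaves IP addresses and the query alone.

```perl
use strict;
use warnings;
use URI;

# Hypothetical wrapper around URI->canonical for http/https URLs.
sub normalize_http_url {
    my ($str) = @_;

    my $u = URI->new($str)->canonical;

    # Only touch http(s); anything else is returned as URI left it.
    my $scheme = $u->scheme;
    return $u->as_string
        unless defined $scheme && $scheme =~ /\Ahttps?\z/;

    # Remove a trailing '.' from the domain name.
    my $host = $u->host;
    if ( defined $host && $host =~ s/\.\z// ) {
        $u->host($host);
    }

    # Set the path to '/' when it is empty, even if a query is present.
    $u->path('/') if $u->path eq '';

    return $u->as_string;
}

print normalize_http_url("HTTP://EXAMPLE.COM.:80?a=b"), "\n";
```

The result could then be handed to a query-sorting comparison if the order of parameters should also be ignored.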
      You can't expect a module called 'URI' to normalize CGI parameters. http://foo.com/bar?a=b&c=d and http://foo.com/bar?c=d&a=b are two different URIs. The fact that the two different URIs are treated the same by the receiving server is outside of the URI realm.
Re: Normalizing URLs
by gam3 (Curate) on Jul 21, 2005 at 17:58 UTC
    There can be problems in normalizing URIs. I have found several sites that need a `|' in the URL even though the %nn is put in its place by URI.

    -- gam3
    A picture is worth a thousand words, but takes 200K.
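One way to live with the `|' problem described above is to keep the escaped form as the "have I seen this?" key and undo the escaping only on the string actually sent to such a site. A minimal sketch, where `restore_pipes` is a made-up helper and deciding which sites need it is left to the caller:

```perl
use strict;
use warnings;

# Hypothetical helper: turn %7C (either case) back into a literal '|'.
sub restore_pipes {
    my ($url) = @_;
    $url =~ s/%7C/|/gi;
    return $url;
}

print restore_pipes("http://example.com/a%7Cb?x=1%7c2"), "\n";
```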
