PerlMonks  

Normalizing URLs

by Anonymous Monk
on Jul 21, 2005 at 14:29 UTC (#476849=perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Is there a Perl module or similar that lets me normalize a URL? I want to do what is described here http://www.xml.com/pub/a/2004/08/18/pilgrim.html without reinventing the wheel. CPAN does not seem to have anything. Cheers.

Replies are listed 'Best First'.
Re: Normalizing URLs
by ikegami (Pope) on Jul 21, 2005 at 14:43 UTC

    Keep in mind the point of the article: It's impossible to do URL normalizing well. They've even missed two items that need normalizing (but they might be in the linked spec):

    1) The order of the arguments in a GET:
    .../script.cgi?a=b&c=d vs
    .../script.cgi?c=d&a=b

    2) The domain name:
    example.com vs
    example.com. vs
    EXAMPLE.COM

    Oh and IP addresses too:
    10.0.0.1 vs
    0x0A000001 vs
    167772161
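The host spellings listed above can be folded into one form with core Perl alone. This is only a minimal sketch: normalize_host is a hypothetical name, and it assumes that a host consisting of a single hex or decimal number is meant as a 32-bit IPv4 address. Octal forms, IPv6 and internationalized names are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce the host spellings above to one form.
sub normalize_host {
    my ($host) = @_;

    $host = lc $host;      # EXAMPLE.COM  -> example.com
    $host =~ s/\.\z//;     # example.com. -> example.com

    # Assumption: a lone hex (0x...) or decimal number is a 32-bit IPv4 address.
    my $n;
    if ( $host =~ /\A0x([0-9a-f]{1,8})\z/ ) {
        $n = hex $1;
    }
    elsif ( $host =~ /\A([1-9][0-9]{0,9})\z/ && $1 <= 0xFFFFFFFF ) {
        $n = $1;
    }

    # Split the number into its four octets and join them in dotted form.
    $host = join '.', unpack( 'C4', pack( 'N', $n ) ) if defined $n;

    return $host;
}

print normalize_host($_), "\n" for qw(EXAMPLE.COM. 10.0.0.1 0x0A000001 167772161);
```

Whether it is worth doing this at all depends on how often a scraper actually meets such spellings; for the "good enough" case, lowercasing and dropping the trailing dot may already cover nearly everything.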

      Some background - I am developing a scraper and it needs to know if it has scraped the page already - hence the normalization. It needn't be perfect, just good enough for all but the most arcane cases. It only needs to work with HTTP.

        What about using the "last_modified" method in LWP? Keep track of it locally. When you access the page again, check the time it was modified and skip it if that time is not newer than what you've saved.

        This idea is from "Spidering Hacks" (hack #16).
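A rough sketch of that bookkeeping follows. The needs_scraping name and the %last_seen hash are made up for illustration; the only thing assumed about the response is that it has a last_modified method returning epoch seconds, which is what the HTTP::Response objects returned by LWP::UserAgent provide.

```perl
use strict;
use warnings;

my %last_seen;    # URL => Last-Modified time (epoch seconds) from the previous visit

# Hypothetical helper. $response is anything with a last_modified method,
# such as the HTTP::Response returned by LWP::UserAgent.
sub needs_scraping {
    my ( $url, $response ) = @_;

    # undef when the server sent no Last-Modified header
    my $modified = $response->last_modified;

    # Nothing to compare against: scrape to be safe.
    return 1 unless defined $modified;

    # Skip the page if it is not newer than what was saved.
    my $previous = $last_seen{$url};
    return 0 if defined $previous && $modified <= $previous;

    $last_seen{$url} = $modified;
    return 1;
}
```

With LWP this might be called as needs_scraping( $url, $ua->head($url) ), so that the body is only fetched when the page has changed. Note that this detects a changed page; it does not tell you that two differently spelled URLs are the same page.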

Re: Normalizing URLs
by derby (Abbot) on Jul 21, 2005 at 15:00 UTC

    I haven't tried it, but wouldn't URI and its canonical and eq methods work for you?

    Update: Looks like URI will not normalize query params. Something like this should work (note, I did not check all cases - feel free to fix!)

        #!/usr/bin/perl
        use strict;
        use warnings;
        use URI;

        my $u1 = URI->new("http://www.perl.com/cgi-bin/script.cgi?a=b&c=d");
        my $u2 = URI->new("http://www.perl.com/cgi-bin/script.cgi?c=d&a=b");

        if ( urlsEqual( $u1->canonical, $u2->canonical ) ) {
            print "equal\n";
        }
        else {
            print "not equal\n";
        }

        sub urlsEqual {
            my ( $u1, $u2 ) = @_;

            # First try URI eq
            return 1 if $u1->eq($u2);

            # nope ... sort the query arguments and compare again.
            # Work on clones so the caller's objects are left untouched.
            ( $u1, $u2 ) = ( $u1->clone, $u2->clone );
            for my $u ( $u1, $u2 ) {
                my $q = $u->query;
                $u->query( join( '&', sort( split( /[&;]/, $q ) ) ) ) if defined $q;
            }

            return $u1->eq($u2);
        }

    -derby

      From what I saw, URI

      • Lowercases the scheme.
      • Lowercases the domain name. (1)
      • Removes the port if it's the default. (2)
      • Removes port fields consisting of just ':'. (3)
      • Adds trailing '/' if no path or query is specified. (6, partial)

      • Doesn't do (4), (5), (7) and (8), but easy to do.
      • Doesn't do (9) and (10), but might not be possible.
      • Doesn't set the path to '/' if no path is specified and a query is specified. (6, partial)
      • Doesn't normalize IP addresses into dotted form.
      • Doesn't remove the trailing '.' from domain names, if any.
      • Doesn't touch the query.

      You can't expect a module called 'URI' to normalize CGI parameters. http://foo.com/bar?a=b&c=d and http://foo.com/bar?c=d&a=b are two different URIs. The fact that the receiving server treats the two URIs the same is outside the URI realm.
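For a scraper that only needs a "have I seen this?" key, some of the gaps listed above can be filled in on top of canonical. The following is a sketch under assumptions, not a complete normalizer: url_key is a hypothetical name, it expects http URLs only, and sorting the query is only safe if the target servers really ignore argument order.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: returns a string usable as a key in a "seen" hash.
sub url_key {
    my ($url) = @_;

    # canonical() may return the original object, so work on a clone.
    my $u = URI->new($url)->canonical->clone;

    # Remove the trailing '.' from the domain name, if any.
    my $host = $u->host;
    $u->host($host) if defined $host && $host =~ s/\.\z//;

    # Set the path to '/' when it is empty, even if a query is present.
    $u->path('/') unless length $u->path;

    # Assumption: the server does not care about the order of query arguments.
    my $query = $u->query;
    $u->query( join( '&', sort( split( /[&;]/, $query ) ) ) ) if defined $query;

    return $u->as_string;
}

my %seen;
for my $url ( 'http://Example.COM.:80?c=d&a=b', 'http://example.com/?a=b&c=d' ) {
    my $status = $seen{ url_key($url) }++ ? 'already scraped' : 'new';
    print "$url: $status\n";
}
```

Normalizing IP address forms is left out here, as are items (4), (5) and (7) to (10) from the article.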
Re: Normalizing URLs
by gam3 (Curate) on Jul 21, 2005 at 17:58 UTC
    There can be problems in normalizing URIs. I have found several sites that need a literal `|' in the URL, even though URI puts the %nn escape in its place.

    -- gam3
    A picture is worth a thousand words, but takes 200K.
