Re: Normalizing URLs

by ikegami (Pope)
on Jul 21, 2005 at 14:43 UTC


in reply to Normalizing URLs

Keep in mind the point of the article: it's impossible to normalize URLs well. The article even misses two items that need normalizing (though they might be covered in the linked spec):

1) The order of the arguments in a GET:
.../script.cgi?a=b&c=d vs
.../script.cgi?c=d&a=b

2) The domain name:
example.com vs
example.com. vs
EXAMPLE.COM

Oh and IP addresses too:
10.0.0.1 vs
0x0A000001 vs
167772161
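A rough sketch of how those three cases could be folded together, written in Python rather than Perl. The names normalize_host and normalize_url are made up for illustration, and the sketch assumes http only, that the server does not care about argument order, and that dropping port 80 and any fragment is acceptable. It is "good enough" normalization, not a complete one.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_host(host):
    """Lowercase the host, drop a trailing dot, and rewrite
    hex or decimal integer IPv4 forms as dotted quads."""
    host = host.rstrip(".").lower()
    try:
        if host.startswith("0x"):
            return str(ipaddress.IPv4Address(int(host, 16)))
        if host.isdigit():
            return str(ipaddress.IPv4Address(int(host)))
    except ValueError:
        # Not a valid 32-bit address; treat it as an ordinary name.
        pass
    return host


def normalize_url(url):
    """Return a comparison key for an http URL (assumption: http only)."""
    parts = urlsplit(url)
    host = normalize_host(parts.hostname or "")
    # Assumption: port 80 is the default and can be omitted.
    netloc = host if parts.port in (None, 80) else f"{host}:{parts.port}"
    # Sort the arguments so a=b&c=d and c=d&a=b compare equal.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs)
    # The fragment is dropped: it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), netloc, parts.path or "/", query, ""))
```

With this, http://EXAMPLE.COM./script.cgi?c=d&a=b and http://example.com/script.cgi?a=b&c=d produce the same key, as do http://0x0A000001/ and http://167772161/. A script that treats argument order as significant would be wrongly merged, which is exactly the kind of case that makes perfect normalization impossible.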

Re^2: Normalizing URLs
by Anonymous Monk on Jul 21, 2005 at 14:46 UTC
    Some background - I am developing a scraper and it needs to know if it has scraped the page already - hence the normalization. It needn't be perfect, just good enough for all but the most arcane. Needs only work with http.

      What about using the "last_modified" method in LWP? Keep track of it locally. When you access the page again, check the time it was modified and skip it if that time is not newer than what you've saved.

      This idea is from "Spidering Hacks" (hack #16).
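The comparison step of that idea might look something like the following. This is a Python sketch, not the LWP code from the book: is_unchanged is a hypothetical helper, and it assumes you have already fetched the page's Last-Modified header and saved the value from the previous visit as a string.

```python
from email.utils import parsedate_to_datetime


def is_unchanged(last_modified_header, saved_header):
    """Return True if the page's Last-Modified value is not newer than
    the one saved on the previous visit. A missing or unparseable
    header counts as 'changed', so the page is fetched again."""
    if not last_modified_header or not saved_header:
        return False
    try:
        current = parsedate_to_datetime(last_modified_header)
        saved = parsedate_to_datetime(saved_header)
        return current <= saved
    except (TypeError, ValueError):
        # Bad date format, or naive vs. aware datetimes that cannot be compared.
        return False
```

Erring toward "changed" means a server that sends no Last-Modified header just costs an extra fetch rather than a skipped page.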
