PerlMonks  

Avoid "duplicate" fetching with LWP

by Anonymous Monk
on Apr 22, 2003 at 16:40 UTC ( #252312=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm fetching links from a webpage via LWP, LWP::UserAgent, and HTML::LinkExtor, and I've run into something I can't figure out.

How do I avoid fetching "duplicate" links, which are actually fragments at the end of a valid page in my queue? Is there a way to tell that the following three links are the same page, minus the fragment?

    http://www.foo.bar/index.html
    http://www.foo.bar/index.html#foo
    http://www.foo.bar/index.html#bar

I pass these into my @links array, and remove the dupes with the following:

    my @pri = grep { !$seen{$_}++ } @links;

The problem is that "uniqueness" is decided by string comparison, not at the URI level, so URLs that differ only in their fragment are treated as unique. I'd rather not fetch the same page 20 times because one link appears with 20 different fragments on it.

Should I split on the '#' there, and fetch everything to the left of it?

What if someone decides to put the '#' in a query string? Is that possible?

•Re: Avoid "duplicate" fetching with LWP
by merlyn (Sage) on Apr 22, 2003 at 16:52 UTC
Re: Avoid "duplicate" fetching with LWP
by hmerrill (Friar) on Apr 22, 2003 at 17:05 UTC
    I would do just what you propose: split on the '#' and fetch everything to the left. That is, split each link on '#' into a list, and then fetch on the 1st element of the list, like:

        for my $link (@links) {
            # keep only the part before the first '#'
            my @link_tokens = split /#/, $link;
            push @pri, $link_tokens[0];
        }
        # @pri still needs your %seen grep afterwards to drop the dupes

    That should handle those rare cases where someone puts a '#' sign in the query string. I think(?) you only care about the part of the link before the 1st '#' sign, right?

    HTH.
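On the "'#' in a query string" worry: under the generic URI syntax, the first '#' always starts the fragment, and a literal '#' inside a query has to be percent-encoded as %23. So cutting at the first '#' should be safe. Below is a minimal sketch of that idea combined with the %seen filter from the question. The helper name strip_fragment and the fourth test URL are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: drop everything from the first '#' to the end.
sub strip_fragment {
    my ($link) = @_;
    (my $base = $link) =~ s/#.*\z//s;
    return $base;
}

my @links = (
    'http://www.foo.bar/index.html',
    'http://www.foo.bar/index.html#foo',
    'http://www.foo.bar/index.html#bar',
    # made-up example: an encoded '#' (%23) in the query survives intact
    'http://www.foo.bar/search?q=a%23b#results',
);

# Strip fragments first, then apply the usual %seen dedupe.
my %seen;
my @pri = grep { !$seen{$_}++ } map { strip_fragment($_) } @links;

print "$_\n" for @pri;
```

With this ordering the three index.html links collapse into one entry before anything is fetched, while the query string of the search link is left alone.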
Re: Avoid "duplicate" fetching with LWP
by perlguy (Deacon) on Apr 22, 2003 at 18:08 UTC

    How about:

        use Data::Dumper;

        my @uris = qw(
            http://www.foo.bar/index.html
            http://www.foo.bar/index.html#foo
            http://www.foo.bar/index.html#bar
        );

        my %seen;
        my @unique_uris = grep !$seen{$_}++, map /^([^?#]+)/, @uris;

        print Dumper(\@unique_uris);

    That would catch everything to the left of the first # (anchor) or ? (query) character (if there is one), which I believe is what you want.

    Hope that helps.
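Since the question asks for uniqueness "at the URI level", another option is to let the URI module (installed alongside LWP) do the parsing. The sketch below is one possible approach, not something proposed in the thread: it canonicalizes each link, removes only the fragment, and keeps the query string, on the assumption that pages differing only in their query are distinct pages. The helper name page_key and the extra test URLs are hypothetical.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: a comparison key for "the same page".
sub page_key {
    my ($link) = @_;
    my $uri = URI->new($link)->canonical;   # normalizes scheme/host case etc.
    $uri->fragment(undef);                  # drop the '#...' part, if any
    return $uri->as_string;
}

my @links = (
    'http://www.foo.bar/index.html',
    'http://www.foo.bar/index.html#foo',
    'HTTP://WWW.FOO.BAR/index.html#bar',    # made-up: differs only in case
    'http://www.foo.bar/page?id=1#top',     # made-up: query is kept
    'http://www.foo.bar/page?id=2#top',
);

# Dedupe on the normalized key instead of the raw string.
my %seen;
my @pri = grep { !$seen{$_}++ } map { page_key($_) } @links;

print "$_\n" for @pri;
```

Whether to keep or drop the query string depends on the site being crawled; stripping it, as the regex above does, would treat page?id=1 and page?id=2 as the same page.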
