
Avoid "duplicate" fetching with LWP

by Anonymous Monk
on Apr 22, 2003 at 16:40 UTC ( #252312=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm fetching links from a webpage via LWP, LWP::UserAgent, and HTML::LinkExtor, and I've run into something I can't figure out.

How do I avoid fetching "duplicate" links, which are actually fragments at the end of a valid page in my queue? Is there a way to tell that the following three links are the same page, minus the fragment?
I pass these into my @links array, and remove the dupes with the following:
my @pri = grep { !$seen{$_}++ } @links;

The problem is that the "uniqueness" is checked at the string level, not at the URI level, so two links that differ only in their fragment are treated as distinct URLs. I'd much rather not fetch the same page 20 times because it is linked 20 times with 20 different fragments.

Should I split on the '#' there, and fetch everything to the left of it?

What if someone decides to put the '#' in a query string? Is that possible?

Replies are listed 'Best First'.
•Re: Avoid "duplicate" fetching with LWP
by merlyn (Sage) on Apr 22, 2003 at 16:52 UTC
Re: Avoid "duplicate" fetching with LWP
by hmerrill (Friar) on Apr 22, 2003 at 17:05 UTC
    I would do just what you propose: split on the '#' and fetch everything to the left of it. That is, split each link on '#' into a list and keep only the first element, like:
    for my $link (@links) {
        my @link_tokens = split /#/, $link;   # element 0 is everything before the first '#'
        push @pri, $link_tokens[0];
    }
    That should handle those rare cases where someone puts a '#' sign in the query string. I think(?) you only care about the part of the link before the 1st '#' sign, right?
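As a rough sketch of how the fragment stripping and the original poster's %seen de-duplication might fit together in one pass: the links below are hypothetical (example.com is a placeholder host), and the names @links, @pri and %seen are simply carried over from the posts above. In a URI the first '#' begins the fragment, and a literal '#' inside a query string has to be written as %23, so cutting at the first '#' should leave any query string intact.

```perl
use strict;
use warnings;

# Hypothetical links for illustration; example.com is a placeholder host.
my @links = (
    'http://example.com/page.html',
    'http://example.com/page.html#intro',
    'http://example.com/page.html#details',
    'http://example.com/search?q=perl#results',
    'http://example.com/other.html',
);

my %seen;
my @pri;
for my $link (@links) {
    # Drop the fragment: the first '#' and everything after it.
    (my $page = $link) =~ s/#.*\z//s;

    # A bare "#top" style link has no page part left, so skip it.
    next if $page eq '';

    # Keep only the first occurrence of each fragment-free URL.
    push @pri, $page unless $seen{$page}++;
}

print "$_\n" for @pri;
```

With this input the three page.html variants collapse into a single entry, while the search link keeps its ?q=perl query string.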

Re: Avoid "duplicate" fetching with LWP
by perlguy (Deacon) on Apr 22, 2003 at 18:08 UTC

    How about:

    use Data::Dumper;

    my @uris = qw( );
    my %seen;
    my @unique_uris = grep !$seen{$_}++, map /^([^?#]+)/, @uris;

    print Dumper(\@unique_uris);

    That would capture everything to the left of the first # (anchor) or ? (query) character (if there is one), which I believe is what you want.
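One thing that may be worth checking before adopting this pattern is what happens to links that differ only in their query string. The comparison below is a small sketch with hypothetical URLs (example.com and the id values are placeholders); the second pattern, which stops only at '#', is a variation added for contrast and is not part of the reply above.

```perl
use strict;
use warnings;

# Hypothetical links; example.com and the id values are placeholders.
my @uris = (
    'http://example.com/view?id=1#top',
    'http://example.com/view?id=2',
    'http://example.com/view?id=2#comments',
);

# Pattern from the reply above: capture stops at the first '?' or '#'.
my %seen_a;
my @no_query = grep { !$seen_a{$_}++ } map { /^([^?#]+)/ } @uris;

# Variation (an assumption, for contrast): capture stops only at '#',
# so the query string stays part of the de-duplication key.
my %seen_b;
my @with_query = grep { !$seen_b{$_}++ } map { /^([^#]+)/ } @uris;

print "stripping query and fragment: @no_query\n";
print "stripping fragment only:      @with_query\n";
```

Here the first pattern leaves one URL and the second leaves two, so which one fits depends on whether the site serves different pages for different query strings. The URI module, which LWP already depends on, can also remove a fragment via its fragment() method if a regex feels too loose.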

    Hope that helps.
