PerlMonks  

Avoid "duplicate" fetching with LWP

by Anonymous Monk
on Apr 22, 2003 at 16:40 UTC ( [id://252312]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm fetching links from a webpage via LWP, LWP::UserAgent, and HTML::LinkExtor, and I've run into something I can't figure out.

How do I avoid fetching "duplicate" links, which are actually fragments at the end of a valid page in my queue? Is there a way to tell that the following three links are the same page, minus the fragment?

    http://www.foo.bar/index.html
    http://www.foo.bar/index.html#foo
    http://www.foo.bar/index.html#bar

I pass these into my @links array, and remove the dupes with the following:

    my @pri = grep { !$seen{$_}++ } @links;

The problem is that the "uniqueness" is checked at the string level, not at the URI level, so URLs that differ only in their fragment are seen as unique. I'd much rather not fetch the same page 20 times for a link which appears once with 20 different fragments on it.

Should I split on the '#' there, and fetch everything to the left of it?

What if someone decides to put the '#' in a query string? Is that possible?

Replies are listed 'Best First'.
•Re: Avoid "duplicate" fetching with LWP
by merlyn (Sage) on Apr 22, 2003 at 16:52 UTC
Re: Avoid "duplicate" fetching with LWP
by hmerrill (Friar) on Apr 22, 2003 at 17:05 UTC
    I would do just what you propose: split on the '#' and fetch everything to the left of it. That is, split each link on '#' into a list and keep the first element, like:

        for my $link (@links) {
            my @link_tokens = split /#/, $link;
            push @pri, $link_tokens[0];
        }

    That should also handle those rare cases where someone puts a '#' sign in the query string. I think(?) you only care about the part of the link before the first '#' sign, right?

    HTH.
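
    The split-on-'#' idea can be combined with the original %seen filter so that the fragment is removed before the uniqueness check. What follows is only a minimal sketch using core Perl: the helper name strip_fragment and the fourth sample URL are made up for illustration. It relies on the URI syntax rule that the fragment starts at the first '#', so a literal '#' inside a query string has to be percent-encoded as %23 and is left alone here.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the fragment, i.e. everything from the
# first '#' to the end of the string.
sub strip_fragment {
    my ($url) = @_;
    (my $base = $url) =~ s/#.*\z//s;
    return $base;
}

my @links = qw(
    http://www.foo.bar/index.html
    http://www.foo.bar/index.html#foo
    http://www.foo.bar/index.html#bar
    http://www.foo.bar/search?q=a%23b#results
);

# Strip fragments first, then keep only the first occurrence of each URL.
my %seen;
my @pri = grep { !$seen{$_}++ } map { strip_fragment($_) } @links;

print "$_\n" for @pri;
```

    The three index.html links collapse into one entry, and the encoded %23 in the last URL stays part of its query string.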
Re: Avoid "duplicate" fetching with LWP
by perlguy (Deacon) on Apr 22, 2003 at 18:08 UTC

    How about:

        use Data::Dumper;

        my @uris = qw(
            http://www.foo.bar/index.html
            http://www.foo.bar/index.html#foo
            http://www.foo.bar/index.html#bar
        );

        my %seen;
        my @unique_uris = grep !$seen{$_}++, map /^([^?#]+)/, @uris;

        print Dumper(\@unique_uris);

    That keeps everything to the left of the first # (fragment) or ? (query) character, if there is one, which I believe is what you want.

    Hope that helps.
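
    One caveat with cutting at '?' as well: two links that differ only in their query string (say, page=1 and page=2 of the same script) would be treated as the same page. If that is not wanted, the URI module that LWP already depends on can remove just the fragment. This is a sketch rather than a definitive answer: the helper name page_key and the sample URLs are invented for illustration, and it assumes that canonical()'s normalization (such as lowercasing the host name) is acceptable for the crawl.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: normalized URL with the fragment removed,
# suitable as a key for the %seen hash.
sub page_key {
    my ($url) = @_;
    my $uri = URI->new($url)->canonical;   # e.g. lowercases the host name
    $uri->fragment(undef);                 # undef removes the fragment
    return $uri->as_string;
}

my @links = qw(
    http://www.foo.bar/index.html
    http://www.foo.bar/index.html#foo
    http://WWW.FOO.BAR/index.html#bar
    http://www.foo.bar/index.cgi?page=1#top
    http://www.foo.bar/index.cgi?page=2#top
);

# Deduplicate on the fragment-free form; query strings are preserved.
my %seen;
my @unique = grep { !$seen{$_}++ } map { page_key($_) } @links;

print "$_\n" for @unique;
```

    Here the three index.html variants collapse into one entry, while the two index.cgi pages stay distinct because their query strings differ.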
