PerlMonks  

Re^2: Building a Spidering Application

by pemungkah (Priest)
on Jul 08, 2012 at 17:48 UTC


in reply to Re: Building a Spidering Application
in thread Building a Spidering Application

One tweak I might suggest: use a %seen_url hash to cache the URLs that have already been visited. The values don't matter; you just add each URL as a key so you can do next if $seen_url{$next_url} to skip links you've already followed.

If you do that, a single @queue array (push URLs from the current page onto the back, shift the next one to process off the front) works just fine, since anything you've already seen gets discarded.
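
In miniature, the idiom looks like this (a bare-bones sketch; extract_links() is a stand-in for whatever pulls links off a page, and the full script below fleshes the loop out properly):

    my %seen_url;
    my @queue = ($start_url);
    while (@queue) {
        my $next_url = shift @queue;
        next if $seen_url{$next_url}++;          # mark and skip repeats in one step
        push @queue, extract_links($next_url);   # stand-in for the real link extraction
    }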

It might also be a good idea not to follow links that point off the site ("search this site" custom searches and the like); a quick check of the host via URI helps with that.
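
Something along these lines works (a sketch; $local_site holds the host of the starting URL, as in the script below):

    use URI;
    my $u = URI->new($next_url);
    # Relative URLs have no scheme or host, and mailto:/ftp: links have no
    # host we care about, so skip anything that isn't on-site HTTP.
    next unless $u->scheme && $u->scheme =~ /^https?$/;
    next if $u->host ne $local_site;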

A crawler like this can still get caught by (for instance) CGI-generated calendar links, of which there's an effectively infinite supply. Adding support for trapping those is left out here.
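
If you wanted to add it, one possible guard (my suggestion here, not part of the script below) is to cap how many query-string variants of any one path you'll follow; %variants_seen would be declared outside the crawl loop:

    my $u = URI->new($next_url);       # assumes $next_url is already absolute
    if (defined $u->query) {
        my $key = $u->host . $u->path;
        next if ++$variants_seen{$key} > 25;   # arbitrary cutoff; tune to taste
    }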

I've used the URI::ImpliedBase module to handle sites that use relative links rather than absolute ones; it automatically converts relative URLs to absolute ones, based on the last absolute URL it saw. In the process of writing this script, I exposed a bug in URI::ImpliedBase that I needed to fix: it changed the implied base for any absolute URI, so a mailto: broke every relative URL that followed it. (Edit: fixed in the 0.08 release, just uploaded to CPAN. The lines that can be removed are marked in the code below. URI::ImpliedBase now has an accepted_schemes list that it uses to decide whether or not to reset the base URI.)
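
Going from the behavior just described, usage looks roughly like this (a sketch; the exact output depends on the module's resolution rules):

    use URI::ImpliedBase;
    my $first  = URI::ImpliedBase->new('http://example.com/docs/');  # absolute: sets the base
    my $second = URI::ImpliedBase->new('intro.html');                # relative: resolved against it
    print $second->as_string, "\n";   # should print http://example.com/docs/intro.html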

    use strict;
    use warnings;
    use WWW::Mechanize;
    use URI::ImpliedBase;
    use URI;

    my %visited;
    my @queue;
    my $start_url = shift or die "No starting URL supplied";
    my $extractor = URI::ImpliedBase->new($start_url);
    my $local_site = $extractor->host;
    my $mech = WWW::Mechanize->new(autocheck => 0);

    push @queue, $start_url;
    while (@queue) {
        my $next_url = shift @queue;
        next unless $next_url;
        print STDERR $next_url, "\n";
        next if $visited{$next_url};

        ## Not needed with version 0.08 of URI::ImpliedBase; remove if you have it
        my $scheme_checker = URI->new($next_url);
        next if $scheme_checker->scheme and $scheme_checker->scheme !~ /http/;
        ## end of removable code

        $extractor = URI::ImpliedBase->new($next_url);
        next if $extractor->host ne $local_site;

        $mech->get($extractor->as_string);
        next unless $mech->success;

        # Unseen, on this site, and we can read it.
        # Save that we saw it, grab links from it, process this page.
        $visited{$next_url}++;
        push @queue, map { $_->url } $mech->links;
        process($next_url, $mech->content);
    }

    sub process {
        my ($url, $page_content) = @_;
        # Do as you like with the page content here...
        print $page_content;
    }
I tested this on pemungkah.com, which is heavily self-linked with relative URLs and points to a lot of external sites as well. The script crawled it quite nicely.


Re^3: Building a Spidering Application
by roboticus (Canon) on Jul 08, 2012 at 23:01 UTC

    pemungkah:

    Yes, those are very good improvements. I wish I'd thought of them when I was originally replying!

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re^3: Building a Spidering Application
by Your Mother (Canon) on Jul 09, 2012 at 15:25 UTC

    You don't need URI::ImpliedBase. The WWW::Mechanize::Link objects that Mech uses/returns have a method, url_abs, that covers this. Of course, then it's up to the spider to decide whether query params are relevant, duplicates, or no-ops and, in the hacky world of HTML4.9, whether fragments are meaningful (though only a JS-aware Mech would be able to care here).
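
    A sketch of that approach (url_abs returns an absolute URI object, so scheme and host checks work directly on it):

        for my $link ($mech->links) {
            my $abs = $link->url_abs;              # absolute URI, base handled by Mech
            next unless $abs->scheme =~ /^https?$/;
            push @queue, $abs->as_string;
        }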

      Thanks! I didn't know about that one. The last time I wrote a spider was maybe six years ago, and as I recall it wasn't there then - though I may have just missed it at the time. Still handy for LWP folks, I guess.

        I'm pretty sure you're right about it not being there at that time.
