Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Local::SiteRobot - a simple web crawling module

by rob_au (Abbot)
on Nov 24, 2001 at 17:09 UTC ( #127264=sourcecode: print w/replies, xml ) Need Help??
Category: Web Stuff
Author/Contact Info rob_au
Description: Earlier this month, George_Sherston posted a node, where he submitted code for a site indexer and search engine - I took this code and decided to build upon it for my own site and in evaluating it and other options available, I found HTML::Index. This code offered the ability to create site indexes for both local and remote files (through the use of WWW::SimpleRobot by the same author) - This ability for indexing based upon URL was important to me as a great deal of content on the site is dynamic in nature. This was where my journey hit a stumbling block ... WWW::SimpleRobot didn't work!

So, I set about writing my own simplified robot code which had one and only one function - return a list of crawled URLs from a start URL address.

#!/usr/bin/perl -w use Local::SiteRobot; use strict; my $robot = Local::SiteRobot->new( DEPTH => 10, FOLLOW_REGEX => '^http://www.cowsnet.com', URLS => [ 'http://www.cowsnet.com.au' ] ); my @pages = $robot->crawl; print STDOUT $_, "\n" foreach @pages;

The code I feel is quite self explanatory - /msg me if you have any questions on usage.

package Local::SiteRobot;

use HTML::LinkExtor;
use LWP::Simple;
use URI;
use strict;

sub new {
    my $class = shift;
    my %options = (
        DEPTH           =>  undef,
        FOLLOW_REGEX    =>  '',
        URLS            =>  [],
        VERBOSE         =>  0
    );
    my %args = (%options, @_);
    foreach (keys %args) {
        die "Local::SiteRobot->new : Unknown argument option - $_" unl
+ess exists $options{$_};
    };
    my $self = bless \%args, (ref($class) || $class);
    $self->_verbose("Local::SiteRobot->new : Created new Local::SiteRo
+bot object");
    return $self;
}

sub crawl {
    my $self = shift;
    return undef unless @{$self->{URLS}};
    my @pages;
    foreach my $url (@{$self->{URLS}}) {
        my $uri = URI->new($url);
        next unless $uri->scheme;
        next unless $uri->scheme eq 'http';
        $self->_verbose("Local::SiteRobot->crawl : Crawling from URL "
+, $uri->canonical->as_string);
        push (@pages, $self->_crawl($uri->canonical->as_string));
        $self->_verbose("Local::SiteRobot->crawl : Crawling from URL "
+, $uri->canonical->as_string, " returned ", scalar(@pages), " pages")
+;
    }
    return @pages;
}

sub _crawl {
    my ($self, $url, $depth) = @_;
    my @pages;
    my $uri = URI->new($url);
    $self->_verbose("Local::SiteRobot->_crawl : GET ", $uri->canonical
+->as_string);
    my $html = get($uri->canonical->as_string);
    return unless $html;
    return $uri->canonical->as_string if ((defined $self->{DEPTH}) && 
+($self->{DEPTH} == ($depth || 0)));
    ${$self->{pages}}{$uri->canonical->as_string} = 1;
    push (@pages, $uri->canonical->as_string);
    my $linkextor = HTML::LinkExtor->new(undef, $uri->canonical->as_st
+ring);
    $linkextor->parse($html);
    foreach my $link ($linkextor->links) {
        my ($tag, %attr) = @{$link};
        next unless ($tag eq 'a');
        next unless (defined $attr{'href'});
        my $href = URI->new($attr{'href'});
        next unless ($href->canonical->as_string =~ /$self->{FOLLOW_RE
+GEX}/);
        next if exists ${$self->{pages}}{$href};
        ${$self->{pages}}{$href} = 1;
        push (@pages, $self->_crawl($href, ($depth || 0) + 1));
    }
    return @pages;
}

sub _verbose {
    my $self = shift;
    return unless $self->{VERBOSE};
    print STDERR @_, "\n";
}

1;


__END__
Replies are listed 'Best First'.
Re: Local::SiteRobot - a simple web crawling module
by merlyn (Sage) on Nov 24, 2001 at 21:26 UTC
    This was where my journey hit a stumbling block ... WWW::SimpleRobot didn't work!
    Rather than partially reinvent a wheel, did you try to work with the author to figure out why it "didn't work"? And can you please explain what "didn't work" for you? I've used the module successfully, as have many hundreds of others.

    "forking" a codebase should be considered carefully, otherwise the open source users as a community all lose.

    -- Randal L. Schwartz, Perl hacker

      The problems which I encountered with WWW::SimpleRobot related to the traverse function not returning traversal results via the $object->pages and $object->urls methods - The problem appeared to relate to the shift method by which the author was iterating through the constructed queue undefining the @pages results array before it was returned at the end of the function.

      Better than just reporting this to the author, I have submitted a fix patch which corrects this behaviour by pushing results into a separate array to the queue.

      115a116 > my @results; 150a152 > push (@results, $page); 165,166c167,168 < $self->{pages} = \@pages; < $self->{urls} = [ map { $_->{url} } @pages ]; --- > $self->{pages} = \@results; > $self->{urls} = [ map { $_->{url} } @results ];

      Note that I never meant for my little piece of code to be viewed as a code fork from WWW::SimpleRobot but rather just an additional available option.

       

      Ooohhh, Rob no beer function well without!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://127264]
help
Chatterbox?
[Corion]: Do you want to launch a script and keep the command prompt/console window open?
[Corion]: Do you want to wait for a key press before closing the window?
[LanX]: I want the command line in the history
[tye]: -Mouse
[Corion]: Option a) would mean launching cmd.exe /k c:\path\to\ batchfile- launching-perl- script.cmd. Option b) would be to add pause as the last line of said batch file.
[LanX]: First day after holidays ... and already stressed by the fact that colleagues changed stuff without communication ... apparently I'm the only one trying to fight entropy
[Corion]: LanX: The command is always in the history if you typed it in before. If you didn't type the command into the command line, it will not be there. I think there is doskey which can stuff command lines into the history
LanX damns the cult of CB ;-)
LanX WTF WTF WTF
[LanX]: please forget my last 3 posts

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (13)
As of 2017-03-27 15:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Should Pluto Get Its Planethood Back?



    Results (320 votes). Check out past polls.