PerlMonks
Web Crawler

by mkurtis (Scribe)
on Mar 09, 2004 at 01:24 UTC ( #334989=sourcecode )
Category: Web Stuff
Author/Contact Info mkurtis
Description: This is a web crawler that grabs links from the web addresses in file1.txt and writes them to file2.txt, then visits all of those links and writes the links it finds to file3.txt, and so on. It saves the content of each page in numbered files in a separate folder.
#!/usr/bin/perl -w

use strict;
use diagnostics;
use LWP::RobotUA;
use URI::URL;
use HTML::SimpleLinkExtor;

my $a = 0;    # counter for the numbered page files
my $base;     # base URL for the link extractor (left unset; links resolved below)

# Read URLs from file$u.txt, save each page, and append the links found
# on those pages to file$u+1.txt for the next pass.
for ( my $u = 1 ; $u < 1_000_000_000 ; $u++ ) {
    open( FILE1, "</var/www/links/file$u.txt" ) or last;
    while (<FILE1>) {
        chomp;    # strip the newline so it cannot corrupt resolved URLs
        my $ua = LWP::RobotUA->new( 'theusefulbot', 'bot@theusefulnet.com' );
        $ua->delay( 10 / 600 );
        my $content = $ua->get($_)->content;

        # Save the page body under a numbered file name.
        open( OUTPUT, ">/var/www/data/$a.txt" )
          or die "Cannot write /var/www/data/$a.txt: $!";
        print OUTPUT $content;
        close(OUTPUT);

        # Extract the <a href> links and append them to the next link file
        # (">>" so links from earlier pages are not clobbered).
        my $extor = HTML::SimpleLinkExtor->new($base);
        $extor->parse($content);
        my @links = $extor->a;
        open( FILE2, ">>/var/www/links/file" . ( $u + 1 ) . ".txt" )
          or die "Cannot append to link file: $!";
        foreach my $link (@links) {
            print FILE2 url($link)->abs($_), "\n";
        }
        close(FILE2);
        $a++;
    }
    close(FILE1);
}
UPDATE: New working code below. Thanks to Kappa for making it check itself against a list of visited URLs.

#!/usr/bin/perl -w
use strict;

use LWP::RobotUA;
use HTML::SimpleLinkExtor;
use URI::URL;

use vars qw/$http_ua $link_extractor/;

sub crawl {
    my @queue = @_;
    my %visited;
    my $a = 0;
    my $base;
    while ( my $url = shift @queue ) {
        next if $visited{$url};

        my $content = $http_ua->get($url)->content;

        open FILE, '>', ++$a . '.txt' or die "Cannot write $a.txt: $!";
        print FILE "$url\n";
        print FILE $content;
        close FILE;

        print qq{Downloaded: "$url"\n};

        push @queue, do {
            my $link_extractor = HTML::SimpleLinkExtor->new($url);
            $link_extractor->parse($content);
            $link_extractor->a;

        };
        $visited{$url} = 1;
    }
}

$http_ua = LWP::RobotUA->new( 'theusefulbot', 'bot@theusefulnet.com' );
$http_ua->delay( 10 / 6000 );

crawl(@ARGV);
Replies are listed 'Best First'.
Re: Web Crawler
by matija (Priest) on Mar 09, 2004 at 08:09 UTC
    Our way, here at the monastery, is to select "comment on", and put comments where everybody can see them (and mod them up or down, as appropriate).

    I think you are getting into needless complication with the files. When I write webcrawlers, I usually use an array to hold URLs I have yet to download (push them in at the end, shift them off the front end), and a hash to tell me which URLs I've already pushed into the array (NOT the ones I've already downloaded: why have multiple copies of the same URL in the array?).
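That in-memory approach can be sketched as follows; note that URLs are marked as seen when they are *queued*, not when they are downloaded, so the queue never holds duplicates. `fetch_links()` here is a hypothetical stub standing in for the real LWP download and HTML::SimpleLinkExtor extraction steps.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;                                  # URLs already pushed into the queue
my @queue = ('http://example.com/');       # seed URL(s)
$seen{$_} = 1 for @queue;

while ( my $url = shift @queue ) {
    for my $link ( fetch_links($url) ) {
        next if $seen{$link}++;            # skip URLs already queued
        push @queue, $link;
    }
}

# Hypothetical stub: a real crawler would fetch $url and return the
# extracted links here.
sub fetch_links {
    my ($url) = @_;
    return ();
}
```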

    Of course, your first question would be: what happens when that array and that hash become really, really big (which can happen quite easily on the internet)? And the answer is: when that happens, you can either use DB_File to tie both the array and the hash, or you can use a real database.
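A minimal sketch of the DB_File route, assuming Berkeley DB is available; the file names `visited.db` and `queue.db` are illustrative. Once tied, the hash and array are used exactly like ordinary in-memory structures, but they live on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# Tie the visited hash and the URL queue to disk files so they can
# grow beyond available memory.
my %visited;
tie %visited, 'DB_File', 'visited.db', O_CREAT | O_RDWR, 0666, $DB_HASH
  or die "Cannot tie visited.db: $!";

my @queue;
tie @queue, 'DB_File', 'queue.db', O_CREAT | O_RDWR, 0666, $DB_RECNO
  or die "Cannot tie queue.db: $!";

# Used exactly like ordinary Perl structures.
push @queue, 'http://example.com/';
my $url = shift @queue;
$visited{$url} = 1;
```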
