http://www.perlmonks.org?node_id=335679

mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a web spider, and I'm currently trying to make it check itself against a file so it does not visit the same page twice. Here is what I have (perltidied, too :)):
#!/usr/bin/perl -w
use strict;
use diagnostics;
use LWP::RobotUA;
use URI::URL;

#use HTML::Parser ();
use HTML::SimpleLinkExtor;

my $a = 0;
my $links;
my $visited;
my $base;
my $u;

for ( $u = 1 ; $u < 1000000000 ; $u++ ) {
    open( FILE1, "</var/www/links/file$u.txt" ) || die;
    while (<FILE1>) {
        my $ua = LWP::RobotUA->new( 'theusefulbot', 'bot@theusefulnet.com' );

        #my $p = HTML::Parser->new();
        $ua->delay( 10 / 6000 );
        my $content = $ua->get($_)->content;

        #my $text = $p->parse($content)->parse;
        open( VISITED, ">>/var/www/links/visited.txt" ) || die;
        print VISITED "$_\n";
        close(VISITED);
        open( VISITED, "</var/www/links/visited.txt" ) || die;

        my $extor = HTML::SimpleLinkExtor->new($base);
        $extor->parse($content);
        my @links = $extor->a;

        $u++;
        open( FILE2, ">/var/www/links/file$u.txt" ) || die;

        foreach $links (@links) {
            my @visited = <VISITED>;
            foreach $visited (@visited) {
                if ( $visited eq $links ) {
                    print "Duplicate found";
                }
                else {
                    open( OUTPUT, ">/var/www/data/$a.txt" ) || die;
                    print OUTPUT "$_\n\n";
                    print OUTPUT "$content";
                    close(OUTPUT);
                    print FILE2 url("$links")->abs("$_");
                    print FILE2 "\n";
                }
            }
        }
        $a++;
        $u--;
    }
    close(FILE1);
    close(FILE2);
    close(VISITED);
    print "File #: $a\n";
}
This still lets duplicate files through. I know people have told me to use an array, but that would get rather large, so I'm just using a file. If you know exactly how to do it with an array, that would be fine too; so far the only advice I've gotten is "use shift".

Thanks

Re: Cutting Out Previously Visited Web Pages in A Web Spider
by kappa (Chaplain) on Mar 11, 2004 at 13:09 UTC
    Uh. You wanna keep two lists: one full of URLs queued for crawling and the other with those you successfully visited (this one will be searched on each iteration, so make it a hash). So the logic is:
    sub crawl {
        my @queue = @_;
        my %visited;
        while ( my $url = shift @queue ) {
            next if $visited{$url};
            my $content = $http_ua->get($url)->content;
            # do useful things with $content
            push @queue, $link_extractor->links($content);
            $visited{$url} = 1;
        }
    }

    That's all. When size and efficiency really start to matter, you can look into migrating the data to something like Cache::Cache or Berkeley DB.
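    For instance, a minimal sketch of the Berkeley DB route, using DB_File to tie %visited to a file (visited.db is just an example name):

    use DB_File;

    # Tie %visited to an on-disk hash so it persists between runs and
    # doesn't have to fit in memory.
    tie my %visited, 'DB_File', 'visited.db'
        or die "Cannot tie visited.db: $!";

    $visited{'http://www.example.com/'} = 1;
    print "seen it\n" if $visited{'http://www.example.com/'};

    untie %visited;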

      Define $http_ua and $link_extractor and the above code will work.
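      For example (just a sketch: LinkExtractor is a hypothetical wrapper whose links($content) method matches the call in the crawl() sub above, and the user-agent details are the ones from the original post):

      use LWP::RobotUA;
      use HTML::SimpleLinkExtor;
      use vars qw/$http_ua $link_extractor/;

      # A hypothetical wrapper so that $link_extractor->links($content)
      # works as written in crawl() above.
      package LinkExtractor;

      sub new { bless {}, shift }

      sub links {
          my ( $self, $content ) = @_;
          my $extor = HTML::SimpleLinkExtor->new();
          $extor->parse($content);
          return $extor->a;
      }

      package main;

      $http_ua        = LWP::RobotUA->new( 'theusefulbot', 'bot@theusefulnet.com' );
      $link_extractor = LinkExtractor->new();

      crawl('http://www.example.com/');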
        But where exactly do I put that? Which portions of my code does it replace?

        Thanks kappa

      I'm sorry, I don't understand: where exactly do I place your code? I'm not sure how to work it into the crawler.

      Thanks for your post

Re: Cutting Out Previously Visited Web Pages in A Web Spider
by BazB (Priest) on Mar 14, 2004 at 23:09 UTC

    Here's my pretty simple try (without using files and just for a single page). Adjust to taste:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use LWP::RobotUA;
    use HTML::SimpleLinkExtor;

    sub grab_links {
        my ( $ua, $url ) = @_;
        my @links;

        my $response = $ua->get($url);
        if ( $response->is_success ) {
            my $extor   = HTML::SimpleLinkExtor->new();
            my $content = $response->content;
            $extor->parse($content);
            @links = $extor->a;    # get the <a> links. Check docs - these are relative paths.
        }
        else {
            die $response->status_line;
        }
        return @links;
    }

    my $visit = $ARGV[0];

    my $ua = LWP::RobotUA->new( 'my-robot/0.1', 'me@foo.com' );    # Change this to suit.
    $ua->delay( 0.1 );    # delay is in minutes: 0.1 = 6 seconds between requests

    my @links = grab_links( $ua, $visit );

    my %uniq;
    foreach (@links) {
        $uniq{$_}++;
    }

    print "Visited: ", $visit, " found these links:\n", join( "\n", keys %uniq ), "\n";

    Update: this code was put here after talking to mkurtis in the CB. It appears to do most of the things mkurtis is after, so I posted it for future reference.
    Most of the code was taken straight from the docs for HTML::SimpleLinkExtor, LWP::RobotUA and LWP::UserAgent.

    This is the first time I've used any of those modules and it was quite cool :-)


    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.

      Thanks kappa and BazB. Would you by any chance know how to use URI::URL to make all the links absolute? I believe the syntax in this case would be

      url("links from extor")->abs("$url");
      But I don't see how to put this in the do loop.
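      Something like this is what I'm imagining, using the @links and $visit names from BazB's example (url() is exported by URI::URL):

      use URI::URL;

      foreach my $link (@links) {
          # resolve each (possibly relative) link against the page just visited
          my $abs = url($link)->abs($visit)->as_string;
          $uniq{$abs}++;
      }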

      Because CPAN search isn't working, here's the URI link: URI::URL

      thanks a bunch

        There's no need to use URI::URL. Change the HTML::SimpleLinkExtor code to return absolute instead of relative paths.

        There is a comment in the code in my last reply noting that it currently returns relative paths and pointing you at the module docs. RTFM :-)
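        In other words, inside grab_links, something along these lines (going by the HTML::SimpleLinkExtor docs, the constructor takes an optional base URL that relative links are resolved against):

        my $extor = HTML::SimpleLinkExtor->new($url);   # pass the page's own URL as the base
        $extor->parse($content);
        my @links = $extor->a;                          # these come back absolute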


        If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
        That way everyone learns.

Re: Cutting Out Previously Visited Web Pages in A Web Spider
by kappa (Chaplain) on Mar 14, 2004 at 22:35 UTC
    mkurtis, I tried to mimic the behaviour you seem to expect. Try this.

    Updated: HTML::SimpleLinkExtor returns links only from the first parse, so a fresh extractor is created for each page.

    #!/usr/bin/perl -w

    use strict;
    use LWP::RobotUA;
    use HTML::SimpleLinkExtor;

    use vars qw/$http_ua $link_extractor/;

    sub crawl {
        my @queue = @_;
        my %visited;
        my $a = 0;

        while ( my $url = shift @queue ) {
            next if $visited{$url};

            my $content = $http_ua->get($url)->content;

            open FILE, '>' . ++$a . '.txt';
            print FILE $content;
            close FILE;

            print qq{Downloaded: "$url"\n};

            push @queue, do {
                my $link_extractor = new HTML::SimpleLinkExtor;
                $link_extractor->parse($content);
                $link_extractor->a
            };

            $visited{$url} = 1;
        }
    }

    $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';

    crawl(@ARGV);
Re: Cutting Out Previously Visited Web Pages in A Web Spider
by TomDLux (Vicar) on Mar 11, 2004 at 02:24 UTC

    Have you looked at the FAQ?

    Have you used the Search?

    The answers are there, probably under the keyword duplicate, probably in reference to arrays and hashes.

    Come back after you have thoroughly searched those, and ask again if there are elements you do not understand.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      I searched and found How do I avoid inserting duplicate numbers into an Access table?, and read perlfaq4, which is the same as perldoc -q duplicate. I guess what I don't understand is why my way doesn't work: I am trying to take all the links I have visited and, if any of them are the same as $links, not print them to the file I take to-be-visited URLs out of. I also have no clue how to do it differently. I don't think a database approach will work; I already tried. I also read perldoc -f splice, but I don't see how I would know what position the element is at.

      Thanks

        If you are saving info on each page you find to a file, couldn't you just check whether the file already exists before writing to it?

        I didn't really understand your code, but you could save each URL in a hash. Then just check whether the URL already exists in your hash before reading the page again. The hash would only get as big as the number of sites you spider.
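        Something like this, as a rough sketch:

        my %visited;

        foreach my $url (@links) {
            next if $visited{$url};    # already seen it, skip the fetch
            $visited{$url} = 1;
            # ... fetch and process $url here ...
        }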


        ___________
        Eric Hodges