http://www.perlmonks.org?node_id=335679

mkurtis has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a web spider, and I'm currently trying to make it check itself against a file so it does not visit the same page twice. Here is what I have (perltidied, too :)):
#!/usr/bin/perl -w
use strict;
use diagnostics;
use LWP::RobotUA;
use URI::URL;

#use HTML::Parser ();
use HTML::SimpleLinkExtor;

my $a = 0;
my $links;
my $visited;
my $base;
my $u;

for ( $u = 1 ; $u < 1000000000 ; $u++ ) {
    open( FILE1, "</var/www/links/file$u.txt" ) || die;
    while (<FILE1>) {
        my $ua = LWP::RobotUA->new( 'theusefulbot', 'bot@theusefulnet.com' );

        #my $p = HTML::Parser->new();
        $ua->delay( 10 / 6000 );
        my $content = $ua->get($_)->content;

        #my $text = $p->parse($content)->parse;
        open( VISITED, ">>/var/www/links/visited.txt" ) || die;
        print VISITED "$_\n";
        close(VISITED);
        open( VISITED, "</var/www/links/visited.txt" ) || die;

        my $extor = HTML::SimpleLinkExtor->new($base);
        $extor->parse($content);
        my @links = $extor->a;

        $u++;
        open( FILE2, ">/var/www/links/file$u.txt" ) || die;

        foreach $links (@links) {
            my @visited = <VISITED>;
            foreach $visited (@visited) {
                if ( $visited eq $links ) {
                    print "Duplicate found";
                }
                else {
                    open( OUTPUT, ">/var/www/data/$a.txt" ) || die;
                    print OUTPUT "$_\n\n";
                    print OUTPUT "$content";
                    close(OUTPUT);
                    print FILE2 url("$links")->abs("$_");
                    print FILE2 "\n";
                }
            }
        }
        $a++;
        $u--;
    }
    close(FILE1);
    close(FILE2);
    close(VISITED);
    print "File #: $a\n";
}
This still lets duplicate files through. I know people have told me to use an array, but that would get rather large, so I'm just using a file. If you know exactly how to do it with an array, that would be fine too; so far the only advice I've gotten is "use shift".

Thanks

Re: Cutting Out Previously Visited Web Pages in A Web Spider
by kappa (Chaplain) on Mar 11, 2004 at 13:09 UTC
    Uh. You wanna keep two lists: one full of URLs queued for crawling and the other with those you successfully visited (this one will be searched on each iteration, so make it a hash). So the logic is:
    sub crawl {
        my @queue = @_;
        my %visited;
        while ( my $url = shift @queue ) {
            next if $visited{$url};
            my $content = $http_ua->get($url)->content;
            # do useful things with $content
            push @queue, $link_extractor->links($content);
            $visited{$url} = 1;
        }
    }

    That's all. When size and efficiency really start to matter, you can look into migrating the data to something like Cache::Cache or Berkeley DB.
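    For instance, a minimal sketch of the Berkeley DB route, using DB_File to tie %visited to a file (visited.db is just an example name):

    use DB_File;

    # Tie %visited to an on-disk hash so it persists between runs and
    # doesn't have to fit in memory.
    tie my %visited, 'DB_File', 'visited.db'
        or die "Cannot tie visited.db: $!";

    $visited{'http://www.example.com/'} = 1;
    print "seen it\n" if $visited{'http://www.example.com/'};

    untie %visited;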

      Define $http_ua and $link_extractor and the above code will work.
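      For example (just a sketch: LinkExtractor is a hypothetical wrapper whose links($content) method matches the call in the crawl() sub above, and the user-agent details are the ones from the original post):

      use LWP::RobotUA;
      use HTML::SimpleLinkExtor;
      use vars qw/$http_ua $link_extractor/;

      # A hypothetical wrapper so that $link_extractor->links($content)
      # works as written in crawl() above.
      package LinkExtractor;

      sub new { bless {}, shift }

      sub links {
          my ( $self, $content ) = @_;
          my $extor = HTML::SimpleLinkExtor->new();
          $extor->parse($content);
          return $extor->a;
      }

      package main;

      $http_ua        = LWP::RobotUA->new( 'theusefulbot', 'bot@theusefulnet.com' );
      $link_extractor = LinkExtractor->new();

      crawl('http://www.example.com/');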
        But where exactly do I put that? Which portions of my code does it replace?

        Thanks kappa

      I'm sorry, I don't understand: where exactly do I place your code? I'm not sure how to work it into the crawler.

      Thanks for your post

Re: Cutting Out Previously Visited Web Pages in A Web Spider
by BazB (Priest) on Mar 14, 2004 at 23:09 UTC

    Here's my pretty simple try (without using files and just for a single page). Adjust to taste:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use LWP::RobotUA;
    use HTML::SimpleLinkExtor;

    sub grab_links {
        my ( $ua, $url ) = @_;
        my @links;

        my $response = $ua->get($url);
        if ( $response->is_success ) {
            my $extor   = HTML::SimpleLinkExtor->new();
            my $content = $response->content;
            $extor->parse($content);
            @links = $extor->a;    # get the <a> links. Check docs - these are relative paths.
        }
        else {
            die $response->status_line;
        }
        return @links;
    }

    my $visit = $ARGV[0];

    my $ua = LWP::RobotUA->new( 'my-robot/0.1', 'me@foo.com' );    # Change this to suit.
    $ua->delay( 0.1 );    # delay is in minutes: 0.1 = 6 seconds between requests

    my @links = grab_links( $ua, $visit );

    my %uniq;
    foreach (@links) {
        $uniq{$_}++;
    }

    print "Visited: ", $visit, " found these links:\n", join( "\n", keys %uniq ), "\n";

    Update: this code was put here after talking to mkurtis in the CB. It appears to do most of the things mkurtis is after, so I posted it for future reference.
    Most of the code was taken straight from the docs for HTML::SimpleLinkExtor, LWP::RobotUA and LWP::UserAgent.

    This is the first time I've used any of those modules and it was quite cool :-)


    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.

      Thanks kappa and BazB. Would you by any chance know how to use URI::URL to make all the links absolute? I believe the syntax in this case would be

      url("links from extor")->abs("$url");
      But I don't see how to put this in the do loop.
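      Something like this is what I'm imagining, using the @links and $visit names from BazB's example (url() is exported by URI::URL):

      use URI::URL;

      foreach my $link (@links) {
          # resolve each (possibly relative) link against the page just visited
          my $abs = url($link)->abs($visit)->as_string;
          $uniq{$abs}++;
      }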

      Because CPAN search isn't working, here's the URI link: URI::URL

      thanks a bunch

        There's no need to use URI::URL. Change the HTML::SimpleLinkExtor code to return absolute instead of relative paths.

        There is a comment in the code in my last reply noting that it currently returns relative paths and pointing you at the module docs. RTFM :-)
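        In other words, inside grab_links, something along these lines (going by the HTML::SimpleLinkExtor docs, the constructor takes an optional base URL that relative links are resolved against):

        my $extor = HTML::SimpleLinkExtor->new($url);   # pass the page's own URL as the base
        $extor->parse($content);
        my @links = $extor->a;                          # these come back absolute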


        If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
        That way everyone learns.

Re: Cutting Out Previously Visited Web Pages in A Web Spider
by kappa (Chaplain) on Mar 14, 2004 at 22:35 UTC
    mkurtis, I tried to mimic the behaviour you seem to expect. Try this.

    Updated: HTML::SimpleLinkExtor returns links only from the first parse, so a fresh extractor is created for each page.

    #!/usr/bin/perl -w

    use strict;
    use LWP::RobotUA;
    use HTML::SimpleLinkExtor;

    use vars qw/$http_ua $link_extractor/;

    sub crawl {
        my @queue = @_;
        my %visited;
        my $a = 0;

        while ( my $url = shift @queue ) {
            next if $visited{$url};

            my $content = $http_ua->get($url)->content;

            open FILE, '>' . ++$a . '.txt';
            print FILE $content;
            close FILE;

            print qq{Downloaded: "$url"\n};

            push @queue, do {
                my $link_extractor = new HTML::SimpleLinkExtor;
                $link_extractor->parse($content);
                $link_extractor->a
            };

            $visited{$url} = 1;
        }
    }

    $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';

    crawl(@ARGV);
Re: Cutting Out Previously Visited Web Pages in A Web Spider
by TomDLux (Vicar) on Mar 11, 2004 at 02:24 UTC

    Have you looked at the FAQ?

    Have you used the Search?

    The answers are there, probably under the keyword duplicate, probably in reference to arrays and hashes.

    Come back after you have thoroughly searched those, and ask again if there are elements you do not understand.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      I searched and found How do I avoid inserting duplicate numbers into an Access table?, and read perlfaq4, which is the same as perldoc -q duplicate. I guess what I don't understand is why my way doesn't work: I am trying to take all the links I have visited and, if any of them are the same as $links, not print them to the file I take to-be-visited URLs out of. I also have no clue how to do it differently. I don't think a database approach will work; I already tried. I also read perldoc -f splice, but I don't see how I would know what position the element is at.

      Thanks

        If you are saving info on each page you find to a file, couldn't you just check whether the file already exists before writing to it?

        I didn't really understand your code, but you could save each URL in a hash. Then just check whether the URL already exists in your hash before reading the page again. The hash would only get as big as the number of sites you spider.
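        Something like this, as a rough sketch:

        my %visited;

        foreach my $url (@links) {
            next if $visited{$url};    # already seen it, skip the fetch
            $visited{$url} = 1;
            # ... fetch and process $url here ...
        }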


        ___________
        Eric Hodges