Re: Cutting Out Previously Visited Web Pages in A Web Spider

by BazB (Priest)
on Mar 14, 2004 at 23:09 UTC


in reply to Cutting Out Previously Visited Web Pages in A Web Spider

Here's my pretty simple try (without using files and just for a single page). Adjust to taste:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::RobotUA;
use HTML::SimpleLinkExtor;

sub grab_links {
    my ( $ua, $url ) = @_;
    my @links;
    my $response = $ua->get($url);
    if ( $response->is_success ) {
        my $extor   = HTML::SimpleLinkExtor->new();
        my $content = $response->content;
        $extor->parse($content);
        @links = $extor->a;    # get the <a> links. Check the docs - these are relative paths.
    }
    else {
        die $response->status_line;
    }
    return @links;
}

my $visit = $ARGV[0];

my $ua = LWP::RobotUA->new( 'my-robot/0.1', 'me@foo.com' );    # Change this to suit.
$ua->delay(0.1);    # note: delay() takes minutes, so 0.1 is one hit every 6 seconds

my @links = grab_links( $ua, $visit );

my %uniq;
foreach (@links) {
    $uniq{$_}++;
}

print "Visited: ", $visit, " found these links:\n",
      join( "\n", keys %uniq ), "\n";
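For the multi-page case the node title asks about, one possible adjustment (a sketch reusing $ua and grab_links() from above; untested) is a queue plus a %visited hash:

# Sketch: breadth-first crawl that skips previously visited pages.
my %visited;
my @queue = ($ARGV[0]);
while ( my $url = shift @queue ) {
    next if $visited{$url}++;    # cut out pages we've already seen
    print "Visiting: $url\n";
    push @queue, grab_links( $ua, $url );
}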

Update: this code was put here after talking to mkurtis in the CB. It appears to do most of the things mkurtis is after, so I posted it for future reference.
Most of the code was taken straight from the docs for HTML::SimpleLinkExtor, LWP::RobotUA and LWP::UserAgent.

This is the first time I've used any of those modules and it was quite cool :-)


If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
That way everyone learns.


Re: Cutting Out Previously Visited Web Pages in A Web Spider
by mkurtis (Scribe) on Mar 15, 2004 at 03:58 UTC
    Thanks kappa and BazB. Would you by any chance know how to use URI::URL to make all links absolute? I believe the syntax in this case would be

    url("links from extor")->abs("$url");

    but I don't see how to put this in the do loop (see the sketch below).

    Because CPAN isn't searching, here's the URI link: URI::URL

    Thanks a bunch.
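    For reference, a minimal sketch of how that call could sit in the loop (assuming @links and $url as in BazB's code above; untested):

    use URI::URL;

    # make each (possibly relative) extracted link absolute against the page URL
    my @absolute = map { url($_)->abs($url)->as_string } @links;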

      There's no need to use URI::URL. Change the HTML::SimpleLinkExtor code to return absolute instead of relative paths.

      There is a comment in the code in my last reply noting that it currently returns relative paths, and pointing you to the module docs. RTFM :-)
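      Concretely, if I'm reading the HTML::SimpleLinkExtor docs right, the constructor takes an optional base URL and uses it to resolve relative links, so the change inside grab_links() is one argument (a sketch, untested):

      # pass the page URL as a base so relative links are resolved
      my $extor = HTML::SimpleLinkExtor->new($url);
      $extor->parse($content);
      @links = $extor->a;    # now absolute URLs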


      If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
      That way everyone learns.

        I have been trying to build in your error checking, as well as tie it into a DBM for when it crashes. Here's what I've got:
        #!/usr/bin/perl -w

        use strict;
        use warnings;

        use LWP::RobotUA;
        use HTML::SimpleLinkExtor;

        use vars qw/$http_ua $link_extractor/;

        sub crawl {
            my @queue = @_;
            my %visited;
            my $a = 42464;
            my $base;
            dbmopen( %visited, "visited", 0666 );
            while ( my $url = shift @queue ) {
                next if $visited{$url};
                my $response = $http_ua->get($url);
                if ( $response->is_success ) {
                    my $content = $response->content
                    open FILE, '>' . ++$a . '.txt';
                    print FILE "$url\n";
                    print FILE $content;
                    close FILE;
                    print qq{Downloaded: "$url"\n};
                    push @queue, do {
                        my $link_extractor = HTML::SimpleLinkExtor->new($url);
                        $link_extractor->parse($content);
                        $link_extractor->a;
                    };
                    $visited{$url} = 1;
                }
                else {
                    dbmclose(%visited);
                    die $response->status_line;
                }
            }
        }

        $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
        $http_ua->delay( 10 / 6000 );
        crawl(@ARGV);
        This gives me: Can't call method "is_success" without a package or object reference at theusefulbot.pl line 21.

        Any ideas?

        Thank you
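        One thing that stands out in the code as pasted (an observation, not a confirmed diagnosis of that exact error message): the statement assigning $content has no terminating semicolon before the open, which perl should refuse to compile. That section would need to read something like this (with a 3-arg open and an error check thrown in as a side improvement):

        my $content = $response->content;    # semicolon was missing here
        open FILE, '>', ++$a . '.txt' or die "can't open: $!";
        print FILE "$url\n";
        print FILE $content;
        close FILE;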
