PerlMonks  

Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider

by BazB (Priest)
on Mar 15, 2004 at 08:55 UTC (#336641)


in reply to Re: Cutting Out Previously Visited Web Pages in A Web Spider
in thread Cutting Out Previously Visited Web Pages in A Web Spider

There's no need to use URI::URL. Change the HTML::SimpleLinkExtor code to return absolute instead of relative paths.

The code in my last reply has a comment noting that it currently returns relative paths and pointing you to the module docs. RTFM :-)
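A minimal sketch of that suggestion, assuming the module's documented behaviour that a base URL passed to `new` is used to resolve relative links. The helper name `absolute_a_links`, the sample HTML and the example.com base URL are made up for illustration and are not from the earlier reply:

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical helper: return the <a href> links found in $html,
# resolved against $base by the extractor itself (no URI::URL needed).
sub absolute_a_links {
    my ($base, $html) = @_;

    # Passing a base URL to the constructor asks the module to turn
    # relative links into absolute ones.
    my $extor = HTML::SimpleLinkExtor->new($base);
    $extor->parse($html);

    return $extor->a;
}

# Made-up sample input: one root-relative and one document-relative link.
my $html  = '<a href="/about.html">About</a> <a href="news/today.html">News</a>';
my @links = absolute_a_links('http://www.example.com/docs/', $html);

print "$_\n" for @links;
```

Absolute URLs are what you want as keys in the visited hash, since the same relative path found on two different pages can point to two different documents.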


If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
That way everyone learns.


Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
by mkurtis (Scribe) on Mar 18, 2004 at 03:52 UTC
    I have been trying to build in your error checking, as well as tie it into a DBM file so the visited list survives a crash. Here's what I've got:

        #!/usr/bin/perl -w
        use strict;
        use warnings;
        use LWP::RobotUA;
        use HTML::SimpleLinkExtor;

        use vars qw/$http_ua $link_extractor/;

        sub crawl {
            my @queue = @_;
            my %visited;
            my $a = 42464;
            my $base;

            dbmopen(%visited, "visited", 0666);

            while ( my $url = shift @queue ) {
                next if $visited{$url};

                my $response = $http_ua->get($url);
                if ($response->is_success) {
                    my $content = $response->content;
                    open FILE, '>' . ++$a . '.txt';
                    print FILE "$url\n";
                    print FILE $content;
                    close FILE;

                    print qq{Downloaded: "$url"\n};

                    push @queue, do {
                        my $link_extractor = HTML::SimpleLinkExtor->new($url);
                        $link_extractor->parse($content);
                        $link_extractor->a;
                    };
                    $visited{$url} = 1;
                } else {
                    dbmclose(%visited);
                    die $response->status_line;
                }
            }
        }

        $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
        $http_ua->delay( 10 / 6000 );
        crawl(@ARGV);

    This gives me: Can't call method "is_success" without a package or object reference at theusefulbot.pl line 21.

    Any ideas?

    Thank you
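
For reference, the persistent visited-URL check that the post describes can be sketched in isolation. This is only an illustration under stated assumptions: it uses `tie` with the core `SDBM_File` module rather than `dbmopen`, the helper name `unvisited` and the example.com URLs are hypothetical, and the DBM file lives in a temporary directory so the demo cleans up after itself. It does not address the `is_success` error above.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;                  # DBM implementation shipped with Perl
use File::Temp qw(tempdir);

# Hypothetical helper: return only the URLs not yet recorded in %$visited,
# recording each one as it is seen.
sub unvisited {
    my ($visited, @urls) = @_;
    my @fresh;
    for my $url (@urls) {
        next if $visited->{$url}++;   # already seen, skip it
        push @fresh, $url;
    }
    return @fresh;
}

my $dir  = tempdir( CLEANUP => 1 );   # demo only; a real spider would use a fixed path
my $file = "$dir/visited";

# First run: two distinct URLs, one of them queued twice.
tie my %visited, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
    or die "Cannot tie $file: $!";
my @first = unvisited( \%visited,
    qw(http://example.com/a http://example.com/b http://example.com/a) );
untie %visited;

# Simulated restart: re-open the same file; earlier URLs are still marked.
tie %visited, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
    or die "Cannot tie $file: $!";
my @second = unvisited( \%visited,
    qw(http://example.com/b http://example.com/c) );
untie %visited;

print "first run:  @first\n";
print "second run: @second\n";
```

Recording a URL as soon as it is queued, rather than after it is fetched, also keeps duplicates out of the queue; whether that is the right trade-off for a page that fails to download is a design choice.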
