PerlMonks  

Re: Cutting Out Previously Visited Web Pages in A Web Spider

by mkurtis (Scribe)
on Mar 15, 2004 at 03:58 UTC


in reply to Re: Cutting Out Previously Visited Web Pages in A Web Spider
in thread Cutting Out Previously Visited Web Pages in A Web Spider

Thanks kappa and BazB. Would you by any chance know how to use URI::URL to make all links absolute? I believe the syntax in this case would be

url("links from extor")->abs("$url");
But I don't see how to put this in the do loop.
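For reference, resolving each extracted link against the page URL can also be done with the URI module (the modern interface that URI::URL's docs point to). A minimal sketch, with placeholder example.com URLs standing in for the spider's real data:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Placeholder base URL and links; in the spider these would come from
# the fetched page and the link extractor respectively.
my $base  = 'http://example.com/dir/page.html';
my @links = ('foo.html', '../bar.html', 'http://other.example/baz');

# URI->new_abs resolves a (possibly relative) link against a base URL.
for my $link (@links) {
    print URI->new_abs($link, $base)->as_string, "\n";
}
```

Inside the while loop, the same call would map each link from the extractor before it is pushed onto @queue.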

Because CPAN search isn't working, here's the URI link: URI::URL

thanks a bunch


Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
by BazB (Priest) on Mar 15, 2004 at 08:55 UTC

    There's no need to use URI::URL. Change the HTML::SimpleLinkExtor code to return absolute instead of relative paths.

    There is a comment in the code in my last reply noting that it currently returns relative paths, and pointing you to the module docs. RTFM :-)


    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.
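As BazB's pointer to the module docs suggests, passing a base URL to the HTML::SimpleLinkExtor constructor makes it resolve extracted links to absolute form. A sketch, assuming the module's base-URL constructor argument; the example.com HTML snippet is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

my $html = '<a href="page2.html">next</a> <a href="/top.html">top</a>';

# Passing a base URL to new() tells the extractor to resolve
# relative links against it (see the HTML::SimpleLinkExtor docs).
my $extor = HTML::SimpleLinkExtor->new('http://example.com/dir/');
$extor->parse($html);

print "$_\n" for $extor->a;    # absolute URLs
```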

      I have been trying to build in your error checking, as well as tie it into a DBM for when it crashes. Here's what I've got:
      #!/usr/bin/perl -w
      use strict;
      use warnings;

      use LWP::RobotUA;
      use HTML::SimpleLinkExtor;

      use vars qw/$http_ua $link_extractor/;

      sub crawl {
          my @queue = @_;
          my %visited;
          my $a = 42464;
          my $base;
          dbmopen(%visited, "visited", 0666);
          while ( my $url = shift @queue ) {
              next if $visited{$url};
              my $response = $http_ua->get($url);
              if ($response->is_success) {
                  my $content = $response->content
                  open FILE, '>' . ++$a . '.txt';
                  print FILE "$url\n";
                  print FILE $content;
                  close FILE;
                  print qq{Downloaded: "$url"\n};
                  push @queue, do {
                      my $link_extractor = HTML::SimpleLinkExtor->new($url);
                      $link_extractor->parse($content);
                      $link_extractor->a;
                  };
                  $visited{$url} = 1;
              }
              else {
                  dbmclose(%visited);
                  die $response->status_line;
              }
          }
      }

      $http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
      $http_ua->delay( 10 / 6000 );

      crawl(@ARGV);
      This gives me: Can't call method "is_success" without a package or object reference at theusefulbot.pl line 21.

      Any ideas?

      Thank you
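On the dbmopen side of the code above, a minimal demonstration that a dbmopen-tied hash survives a close/reopen cycle, which is what makes it usable as crash recovery for the %visited set (the 'visited_demo' filename is arbitrary):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tie %visited to an on-disk DBM file so the crawl history outlives the process.
my %visited;
dbmopen(%visited, 'visited_demo', 0666) or die "dbmopen: $!";
$visited{'http://example.com/'} = 1;
dbmclose(%visited);

# Reopening the same file restores the previously stored entries.
dbmopen(%visited, 'visited_demo', 0666) or die "dbmopen: $!";
print "already seen\n" if $visited{'http://example.com/'};
dbmclose(%visited);
```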

Node Type: note [id://336594]