
Re: Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider

by mkurtis (Scribe)
on Mar 13, 2004 at 18:05 UTC ( [id://336389] )


in reply to Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
in thread Cutting Out Previously Visited Web Pages in A Web Spider

Thanks so much, kappa; I sure wish I could vote more than once for your post. But I still have some problems: how do I make it follow the links it extracts? It just stops. For example, when I start it on wired.com, it creates 77 files and then exits to the command prompt. I have modified your code into this:
#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTML::SimpleLinkExtor;
use vars qw/$http_ua $link_extractor/;

my @queue;
@queue = qw ("http://www.wired.com");

sub crawl {
    my $a = 0;
    my %visited;
    my $links;
    my @links;
    while (my $url = shift @queue) {
        next if $visited{$url};
        my $content = $http_ua->get($url)->content;
        open(FILE, ">/var/www/data/$a.txt");
        print FILE "$url\n";
        print FILE "$content";
        close(FILE);
        print qq{Downloaded: "$url"\n};
        push @queue, do {
            $link_extractor->parse($content);
            @links = $link_extractor->a
        };
        foreach $links (@links) {
            unshift @queue, $links;
        }
        $visited{$url} = 1;
        $a++;
    }
}

$http_ua = new LWP::RobotUA theusefulbot => 'bot@theusefulnet.com';
$http_ua->delay(10/6000);
$link_extractor = new HTML::SimpleLinkExtor;

crawl(@ARGV);
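
A possible wrinkle, offered only as a guess: HTML::SimpleLinkExtor hands back the href values as written in the page, so a link like /news/index.html comes back relative and cannot be fetched on its own. A rough helper that resolves extracted links against the page they came from, using the URI module, might look like this (the sub name absolute_links and the http/https filter are purely illustrative):

#!/usr/bin/perl -w
# Sketch only: resolve the links found on one page into absolute URLs
# so they can be fetched later.
use strict;
use URI;
use HTML::SimpleLinkExtor;

sub absolute_links {
    my ($content, $base_url) = @_;
    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse($content);
    my @absolute;
    for my $link ($extor->a) {
        my $uri = URI->new_abs($link, $base_url);  # "/foo" becomes "http://host/foo"
        next unless $uri->scheme and $uri->scheme =~ /^https?$/;  # skip mailto:, javascript:, ...
        push @absolute, $uri->as_string;
    }
    return @absolute;
}

In the loop above, the push/unshift pair could then become push @queue, absolute_links($content, $url);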
Also, what do I do when the array gets too large?

Thanks again

Replies are listed 'Best First'.
Re: Re: Re: Re: Cutting Out Previously Visited Web Pages in A Web Spider
by kappa (Chaplain) on Mar 14, 2004 at 20:51 UTC
    You unshift links into the queue after pushing them there several lines above. That's odd, but it doesn't matter, since the script never crawls the same URL twice. My original code already did everything you need about links and queueing, btw. Next, I can't debug mirroring wired.com, sorry :) I pay for traffic. Try watching the queue of pending visits grow and catch the moment your script finishes. And last: your arrays won't get too large anytime soon. Really. Your computer should be able to handle an array of a million links without much trouble, I suppose. As a first possible optimization, I'd suggest filtering visited links before adding new ones to the queue rather than before crawling.
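
    A rough, untested sketch of that filter-before-enqueue idea (the names %seen and enqueue are only illustrative):

        # Mark a URL as seen the moment it is queued, so @queue never
        # accumulates duplicates waiting to be crawled.
        my (%seen, @queue);

        sub enqueue {
            for my $link (@_) {
                next if $seen{$link};    # already fetched or already waiting
                $seen{$link} = 1;
                push @queue, $link;
            }
        }

        # inside the crawl loop: enqueue($link_extractor->a);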
      How do I get it to go to other pages, though? When I visit wired.com, for example, I want it to take all the links off of it and visit them, and for each page it visits off of wired, take the links off those pages and visit them, and so on. This one only takes the links off of wired.com, not off any of the pages that wired links to.

      Thank you
