Re: Web Crawler

by matija (Priest)
on Mar 09, 2004 at 08:09 UTC ( #335036=note )


in reply to Web Crawler

Our way, here at the monastery, is to select "comment on", and put comments where everybody can see them (and mod them up or down, as appropriate).

I think you are getting into needless complication with the files. When I write web crawlers, I usually use an array to hold the URLs I have yet to download (push them onto the end, shift them off the front), and a hash to tell me which URLs I've already pushed into the array (NOT the ones I've already downloaded: why have multiple copies of the same URL in the array?).
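A minimal sketch of that array-plus-hash arrangement might look like the following. The names crawl, $fetch_links and $max_pages are made up for illustration, and the toy %web hash stands in for real HTTP fetching and link extraction, which are left out here.

```perl
use strict;
use warnings;

# Crawl breadth-first from $start. $fetch_links is a hypothetical callback
# that takes a URL and returns the list of URLs found on that page.
sub crawl {
    my ($start, $fetch_links, $max_pages) = @_;

    my @queue  = ($start);        # URLs still to download
    my %queued = ($start => 1);   # every URL ever pushed onto @queue
    my @visited;                  # URLs in the order they were processed

    while (@queue && @visited < $max_pages) {
        my $url = shift @queue;   # take the next URL off the front
        push @visited, $url;

        for my $link ($fetch_links->($url)) {
            next if $queued{$link}++;   # already queued once, so skip it
            push @queue, $link;         # otherwise add it to the end
        }
    }
    return @visited;
}

# Toy in-memory "web" (an assumption for the demo, not real fetching).
my %web = (
    'http://a.example/' => ['http://b.example/', 'http://c.example/'],
    'http://b.example/' => ['http://a.example/', 'http://c.example/'],
    'http://c.example/' => ['http://a.example/'],
);

my @order = crawl('http://a.example/', sub { @{ $web{ $_[0] } || [] } }, 100);
print "$_\n" for @order;
```

Because the hash is checked at push time rather than at download time, each URL enters the array at most once, however many pages link to it.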

Of course, your first question would be: what happens when that array and that hash become really, really big (which can happen quite easily on the internet)? And the answer is: when that happens, you can either use DB_File to tie both the array and the hash, or you can use a real database.
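For the DB_File route, a rough sketch could look like this. The file names and the temporary directory are placeholders I picked for the demo; a real crawler would presumably keep the files somewhere permanent so the state survives between runs.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

# Placeholder location, removed when the script exits.
my $dir = tempdir(CLEANUP => 1);

# Pending URLs, kept in a record-number (array-like) file on disk.
tie my @queue, 'DB_File', "$dir/queue.db", O_RDWR|O_CREAT, 0644, $DB_RECNO
    or die "Cannot tie queue: $!";

# URLs already pushed onto the queue, kept in an on-disk hash.
tie my %queued, 'DB_File', "$dir/queued.db", O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot tie hash: $!";

# Same logic as with plain in-memory variables: skip anything queued before.
for my $url ('http://a.example/', 'http://b.example/', 'http://a.example/') {
    next if $queued{$url};
    $queued{$url} = 1;
    push @queue, $url;
}

my $next    = shift @queue;    # oldest pending URL
my $pending = scalar @queue;   # how many are still waiting
print "next: $next, pending: $pending\n";

untie @queue;
untie %queued;
```

The crawling code itself does not need to change; only the two tie calls differ from the in-memory version. One caveat: with the default settings a RECNO file stores one record per line, so this assumes the URLs contain no newlines.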
