Re: What is the fastest way to download a bunch of web pages?

by lestrrat (Deacon) on Mar 03, 2005 at 16:08 UTC


in reply to What is the fastest way to download a bunch of web pages?

Just to make things more interesting, I'd suggest you take a look at an event-based approach, for example via POE (POE::Component::Client::HTTP) or the like.
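
A minimal sketch of what that looks like, assuming the standard POE::Component::Client::HTTP request/response interface; the URL list is a placeholder:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POE qw(Component::Client::HTTP);
    use HTTP::Request::Common qw(GET);

    # Placeholder URLs -- substitute your own list.
    my @urls = qw( http://www.example.com/ http://www.example.org/ );

    # Spawn the non-blocking HTTP client under the alias 'ua'.
    POE::Component::Client::HTTP->spawn(
        Alias   => 'ua',
        Timeout => 30,
    );

    POE::Session->create(
        inline_states => {
            _start => sub {
                # Queue every request up front; the kernel
                # interleaves the I/O in a single process.
                $_[KERNEL]->post( 'ua', 'request', 'got_response', GET($_) )
                    for @urls;
            },
            got_response => sub {
                my ( $request_packet, $response_packet ) = @_[ ARG0, ARG1 ];
                my $request  = $request_packet->[0];    # HTTP::Request
                my $response = $response_packet->[0];   # HTTP::Response
                printf "%s => %s\n", $request->uri, $response->status_line;
            },
        },
    );

    POE::Kernel->run();    # blocks until all requests have completed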

But I'd suggest that you keep this in the back of your head and leave it for the future, because it requires thinking about non-blocking I/O, the order in which things happen, and so on.

It was pretty hard for me personally to write a web crawler that way.

But anyway, it *is* possible to push fetching throughput to roughly 10K-20K URLs/hour with such an approach. And this is with a single process.


Re^2: What is the fastest way to download a bunch of web pages?
by tphyahoo (Vicar) on Mar 03, 2005 at 16:27 UTC
    Sounds promising. Any open code to do this?

      If you're looking to control how many child processes run at once, Parallel::ForkManager may be helpful. The example in the module's documentation demonstrates specifically what I think you're trying to accomplish.
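
      A minimal sketch of the forking approach, assuming Parallel::ForkManager's standard start/finish interface; the URL list and the filename scheme are placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;
    use LWP::Simple qw(getstore);

    # Placeholder URLs -- substitute your own list.
    my @urls = qw( http://www.example.com/ http://www.example.org/ );

    # Allow at most 10 children to run at once.
    my $pm = Parallel::ForkManager->new(10);

    for my $url (@urls) {
        $pm->start and next;    # parent: move on to the next URL

        # Child: fetch one page to a local file, then exit.
        ( my $file = $url ) =~ s{[^\w.-]+}{_}g;    # crude filename from URL
        getstore( $url, $file );

        $pm->finish;
    }
    $pm->wait_all_children;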
