Re: What is the fastest way to download a bunch of web pages?

by BrowserUk (Patriarch)
on Mar 03, 2005 at 13:30 UTC ( [id://436197] )


in reply to What is the fastest way to download a bunch of web pages?

It would be interesting to see how the various options stack up against each other, but they would all need to be run from the same place to get anything like a fair comparison.

From where I am, it makes little or no difference to the total time whether I do all 10 in parallel or serially. The bottleneck is entirely the narrowness of my 40K pipe. YMMV.

To that end, here's a threaded solution:

#! perl -slw
use strict;
use threads;
use Thread::Queue;
use LWP::Simple;
use Time::HiRes qw[ time ];

$| = 1;

our $THREADS ||= 3;       ## -THREADS=n on the command line overrides this
our $PATH    ||= 'tmp/';  ## -PATH=dir/ sets where the pages are stored

sub fetch {
    my $Q = shift;
    ## dequeue_nb returns undef once the queue is empty, so each worker
    ## exits cleanly rather than racing between pending() and a blocking
    ## dequeue when two threads see the same last item.
    while( defined( my $url = $Q->dequeue_nb ) ) {
        my $start = time;
        warn "$url : " . getstore( "http://$url/", "$PATH$url.htm" )
            . "\t" . ( time() - $start ) . $/;
    }
}

my $start = time;

my $Q = Thread::Queue->new;
$Q->enqueue( map{ chomp; $_ } <DATA> );

## A fixed pool of $THREADS workers shares the one queue.
my @threads = map{ threads->new( \&fetch, $Q ) } 1 .. $THREADS;
$_->join for @threads;

print 'Done in: ', time - $start, ' seconds.';

__DATA__
www.google.com
www.yahoo.com
www.amazon.com
www.ebay.com
www.perlmonks.com
news.yahoo.com
news.google.com
www.msn.com
www.slashdot.org
www.indymedia.org

Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re^2: What is the fastest way to download a bunch of web pages?
by tphyahoo (Vicar) on Mar 03, 2005 at 13:38 UTC
    6 seconds, definitely faster. I'm almost done following up on inman's tip; then I'll report whether his way was faster on my box. The difference seems to be that you restricted yourself to three threads, whereas he had no restriction.

    Anyway, thanks.

      The difference seems to be that you restricted yourself to three threads,

      Just add -THREADS=10 to the command line.

      Try varying the number (2/3/5/10) and see what works best for you. With my connection, the throughput is purely down to the download speed, but if you are on broadband, network latency may come into play. Choosing the right balance of simultaneous requests versus bandwidth is a suck-it-and-see equation; it will depend on a lot of things, including time of day, location, etc.

      You can also use -PATH=tmp/ to tell it where to put the files (see the invocation examples below).

      You really need to be doing more than 10 sites for a reasonable test anyway.
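      For illustration, assuming the script above is saved as fetch.pl (a name chosen here, not from the original post), the switches combine like this:

        perl fetch.pl                         # defaults: 3 threads, pages under tmp/
        perl fetch.pl -THREADS=10             # ten simultaneous downloads
        perl fetch.pl -THREADS=5 -PATH=dl/    # five threads, pages stored under dl/

      The -s on the #! line is what turns those switches into the package variables $THREADS and $PATH; the ||= defaults apply whenever a switch is omitted, and the final "Done in: N seconds." line gives you a number to compare between runs.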


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.
      That he had no restriction was due to personal laziness rather than an optimised answer. BrowserUK's solution is more engineered, since it allocates a thread pool (with a variable number of threads) and therefore manages the total amount of traffic being generated at any one time.

      Let's say, for example, that you were trying to download 100 pages from the same website. My solution would batter the machine at the other end and effectively be a denial of service attack. The thread-pool-managed approach allows you to tune your network use.
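      To make the contrast concrete, here is a minimal sketch of the unbounded pattern (hypothetical code, not from either post, using the same modules as above): one thread per URL, so 100 URLs means 100 simultaneous requests against the target.

        #! perl -slw
        ## Hypothetical sketch: the unbounded, thread-per-URL pattern.
        use strict;
        use threads;
        use LWP::Simple;

        my @urls = qw( www.example.com www.example.org );  # imagine 100 of these

        ## One thread per URL -- nothing caps how many requests run at
        ## once, so every URL hits the far end simultaneously.
        my @threads = map {
            threads->new( sub { getstore( "http://$_[0]/", "$_[0].htm" ) }, $_ )
        } @urls;
        $_->join for @threads;

      With the queue-fed pool above, those same 100 URLs would never see more than $THREADS connections at once.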

      There's more than one way to do it (and the other guy did it better!)
