What is the fastest way to download a bunch of web pages?

by tphyahoo (Vicar)
on Mar 03, 2005 at 12:14 UTC ( [id://436173]=perlquestion )

tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Brother monks, I need to make a bunch of LWP GET requests to grab HTML, where speed matters. The order in which the HTML gets grabbed is unimportant.

From browsing around PM, I think I could do this faster using forking, threading, LWP::Parallel::UserAgent, or some combination thereof. But which combination? To complicate matters, I'm on Windows with ActiveState. Ideally I'd like a program that's OS-independent and works on both Windows and Unix.

But I'm a total threading newbie. Can somebody point me in the right direction? A program that seemed like it could be adapted to my task was merlyn's parallel stress tester, but I'm just wondering if there's an easier/cleaner/faster way. That was back in 1998, and merlyn wrote then that LWP::Parallel::UserAgent should be folded into the normal LWP library, but I don't think this has happened... has it?

use strict;
use warnings;
use LWP::UserAgent;

my $output_file = "output.txt";
my %html;

# Localtime printouts check only
# how long it takes to download the html,
# not do the printing
print scalar localtime , "\n";
while (<DATA>) {
    $html{$_} = get_html($_);
}
print scalar localtime , "\n";
#output:
#Thu Mar 3 13:10:59 2005
#Thu Mar 3 13:11:16 2005
# ~ 17 seconds

#print out the html as a sanity check.
open F, "> output.txt" or die "couldn't open output file";
foreach (keys %html) {
    print F "$_:\n" . $html{$_};
}
close F;

sub get_html {
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    $ua->timeout(30);

    my $request = HTTP::Request->new('GET', "$url");

    my $www = $ua->request($request);
    my $html;
    unless ( $www->is_success ) {
        # Attempt to get html failed.
        die "Failed to get html for $url"
    }
    else {
        # html retrieved ok.
        $html = $www->content;
    }
    return $html;
}

__DATA__
This takes about 16 seconds to fetch ten web pages and print them to a file.

Is there a better/faster way? Can someone point me towards the light? Thanks!

UPDATE: I followed inman's tip and wound up with

#! /usr/bin/perl -w
use strict;
use warnings;
use LWP;
use threads;
use Thread::Queue;

my $query = "perl";
my $dataQueue = Thread::Queue->new;
my $threadCount = 0;
my $output_file = "output.txt";
my %html;

my $start = time;
while (<DATA>) {
    chomp;
    #s/^\s+//; s/\s+$//;
    #my ($engine, $url) = split /\s+/;
    #next unless $url;
    my $url = $_;
    my $thr = threads->new(\&get_html, $url);
    $thr->detach;
    $threadCount ++;
}

while ($threadCount) {
    my $url = $dataQueue->dequeue;
    $html{$url} = $dataQueue->dequeue;
    $threadCount --;
}
print "done in " . scalar ( time - $start) . " seconds.";

#print out the html as a sanity check.
open F, "> output.txt" or die "couldn't open output file";
foreach (keys %html) {
    print F "$_:\n" . $html{$_};
}
close F;

sub get_html {
    #my $engine = shift;
    my $url = shift;

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0');
    $ua->timeout(10);
    $ua->env_proxy;

    my $response = $ua->get($url);

    if ($response->is_success) {
        $dataQueue->enqueue($url, $response->content);
    }
    else {
        $dataQueue->enqueue($url, $response->message);
    }
}

__DATA__
which ran in 8-14 seconds. That's a bit faster than what I started out with, but not as fast as what I was getting with BrowserUk's method below. Also, I would sometimes get "a thread exited while two other threads were running" warnings; I'm not sure what this means. This never happened running BrowserUk's code.
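A likely cause of that warning is that the main thread reaches the end of the program while some detached threads have queued their result but not yet finished exiting. The sketch below joins each thread instead of detaching it, so the program cannot end before its workers do, and it takes the page content from the thread's return value rather than from a queue. The names `fetch_all` and `lwp_get` are made up for illustration, and this still starts one thread per URL, with no cap.

```perl
use strict;
use warnings;
use threads;

# Hypothetical helper: start one thread per distinct URL, then join them all.
# $fetch is any code ref that takes a URL and returns the page body.
sub fetch_all {
    my ($fetch, @urls) = @_;

    # Drop duplicates so that every thread we start is also joined.
    my %seen;
    @urls = grep { !$seen{$_}++ } @urls;

    my %thread_for;
    for my $url (@urls) {
        # Ask for scalar context explicitly so join() returns one value.
        $thread_for{$url} =
            threads->create({ context => 'scalar' }, $fetch, $url);
    }

    my %html;
    for my $url (keys %thread_for) {
        # join() blocks until that thread has finished and returns its result.
        $html{$url} = $thread_for{$url}->join;
    }
    return \%html;
}

# Hypothetical fetcher built on LWP::UserAgent, loaded only when first used.
sub lwp_get {
    my $url = shift;
    require LWP::UserAgent;
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = $ua->get($url);
    return $response->is_success
        ? $response->content
        : 'ERROR: ' . $response->status_line;
}
```

Calling `fetch_all(\&lwp_get, @urls)` returns a hash reference keyed by URL. Passing the fetcher in as a code ref also makes it easy to swap in a stub when testing without a network.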

I also agree with BrowserUk that 10 isn't enough to benchmark, so at some point I'll try this out grabbing 50 or 100 web pages at a time.

Replies are listed 'Best First'.
Re: What is the fastest way to download a bunch of web pages?
by BrowserUk (Patriarch) on Mar 03, 2005 at 13:30 UTC

    It would be interesting to see how the various options stack up against each other, but they would all need to be run from the same place to get anything like a reasonable comparison.

    From where I am, it makes little or no difference to the total time whether I do all 10 in parallel or serially. The bottleneck is entirely the narrowness of my 40K pipe. YMMV.

    To that end, here's a threaded solution:

    #! perl -slw
    use strict;
    use threads;
    use Thread::Queue;
    use LWP::Simple;
    use Time::HiRes qw[ time ];
    $|=1;

    our $THREADS ||= 3;
    our $PATH ||= 'tmp/';

    sub fetch {
        my $Q = shift;
        while( $Q->pending ) {
            my $url = $Q->dequeue;
            my $start = time;
            warn "$url : "
                . getstore( "http://$url/", "$PATH$url.htm" )
                . "\t" .( time() - $start ) . $/;
        }
    }

    my $start = time;

    my $Q = new Thread::Queue;
    $Q->enqueue( map{ chomp; $_ } <DATA> );

    my @threads = map{ threads->new( \&fetch, $Q ) } 1 .. $THREADS;

    $_->join for @threads;

    print 'Done in: ', time - $start, ' seconds.';

    __DATA__

    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
      6 seconds, definitely faster. I'm almost done following up on inman's tip, then I'll report whether his way was faster on my box. The difference seems to be that you restricted yourself to three threads, whereas he had no restriction.

      Anyway, thanks.

        The difference seems to be that you restricted yourself to three threads,

        Just add -THREADS=10 to the command line.

        Try varying the number 2/3/5/10 and see what works best for you. With my connection, the throughput is purely down to the download speed, but if you are on broadband, the network latency may come into play. Choosing the right balance of simultaneous requests versus bandwidth is a suck-it-and-see equation. It will depend on a lot of things including time of day, locations etc.

        You can also use -PATH=tmp/ to tell it where to put the files.

        You really need to be doing more than 10 sites for a reasonable test anyway.

        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.
        "He had no restriction" was due to personal laziness rather than an optimised answer. BrowserUk's solution is more engineered, since it allocates a thread pool (with a variable number of threads) and therefore manages the total amount of traffic being generated at any one time.

        Let's say for example that you were trying to download 100 pages from the same website. My solution would batter the machine at the other end and effectively be a denial of service attack. The thread pool managed approach allows you to tune your network use.
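One small step towards being gentler on any single server, sketched here under my own assumptions, is to reorder the URL list before it goes into the queue so that consecutive entries point at different hosts. The helpers `host_of` and `interleave_by_host` are hypothetical names. The host is extracted with a simple regex that ignores things like `user:pass@` prefixes, so treat it as an illustration rather than a URL parser.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the host part out of a URL or bare host name.
sub host_of {
    my $url = shift;
    my ($host) = $url =~ m{^(?:\w+://)?([^/:]+)};
    return lc($host // '');
}

# Hypothetical helper: round-robin over hosts, one URL per host per pass,
# keeping the original order within each host.
sub interleave_by_host {
    my @urls = @_;

    my (%queue, @order);
    for my $url (@urls) {
        my $host = host_of($url);
        push @order, $host unless $queue{$host};   # first time we see this host
        push @{ $queue{$host} }, $url;
    }

    my @out;
    while (@order) {
        my @still_pending;
        for my $host (@order) {
            push @out, shift @{ $queue{$host} };
            push @still_pending, $host if @{ $queue{$host} };
        }
        @order = @still_pending;
    }
    return @out;
}
```

This only spreads requests out. If every URL lives on the same server, the reordering changes nothing and the size of the thread pool is still what limits the load.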

        There's more than one way to do it (and the other guy did it better!)

Re: What is the fastest way to download a bunch of web pages?
by inman (Curate) on Mar 03, 2005 at 12:25 UTC
    Check out this node as a start.
      Yes, that's pretty much what I needed. Will rework my post from that node and update. Thanks!

      Conclusion is

      use threads; use Thread::Queue;
      works ok on windows. Whether it's faster, I will say after I rework and test it.
Re: What is the fastest way to download a bunch of web pages?
by lestrrat (Deacon) on Mar 03, 2005 at 16:08 UTC

    Just to make things more interesting, I'd suggest you take a look at an event-based approach, for example, via POE (POE::Component::Client::HTTP) or the like.

    But I'd suggest that you keep this in the back of your head, and leave it for the future, because it requires that you think about I/O, order of things, blah blah blah.

    It was pretty hard for me personally to write a web crawler like that.

    But anyway, it *is* possible to increase the performance of fetching websites to about 10K ~ 20K urls/hour using such an approach. And this is with a single process.
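To give a feel for what "event based" means without pulling in POE, here is a rough sketch that uses only core modules: a hand-rolled `IO::Select` loop, not POE and not POE::Component::Client::HTTP. It assumes plain HTTP on port 80, fetches only `/`, and stores the raw response (headers included) with no redirect or error handling. `build_request` and `fetch_all` are made-up names. The connects still happen one after another; only the reads are multiplexed.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Hypothetical helper: a minimal HTTP/1.0 request for one path on one host.
sub build_request {
    my ($host, $path) = @_;
    return "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n";
}

# Hypothetical helper: send a request to every host, then read whichever
# socket has data ready until every server has closed its connection.
sub fetch_all {
    my @hosts = @_;

    my $select = IO::Select->new;
    my (%host_for, %raw);

    for my $host (@hosts) {
        my $sock = IO::Socket::INET->new(
            PeerAddr => $host,
            PeerPort => 80,
            Proto    => 'tcp',
            Timeout  => 10,
        );
        unless ($sock) {
            warn "connect to $host failed: $@\n";
            next;
        }
        print {$sock} build_request($host, '/');
        $select->add($sock);
        $host_for{$sock} = $host;    # the stringified handle serves as the key
        $raw{$host}      = '';
    }

    while ($select->count) {
        # Wait up to 30 seconds for any socket to become readable.
        my @ready = $select->can_read(30) or last;
        for my $sock (@ready) {
            my $n = sysread($sock, my $buf, 8192);
            if ($n) {
                $raw{ $host_for{$sock} } .= $buf;
            }
            else {
                # 0 means end of file, undef means error: either way, done.
                $select->remove($sock);
                close $sock;
            }
        }
    }
    return \%raw;
}
```

A real crawler needs a good deal more than this (timeouts per connection, HTTPS, redirects, non-blocking connects), which is where a framework like POE earns its keep.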

      Sounds promising. Any open code to do this?

        If you're looking to control how many child processes, Parallel::ForkManager may be helpful. The example source specifically demonstrates what I think you're trying to accomplish.
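The underlying pattern is small enough to sketch with nothing but core `fork` and `wait`. The code below is a stand-in for the idea, not Parallel::ForkManager's API; the module wraps the same bookkeeping more conveniently. `run_limited` is a hypothetical name. Children cannot hand data back through memory, so each job is expected to save its own result, for example to a file.

```perl
use strict;
use warnings;

# Hypothetical helper: run $job->($item) in a child process for every item,
# with at most $max children alive at any one time.
sub run_limited {
    my ($max, $job, @items) = @_;

    my $running = 0;
    for my $item (@items) {
        # At the limit: block until one child exits before starting another.
        if ($running >= $max) {
            wait();
            $running--;
        }

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;

        if ($pid == 0) {
            # Child: do the work, then leave without returning to the loop.
            $job->($item);
            exit 0;
        }
        $running++;    # parent
    }

    # Reap whatever is still running.
    1 while wait() != -1;
    return;
}
```

A call such as `run_limited(5, sub { my $url = shift; ... }, @urls)` keeps at most five downloads going at once. The job body is left open here because it depends on which HTTP client is used and where the pages should be stored.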

Re: What is the fastest way to download a bunch of web pages?
by Anonymous Monk on Mar 03, 2005 at 12:28 UTC
    Maybe. There are many parameters that determine what the most efficient way is to download a number of webpages. Some of the more important issues are the capacity of your box (number of CPUs, your amount of memory, your disk I/O, your network I/O, what else is running on it), the capacity of the network between you and the servers you are downloading from, and the setup of the servers you are querying.

    If you are really serious about the speed issue, you need to look at your infrastructure. All we can do here is guess, or present our own experience as the absolute truth, both not uncommon on Perl forums, but probably not very useful for anyone.

      Thanks for the quick answer.

      Well, like I said, I'm developing on WinXP, ActiveState. Box, it's modern but nothing special. 512MB Ram, 1 CPU, Ghz I don't know, whatever was standard for new desktops in 2004.

      I have a vanilla 512MB dsl connection with deutsche telekom, as far as I know.

      Why would disk IO matter, and how do I find out disk IO? Ditto for the capacity of the network.

      If I accomplish what I want to accomplish, when this leaves the development phase I may be running the code on a linux box with more juice. Basically I just want to keep things flexible for the future.

        If you want to get solid advice just based on a few raw specs, hire a consultant. There are many consultants who want to make a quick buck by giving advice based on just the numbers. You're mistaken if you think that there's a table that says that for those specs, this and that is the best algorithm.

        As for why disk I/O matters, well, I'm assuming you want to store your results, and you're downloading a significant amount of data, enough to not be able to keep it all in memory. So, you have to write to disk. Which means that it's a potential bottleneck (if all the servers you download from are on your local LAN, you could easily get more data per second over the network than your disk can write - depending of course on the disk(s) and the network).

        Of course, if all you care about is downloading a handful of pages, each from a different server, in a reasonably short time, perhaps something as simple as:

        system "wget $_ &" for @urls;
        will be good enough. But that doesn't work well if you need to download 10,000 documents, all from the same server.

Approved by Arunbear