PerlMonks
Brother monks, I need to make a bunch of LWP GET requests to grab HTML, and speed matters. The order in which the pages get grabbed is unimportant.

From browsing around PM, I think I could do this faster using forking, threading, Parallel::UserAgent, or some combination thereof. But which combination? To complicate matters, I'm on Windows with ActiveState. Ideally I'd like a program that's OS independent and works on both Windows and Unix.

But I'm a total threading newbie. Can somebody point me in the right direction? A program that seemed like it could be adapted to my task was merlyn's parallel stress tester, but I'm just wondering if there's an easier/cleaner/faster way. That was back in 1998, and merlyn wrote then that Parallel::UserAgent should be folded into the normal LWP library, but I don't think this has happened... has it?
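For the record, merlyn's Parallel::UserAgent did make it onto CPAN as the LWP::Parallel distribution, though it was never folded into LWP itself. A minimal sketch of how it would fit this task, assuming the register/wait API from the LWP::Parallel::UserAgent docs (the timeout and host/request limits here are guesses to tune, and this obviously needs the network to run):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

my $pua = LWP::Parallel::UserAgent->new;
$pua->timeout(30);
$pua->max_hosts(10);    # how many hosts to hit in parallel (assumption)
$pua->max_req(10);      # simultaneous requests per host (assumption)

# Register every request up front; register() returns an error
# response only if the request could not be queued.
foreach my $url (<DATA>) {
    chomp $url;
    $pua->register( HTTP::Request->new( GET => $url ) );
}

# wait() blocks until all registered requests finish (or time out)
# and returns a hashref of entry objects.
my $entries = $pua->wait(30);
foreach my $key ( keys %$entries ) {
    my $response = $entries->{$key}->response;
    print $response->request->url, ": ", $response->code, "\n";
}

__DATA__
http://www.google.com
http://www.yahoo.com
```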

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $output_file = "output.txt";
my %html;

# Localtime printouts check only how long it takes to
# download the html, not do the printing
print scalar localtime, "\n";
while (<DATA>) {
    chomp;    # strip the trailing newline from each URL
    $html{$_} = get_html($_);
}
print scalar localtime, "\n";
#output:
#Thu Mar  3 13:10:59 2005
#Thu Mar  3 13:11:16 2005
# ~ 17 seconds

# print out the html as a sanity check.
open F, ">", $output_file or die "couldn't open output file: $!";
foreach (keys %html) {
    print F "$_:\n" . $html{$_};
}
close F;

sub get_html {
    my $url = shift;
    my $ua  = LWP::UserAgent->new;
    $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    $ua->timeout(30);
    my $request = HTTP::Request->new( GET => $url );
    my $www     = $ua->request($request);
    unless ( $www->is_success ) {
        # Attempt to get html failed.
        die "Failed to get html for $url";
    }
    # html retrieved ok.
    return $www->content;
}

__DATA__
http://www.google.com
http://www.yahoo.com
http://www.amazon.com
http://www.ebay.com
http://www.perlmonks.com
http://news.yahoo.com
http://news.google.com
http://www.msn.com
http://www.slashdot.org
http://www.indymedia.org
takes about 17 seconds to download ten web pages and print them to a file.

Is there a better/faster way? Can someone point me towards the light? Thanks!

UPDATE: I followed inman's tip and wound up with

#!/usr/bin/perl
use strict;
use warnings;
use LWP;
use threads;
use Thread::Queue;

my $dataQueue   = Thread::Queue->new;
my $threadCount = 0;
my $output_file = "output.txt";
my %html;

my $start = time;
while (<DATA>) {
    chomp;
    my $url = $_;
    my $thr = threads->new( \&get_html, $url );
    $thr->detach;
    $threadCount++;
}

# Each worker enqueues two items: the url, then its content.
while ($threadCount) {
    my $url = $dataQueue->dequeue;
    $html{$url} = $dataQueue->dequeue;
    $threadCount--;
}
print "done in " . ( time - $start ) . " seconds.";

# print out the html as a sanity check.
open F, ">", $output_file or die "couldn't open output file: $!";
foreach (keys %html) {
    print F "$_:\n" . $html{$_};
}
close F;

sub get_html {
    my $url = shift;
    my $ua  = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.0');
    $ua->timeout(10);
    $ua->env_proxy;
    my $response = $ua->get($url);
    if ( $response->is_success ) {
        $dataQueue->enqueue( $url, $response->content );
    }
    else {
        $dataQueue->enqueue( $url, $response->message );
    }
}

__DATA__
http://www.google.com
http://www.yahoo.com
http://www.amazon.com
http://www.ebay.com
http://www.perlmonks.com
http://news.yahoo.com
http://news.google.com
http://www.msn.com
http://www.slashdot.org
http://www.indymedia.org
which ran in 8-14 seconds. A bit faster than what I started out with, but not as fast as what I was getting with BrowserUK's method below. Also, I would sometimes get "A thread exited while two other threads were running" warnings (apparently perl's way of saying the program exited while detached threads were still running). This never happened running BrowserUK's code.

I also agree with BrowserUK that 10 isn't enough to benchmark, so at some point I'll try this out grabbing 50 or 100 web pages at a time.
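Since the thread-per-URL version spawns as many threads as there are URLs, a bounded worker pool (the general shape of BrowserUK's approach) should scale better to 50 or 100 pages. Here's a minimal, self-contained sketch of the pattern; the process() sub is a hypothetical stand-in (it just reverses the string) so the queue plumbing runs offline, and you'd swap in the get_html() fetch from above:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Stand-in for the real fetch (e.g. get_html() above): reverses
# the string so this example runs without the network.
sub process { my $item = shift; return scalar reverse $item; }

my $WORKERS = 4;    # pool size: tune this, don't spawn per-URL
my $work    = Thread::Queue->new;
my $results = Thread::Queue->new;

# A fixed pool of workers, each looping until it dequeues undef.
my @pool = map {
    threads->create( sub {
        while ( defined( my $item = $work->dequeue ) ) {
            # a multi-item enqueue is atomic, so the pair stays together
            $results->enqueue( $item, process($item) );
        }
    } );
} 1 .. $WORKERS;

my @items = qw(alpha beta gamma delta epsilon);
$work->enqueue(@items);
$work->enqueue(undef) for @pool;    # one undef per worker = shutdown
$_->join for @pool;

our %out;
while ( defined( my $key = $results->dequeue_nb ) ) {
    $out{$key} = $results->dequeue_nb;
}
print "$_ => $out{$_}\n" for sort keys %out;
```

With the real fetch, each worker would create its own LWP::UserAgent inside the thread (don't share one agent across threads) and enqueue the URL/content pair exactly as the detached version does.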


In reply to What is the fastest way to download a bunch of web pages? by tphyahoo
