PerlMonks  

Get 10,000 web pages fast

by Mad_Mac (Beadle)
on Jun 17, 2010 at 11:52 UTC ( [id://845186]=perlquestion )

Mad_Mac has asked for the wisdom of the Perl Monks concerning the following question:

I have a list of ~10,000 URLs, stored in a hash keyed by URL with a user-friendly filename as the value, that I need to retrieve from a web server for local parsing and analysis.

My code seems to have a memory leak, and eventually cannot fork because it runs out of resources.

Here's the relevant bit of my code:

$curcount = 0;
$url_count = keys %url_list;
my $pm = new Parallel::ForkManager(100);

foreach $url (keys %url_list) {
    $curcount++;
    my $fname = $url_list{$url};
    printf STDERR ("\r%02d ($fname) of $url_count files retrieved.", $curcount);
    $pm->start and next;
    getstore($url, $fname) or die 'Failed to get page';
    $pm->finish;
}
$pm->wait_all_children;

I thought of trying LWP::Parallel, but it won't install on my system. If it matters, I am doing this with Strawberry Perl in a 32-bit Windows 7 VM on a Linux Mint x64 host. I'm not sure exactly which version of Perl I have: the MSI from Strawberry's site says 5.12.1, but perl -ver says 5.10.1. The host has 8 GB RAM, and I have allocated 4 GB to the Win7 VM. The Perl process starts out using ~600 MB and creeps up to ~2 GB before it crashes (sometimes sooner).

So, my questions are:

  • How do I stop this from using up all the memory in the VM?
  • Is there a faster way to grab lots of pages at the same time?

Thanks

    Replies are listed 'Best First'.
    Re: Get 10,000 web pages fast
    by BrowserUk (Patriarch) on Jun 17, 2010 at 12:58 UTC

      Using Windows fork emulation for this is not a good idea.

      A pool of threads will be far more efficient (on Windows), and is hardly more complex:

      #! perl -slw
      use strict;
      use threads;
      use threads::shared;
      use Thread::Queue;
      use LWP::Simple;

      sub worker {
          my( $Q, $curCountRef ) = @_;
          while( my $work = $Q->dequeue ) {
              my( $url, $name ) = split $;, $work;
              my $rc = getstore( $url, $name );
              warn( "Failed to fetch $url: $rc\n" ), next if $rc != RC_OK;
              lock $$curCountRef;
              printf STDERR "%06d:($name) fetched\n", ++$$curCountRef;
          }
      }

      our $W       //= 20;
      our $urlFile //= 'urls.txt';

      my $Q = new Thread::Queue;
      my $curCount :shared = 0;

      my @threads = map {
          threads->create( \&worker, $Q, \$curCount );
      } 1 .. $W;

      open URLS, '<', $urlFile or die "$urlFile : $!";
      my $fileNo = 0;
      while( my $work = <URLS> ) {
          chomp $work;
          $Q->enqueue( sprintf "%s$;./tmp/saved.%06d", $work, ++$fileNo );
          sleep 1 while $Q->pending > $W;
      }
      close URLS;

      $Q->enqueue( ( undef ) x $W );
      $_->join for @threads;

      By leaving the URLs in a file, this reads them on demand and avoids filling memory with many copies of them. As posted, it expects the URLs to be in a file called ./urls.txt and runs a pool of 20 workers (thisScript -W=20 -urlFile=./urls.txt). The retrieved files are written to ./tmp/saved.nnnnnn. Adjust to suit.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        BrowserUk,

        It's been a while since you posted your recommendation. I just wanted to let you know, I used your suggested code and it worked great ... until today.

        I started getting "Free to wrong pool 2d3e040 not 778ea0 during global destruction." errors. Obviously the memory address changes with each run, but the error is moderately consistent across two different W7 builds. Both are running Strawberry Perl 5.12. One is using threads v1.77 and the other v1.81.

        I've been Googling this error, but haven't found a solution yet. I'd appreciate any suggestions you (or anyone) have on this.

        Thanks

          Revert to threads 1.76 and send a bug report to the module's maintainer, with the polite suggestion that he test his changes more thoroughly.


    Re: Get 10,000 web pages fast
    by Anonymous Monk on Jun 17, 2010 at 12:13 UTC
      I would split the list into 100 files, and launch 100 instances of wget or curl, and let them do the mirroring.
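
      A rough sketch of that approach in Perl, assuming the URLs sit one per line in a urls.txt file and that wget is on the PATH (the chunk count, file names, and output directories are made up for illustration; on Windows you might spawn the processes with system(1, 'wget', ...) instead of fork/exec):

      #!/usr/bin/perl
      # Sketch: split urls.txt into chunk files and hand each chunk to its own wget.
      use strict;
      use warnings;

      my $chunks = 100;                        # how many wget instances to launch
      open my $in, '<', 'urls.txt' or die "urls.txt: $!";
      chomp( my @urls = <$in> );
      close $in;

      my $per = int( @urls / $chunks ) + 1;    # URLs per chunk file
      my @pids;

      for my $i ( 1 .. $chunks ) {
          last unless @urls;
          my @slice = splice @urls, 0, $per;

          my $list = sprintf 'chunk_%03d.txt', $i;
          open my $out, '>', $list or die "$list: $!";
          print {$out} map { "$_\n" } @slice;
          close $out;

          my $pid = fork();
          die "fork failed: $!" unless defined $pid;
          if ( $pid == 0 ) {                   # child: -i reads URLs from a file
              exec 'wget', '-q', '-i', $list, '-P', "out_$i";
              die "exec wget failed: $!";
          }
          push @pids, $pid;
      }

      waitpid $_, 0 for @pids;                 # wait for every batch to finish
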
    Re: Get 10,000 web pages fast
    by aquarium (Curate) on Jun 18, 2010 at 04:03 UTC
      All the monks are giving good help. In terms of resolving the problem with the current code, I'd be asking myself why the getstore function is using much memory at all. Hopefully it's not doing something really silly like slurping a whole page into a scalar before writing it to file. Does the program write any files at all before dying? Is getstore() being passed the hash keys and not actual URLs? Maybe you already have all these answers, or it's not that helpful... but at least I find that sometimes even the most experienced get stumped by simple things. Have fun eating the web (with your script).
      the hardest line to type correctly is: stty erase ^H
    Re: Get 10,000 web pages fast
    by pemungkah (Priest) on Jun 19, 2010 at 02:15 UTC
      First: your script should check the robots.txt at each site to determine whether or not automated scraping is welcome.
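
      A minimal sketch of such a check with WWW::RobotRules (the agent string and the allowed() helper are made up for illustration, and it ignores non-standard ports and robots.txt caching):

      use strict;
      use warnings;
      use LWP::Simple qw( get getstore );
      use WWW::RobotRules;
      use URI;

      my $agent = 'MyFetcher/0.1';             # hypothetical user-agent name
      my $rules = WWW::RobotRules->new( $agent );
      my %checked;                             # hosts whose robots.txt we already fetched

      sub allowed {
          my( $url ) = @_;
          my $host = URI->new( $url )->host;
          unless ( $checked{ $host }++ ) {
              my $robots_url = "http://$host/robots.txt";
              my $txt = get( $robots_url );    # undef if the fetch fails
              $rules->parse( $robots_url, $txt ) if defined $txt;
          }
          return $rules->allowed( $url );
      }

      # in the fetch loop:
      # getstore( $url, $fname ) if allowed( $url );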

      You should also try to make sure that you don't hammer any of the sites you're hitting into the ground, by spreading out accesses to a single server over time. Otherwise you'll unintentionally mount a denial-of-service attack on the site you're fetching from and really tick people off. If these are all different sites, this isn't such a big deal.

      As a rule of thumb, re-accessing another URL on the same site in less than one second may get you noticed and possibly yelled at, both by the person whose site you're scanning and by the ISP you're using (or the IT guys, if you're doing this at work).
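
      One way to enforce that spacing is a per-host timestamp table, sketched below; the one-second gap is just the rule of thumb above, and in a threaded or forked fetcher the %last_hit hash would need to be shared or kept per worker:

      use strict;
      use warnings;
      use Time::HiRes qw( time sleep );
      use URI;

      my $min_gap  = 1.0;                      # seconds between hits on the same host
      my %last_hit;                            # host => time of the previous request

      sub polite_wait {
          my( $url )  = @_;
          my $host    = URI->new( $url )->host;
          my $elapsed = time() - ( $last_hit{ $host } // 0 );
          sleep( $min_gap - $elapsed ) if $elapsed < $min_gap;
          $last_hit{ $host } = time();
      }

      # call polite_wait( $url ) just before each getstore( $url, $fname )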

      Some sites (I know Yahoo! does it from having worked there) will actually stop serving you real pages and just return an error page if you hit them too hard or too often.
