PerlMonks  

Get 10,000 web pages fast

by Mad_Mac (Beadle)
on Jun 17, 2010 at 11:52 UTC ( [id://845186]=perlquestion )

Mad_Mac has asked for the wisdom of the Perl Monks concerning the following question:

I have a list of ~10,000 URLs, stored in a hash keyed by URL with a user-friendly filename as the value, that I need to retrieve from a web server for local parsing and analysis.

My code seems to have a memory leak, and eventually cannot fork because it runs out of resources.

Here's the relevant bit of my code:

$curcount = 0;
$url_count = keys %url_list;
my $pm = new Parallel::ForkManager(100);

foreach $url (keys %url_list) {
    $curcount++;
    my $fname = $url_list{$url};
    printf STDERR ("\r%02d ($fname) of $url_count files retrieved.", $curcount);
    $pm->start and next;
    getstore($url, $fname) or die 'Failed to get page';
    $pm->finish;
}
$pm->wait_all_children;

I thought of trying LWP::Parallel, but it won't install on my system. If it matters, I am doing this with Strawberry Perl in a 32-bit Windows 7 VM on a Linux Mint x64 host. I'm not sure exactly which version of Perl I have: the MSI from Strawberry's site says 5.12.1, but perl -ver says 5.10.1. The host has 8 GB RAM, and I have allocated 4 GB to the Win7 VM. The Perl process starts out using ~600 MB and creeps up to ~2 GB before it crashes (sometimes sooner).

So, my questions are:

  • How do I stop this from using up all the memory in the VM?
  • Is there a faster way to grab lots of pages at the same time?

Thanks

    Replies are listed 'Best First'.
    Re: Get 10,000 web pages fast
    by BrowserUk (Patriarch) on Jun 17, 2010 at 12:58 UTC

      Using Windows fork emulation for this is not a good idea.

      A pool of threads will be far more efficient (on Windows), and is hardly more complex:

      #! perl -slw
      use strict;
      use threads;
      use threads::shared;
      use Thread::Queue;
      use LWP::Simple;

      sub worker {
          my( $Q, $curCountRef ) = @_;
          while( my $work = $Q->dequeue ) {
              my( $url, $name ) = split $;, $work;
              my $rc = getstore( $url, $name );
              warn( "Failed to fetch $url: $rc\n" ), next if $rc != RC_OK;
              lock $$curCountRef;
              printf STDERR "%06d:($name) fetched\n", ++$$curCountRef;
          }
      }

      our $W       //= 20;
      our $urlFile //= 'urls.txt';

      my $Q = new Thread::Queue;
      my $curCount :shared = 0;

      my @threads = map {
          threads->create( \&worker, $Q, \$curCount );
      } 1 .. $W;

      open URLS, '<', $urlFile or die "$urlFile : $!";
      my $fileNo = 0;
      while( my $work = <URLS> ) {
          chomp $work;
          $Q->enqueue( sprintf "%s$;./tmp/saved.%06d", $work, ++$fileNo );
          sleep 1 while $Q->pending > $W;
      }
      close URLS;

      $Q->enqueue( ( undef ) x $W );
      $_->join for @threads;

      By leaving the URLs in a file, this reads them on demand and avoids filling memory with many copies of them. As posted, it expects the URLs to be in a file called ./urls.txt and runs a pool of 20 workers (thisScript -W=20 -urlFile=./urls.txt). The retrieved files are written to ./tmp/saved.nnnnnn. Adjust to suit.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        BrowserUk,

        It's been a while since you posted your recommendation. I just wanted to let you know, I used your suggested code and it worked great ... until today.

        I started getting "Free to wrong pool 2d3e040 not 778ea0 during global destruction." errors. Obviously the memory address changes with each run, but the error is moderately consistent across two different W7 builds. Both are running Strawberry Perl 5.12. One is using threads v1.77 and the other v1.81.

        I've been Googling this error, but haven't found a solution yet. I'd appreciate any suggestions you (or anyone) have on this.

        Thanks

          Revert to threads 1.76 and send a bug report to the module's maintainer, with the polite suggestion that he test his changes more thoroughly.


    Re: Get 10,000 web pages fast
    by Anonymous Monk on Jun 17, 2010 at 12:13 UTC
      I would split the list into 100 files, and launch 100 instances of wget or curl, and let them do the mirroring.
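
      A rough sketch of that approach in Perl, assuming the URLs sit one per line in a urls.txt file and that wget is on the PATH (the chunk count, file names, and output directories are made up for illustration; on Windows you might spawn the processes with system(1, 'wget', ...) instead of fork/exec):

      #!/usr/bin/perl
      # Sketch: split urls.txt into chunk files and hand each chunk to its own wget.
      use strict;
      use warnings;

      my $chunks = 100;                        # how many wget instances to launch
      open my $in, '<', 'urls.txt' or die "urls.txt: $!";
      chomp( my @urls = <$in> );
      close $in;

      my $per = int( @urls / $chunks ) + 1;    # URLs per chunk file
      my @pids;

      for my $i ( 1 .. $chunks ) {
          last unless @urls;
          my @slice = splice @urls, 0, $per;

          my $list = sprintf 'chunk_%03d.txt', $i;
          open my $out, '>', $list or die "$list: $!";
          print {$out} map { "$_\n" } @slice;
          close $out;

          my $pid = fork();
          die "fork failed: $!" unless defined $pid;
          if ( $pid == 0 ) {                   # child: -i reads URLs from a file
              exec 'wget', '-q', '-i', $list, '-P', "out_$i";
              die "exec wget failed: $!";
          }
          push @pids, $pid;
      }

      waitpid $_, 0 for @pids;                 # wait for every batch to finish
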
    Re: Get 10,000 web pages fast
    by aquarium (Curate) on Jun 18, 2010 at 04:03 UTC
      All the monks are giving good help. In terms of resolving the problem with the current code, I'd be asking myself why the getstore function is using much memory at all. Hopefully it's not doing something really silly like slurping a whole page into a scalar before writing it to file. Does the program write any files at all before dying? Is getstore() being passed the hash keys and not actual URLs? Maybe you already have all these answers, or it's not that helpful... but at least I find that sometimes even the most experienced get stumped by simple things. Have fun eating the web (with your script).
      the hardest line to type correctly is: stty erase ^H
    Re: Get 10,000 web pages fast
    by pemungkah (Priest) on Jun 19, 2010 at 02:15 UTC
      First: your script should check the robots.txt at each site to determine whether or not automated scraping is welcome.
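
      A minimal sketch of such a check with WWW::RobotRules (the agent string and the allowed() helper are made up for illustration, and it ignores non-standard ports and robots.txt caching):

      use strict;
      use warnings;
      use LWP::Simple qw( get getstore );
      use WWW::RobotRules;
      use URI;

      my $agent = 'MyFetcher/0.1';             # hypothetical user-agent name
      my $rules = WWW::RobotRules->new( $agent );
      my %checked;                             # hosts whose robots.txt we already fetched

      sub allowed {
          my( $url ) = @_;
          my $host = URI->new( $url )->host;
          unless ( $checked{ $host }++ ) {
              my $robots_url = "http://$host/robots.txt";
              my $txt = get( $robots_url );    # undef if the fetch fails
              $rules->parse( $robots_url, $txt ) if defined $txt;
          }
          return $rules->allowed( $url );
      }

      # in the fetch loop:
      # getstore( $url, $fname ) if allowed( $url );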

      You should also try to make sure that you don't hammer any of the sites you're hitting into the ground, by spreading out accesses to a single server over time. Otherwise you'll unintentionally mount a denial-of-service attack on the site you're fetching from and really tick people off. If these are all different sites, this isn't such a big deal.

      As a rule of thumb, re-accessing another URL on the same site in less than one second may get you noticed and possibly yelled at, both by the person whose site you're scanning and by the ISP you're using (or the IT guys, if you're doing this at work).
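
      One way to enforce that spacing is a per-host timestamp table, sketched below; the one-second gap is just the rule of thumb above, and in a threaded or forked fetcher the %last_hit hash would need to be shared or kept per worker:

      use strict;
      use warnings;
      use Time::HiRes qw( time sleep );
      use URI;

      my $min_gap  = 1.0;                      # seconds between hits on the same host
      my %last_hit;                            # host => time of the previous request

      sub polite_wait {
          my( $url )  = @_;
          my $host    = URI->new( $url )->host;
          my $elapsed = time() - ( $last_hit{ $host } // 0 );
          sleep( $min_gap - $elapsed ) if $elapsed < $min_gap;
          $last_hit{ $host } = time();
      }

      # call polite_wait( $url ) just before each getstore( $url, $fname )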

      Some sites (I know Yahoo! does it from having worked there) will actually stop serving you real pages and just return an error page if you hit them too hard or too often.
