Re^5: Consumes memory then crashs

by BrowserUk (Pope)
on Mar 24, 2012 at 15:40 UTC (#961415)


in reply to Re^4: Consumes memory then crashs
in thread Consumes memory then crashs

Now I am back to where I started, 1 response every 1 or 2 seconds, too slow. Is there a simple solution to thread this properly?

Yes. Try this:

#! perl -slw
use strict;
use threads;
use threads::shared;
use Thread::Queue;
use LWP::Simple;

my $sem :shared;

sub lookup {
    my( $fh, $name ) = @_;
    my $lookup = get(
        "http://rscript.org/lookup.php?type=track&time=62899200&user=$name&skill=all"
    );
    print "Looking up $name...\n";
    if( $lookup =~ m/gain:Overall:\d+:(\d+)/isg ) {
        lock $sem;
        print { $fh } "$name $1\n";
    }
    elsif( $lookup =~ m/(ERROR)/isg ) {
        lock $sem;
        print { $fh } "$name doesn't exist \n";
    }
    else {
        lock $sem;
        print { $fh } "$name 0\n";
    }
}

our $THREADS //= 4;

my $names = 'zezima fred bill john jack';

my $Q = new Thread::Queue;

open( LOOKUP, '>>rstlookup.txt' ) or die $!;

my @threads = map async( sub {
    while( my $name = $Q->dequeue ) {
        lookup( \*LOOKUP, $name );
    }
} ), 1 .. $THREADS;

while( $names =~ m/([a-z0-9_]+)/isg ) {
    $Q->enqueue( $1 );
    sleep 1 while $Q->pending > $THREADS * 2;
}

$Q->enqueue( (undef) x $THREADS );
$_->join for @threads;

close( LOOKUP );

__END__
[15:38:57.93] C:\test>junk39
Looking up john...
Looking up bill...
Looking up fred...
Looking up zezima...
Looking up jack...

[15:39:03.07] C:\test>type rstlookup.txt
bill 0
fred 135601422
zezima 417155645
john 0
jack 8133157
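
(A side note on the worker count: because of the -s on the shebang line and the our $THREADS //= 4; default, the number of workers can be overridden from the command line rather than by editing the script. A hypothetical invocation, assuming the script is saved as junk39 as in the run above:)

C:\test>junk39 -THREADS=8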

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?


Re^6: Consumes memory then crashs
by allhellno (Novice) on Mar 24, 2012 at 16:00 UTC
    Excellent, this has taught me a bit and is greatly appreciated!

      Sorry for the delay in getting back to you, but installing and configuring my new motherboard took longer than anticipated.

      Please refresh the threaded code I posted above; I added a fix.



Re^6: Consumes memory then crashs
by mbethke (Hermit) on Mar 24, 2012 at 19:35 UTC

    This is actually a good case in point for my earlier post on threads. In a way, threads are like gatling guns: if you're The Terminator and can handle them, they can be very effective; for most people, however, they provide a million opportunities for shooting themselves in the foot. Unlike gatling holes, though, the holes they produce may be rather subtle, and may only appear after a long period of seemingly successful use: the typical heisenbugs that show up once in a while, but never while you are looking closely.

    The problem here is that print is not atomic; in fact, most of stdio is off-limits in threaded code without further protective measures. A thread may be preempted after writing a fraction of a buffer and then resume after another thread has written to the same file. In your example, which waits a long time between printing lines, the probability of this happening is very small, but that doesn't mean it can't happen to the first two lines of output. Here's a script that provokes it:

    use strict;
    use threads;

    open my $fh, '>', 'outfile' or die $!;

    my $th = 0;
    my @threads = map {
        $th++;
        async( sub {
            sleep(1);
            for (1 .. 30_000) { print $fh "Thread $th\n" }
        } );
    } (1 .. 500);

    $_->join foreach @threads;
    close $fh;

    Sample output snippet:

    Thread 3Thread 349 Threead 85 Thread Thread 333 Thre59 ad 349 Thread 3ad 333 Thread 3Thread 359 Thre49 33 ad 295 ThreThread 333 Thread 8Thread 338 ThreThread 350

    For an application like retrieving a large number of web pages, where waiting for the other side is the major cause of delays (so spreading the work over multiple cores brings no significant advantage), the solution of choice is a state machine. Event-based programming may look like a lot to wrap one's head around, but in the end it is easier to understand than threads once you consider all the rather low-level race conditions and other synchronization issues you have to think about to write threaded code that works every time, not just most of the time.

    Regarding modules to facilitate the implementation of said state machine, one I found easy to use (actually the only one I've ever used in production code) is POE::Component::Client::HTTP (edited: it's been a while and the name didn't sound quite right; I originally wrote POE::Component::Client::UserAgent). POE is rather heavyweight though (not that it matters much here), so AnyEvent::Curl::Multi might be worth a look too.

      This may be a little like buying the whole kitchen sink just to get the faucet knob, but Mojolicious contains Mojo::UserAgent, which can be used as a non-blocking, event-driven user agent. The Mojolicious::Guides::Cookbook contains a section on "Non-blocking" user-agent examples.

      I think the whole Mojolicious distribution is around 2 MB, which is a lot for just a user agent, but trivially small if you have a use for the other features as well. Plus, it has no external dependencies (one of the Mojolicious design goals, which some consider "a good thing").
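
      (For illustration only, a minimal sketch of the non-blocking pattern that Cookbook section describes, assuming a Mojolicious of roughly that vintage; the URL is just a placeholder:)

      use strict;
      use warnings;
      use Mojo::UserAgent;
      use Mojo::IOLoop;

      my $ua = Mojo::UserAgent->new;

      # Passing a callback to get() makes the request non-blocking;
      # it is completed by the Mojo::IOLoop reactor instead.
      $ua->get( 'http://example.com/' => sub {
          my ( $ua, $tx ) = @_;
          print $tx->res->body;
      } );

      # Run the event loop unless something else already is.
      Mojo::IOLoop->start unless Mojo::IOLoop->is_running;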


      Dave

      The problem here is that print is not atomic ... Here's a script that provokes it:

      Hm. And the fix is sooo complicated:

      use strict;
      use threads;
      use threads::shared;

      my $sem :shared;

      open my $fh, '>', 'outfile' or die $!;

      my $th = 0;
      my @threads = map {
          $th++;
          async( sub {
              sleep(1);
              for (1 .. 30_000) { lock $sem; print $fh "Thread $th\n" }
          } );
      } (1 .. 500);

      $_->join foreach @threads;
      close $fh;

      A whole 3 lines.
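
      (An alternative sketch, not from the post above: rather than locking around every print, all output can be funnelled through a single writer thread via a second Thread::Queue, so only one thread ever touches the filehandle. The names below are illustrative:)

      use strict;
      use warnings;
      use threads;
      use Thread::Queue;

      my $outQ = Thread::Queue->new;

      # Dedicated writer: the only thread that prints to the file.
      my $writer = async {
          open my $fh, '>', 'outfile' or die $!;
          while( defined( my $line = $outQ->dequeue ) ) {
              print { $fh } $line;
          }
          close $fh;
      };

      # Workers enqueue complete lines instead of printing directly.
      my @workers = map {
          my $th = $_;
          async { $outQ->enqueue( "Thread $th\n" ) for 1 .. 30_000 };
      } 1 .. 500;

      $_->join for @workers;
      $outQ->enqueue( undef );    # tell the writer it can finish
      $writer->join;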

      Regarding modules to facilitate the implementation of said state machine, one I found easy to use (actually the only one I've ever used in production code) is POE::Component::Client::HTTP (edited: it's been a while and the name didn't sound quite right; I originally wrote POE::Component::Client::UserAgent). POE is rather heavyweight though (not that it matters much here), so AnyEvent::Curl::Multi might be worth a look too.

      Okay, so where's the code? How about you run what you brung?

      Betcha don't!

      And if you do, betcha it takes you 10 times longer to write; requires 10 times as much (user) code; requires 20 times as many support modules that accumulate to 30 times as much actual non-core code to trust the authors of and require outside support for when it goes wrong; and finally, runs slower and less efficiently than the 5-minutes-to-write, 30-line threaded script above.



        Okay, so where's the code?

        You just had to follow the link provided by davido to find an example. It took me 5 minutes to adapt it, even though I had never used Mojo before:

        use 5.010;
        use strict;
        use warnings;

        use Mojo::UserAgent;
        use Mojo::IOLoop;
        use Mojo::URL;

        # FIFO queue
        my @names = qw(zezima fred bill john jack);

        # User agent with a short inactivity timeout
        my $ua = Mojo::UserAgent->new( inactivity_timeout => 1 );

        sub url_for_name {
            my $name = shift;
            return "http://rscript.org/lookup.php?type=track&time=62899200&user=$name&skill=all";
        }

        # Crawler
        my $crawl;
        $crawl = sub {
            my $id = shift;

            return unless my $name = shift @names;
            say "Looking for $name";

            # Fetch non-blocking just by adding a callback
            $ua->get(
                url_for_name($name) => sub {
                    my ( $ua, $tx ) = @_;
                    my $body = $tx->res->body;

                    if ( $body =~ m/gain:Overall:\d+:(\d+)/i ) {
                        say "$name $1";
                    }
                    elsif ( $body =~ m/(ERROR)/i ) {
                        say "$name doesn't exist";
                    }
                    else {
                        say "$name 0";
                    }

                    # Next
                    $crawl->($id);
                }
            );
        };

        # Start a bunch of parallel crawlers sharing the same user agent
        $crawl->($_) for 1 .. 4;

        # Start reactor
        Mojo::IOLoop->start;
        And if you do, betcha it takes you 10 times longer to write; requires 10 times... less efficiently...

        And that is just rubbish, especially when it comes to efficiency. I have handled hundreds of simultaneous connections with AnyEvent::HTTP, and it didn't consume a lot of CPU or memory; with threads it would have gone into swap.
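
        (For reference, a minimal sketch of that style with AnyEvent::HTTP, assuming the module is installed; the concurrency cap and the handler body are illustrative:)

        use strict;
        use warnings;
        use AnyEvent;
        use AnyEvent::HTTP;

        my @urls = map { "http://rscript.org/lookup.php?type=track&time=62899200&user=$_&skill=all" }
                   qw( zezima fred bill john jack );

        my $MAX_CONCURRENT = 4;          # illustrative cap on simultaneous requests
        my $active         = 0;
        my $cv             = AE::cv;     # fires once every begin() has a matching end()

        my $spawn; $spawn = sub {
            while( $active < $MAX_CONCURRENT and @urls ) {
                my $url = shift @urls;
                ++$active;
                $cv->begin;
                http_get $url, sub {
                    my ( $body, $hdr ) = @_;
                    # ... examine $body here, as in the other examples ...
                    --$active;
                    $cv->end;
                    $spawn->();          # keep the pipeline full
                };
            }
        };

        $spawn->();
        $cv->recv;                       # run the event loop until everything has finished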

        No need to get ironic: my point was merely that threaded code is hard to get right, even for a good and experienced programmer. Your first example illustrated that well, and so does your fix. Here's a comparison of runtimes:

        $ time perl threads_yours.pl

        real    7m13.312s
        user    2m1.484s
        sys     10m8.982s

        $ time perl threads_mine.pl

        real    0m2.209s
        user    0m6.028s
        sys     0m0.604s

        And here's a bit of the output:

        $ uniq -c outfile | head
           2275 Thread 1
              1 ThreaThread 2
            454 Thread 2
              1 TThread 3
            909 Thread 3
              1 ThThread 5
            454 Thread 5
              1 Td 1
           1364 Thread 1
              1 Thread 1hread 2

        This is a bog-standard Perl 5.10.1 on AMD64/i7 as it comes with Debian Squeeze¹. Either you overlooked yet another pitfall (I can't tell what it is; my point again: it looks deceptively simple but ain't), or the library implementation is buggy. Either way, its behavior is certainly more correct than my example's, at the cost of being almost 200 times slower, but it is not quite correct yet.

        As for the POE code, it did indeed take me some 15 minutes to write:

        #!/usr/bin/perl
        use strict;
        use POE qw(Component::Client::HTTP);
        use HTTP::Request;

        my @names = qw/ zezima fred bill john jack /;

        open my $fh, '>', 'outfile' or die $!;

        sub start_req {
            my $name = shift @names or return;
            POE::Kernel->post(
                weeble => request => response =>
                HTTP::Request->new(
                    GET => "http://rscript.org/lookup.php?type=track&time=62899200&user=$name&skill=all"
                ),
                $name
            );
        }

        POE::Session->create(
            inline_states => {
                _start => sub {
                    POE::Component::Client::HTTP->spawn;
                    start_req for (1 .. 5);
                },
                response => sub {
                    my $name   = $_[ARG0]->[1];
                    my $result = $_[ARG1]->[0]{_content};
                    if( $result =~ m/gain:Overall:\d+:(\d+)/isg ) {
                        print { $fh } "$name $1\n";
                    }
                    elsif( $result =~ m/(ERROR)/isg ) {
                        print { $fh } "$name doesn't exist \n";
                    }
                    else {
                        print { $fh } "$name 0\n";
                    }
                    start_req;
                },
            },
        );

        POE::Kernel->run;
        close $fh;

        I haven't used that component in the last two years, so yeah, I was slow because I had to look up the defaults and how to pass the HTTP::Request object again. As long as an implementation doesn't trigger my pager by barfing at 4 in the morning and then look all innocent when I try to debug it, I think that's time well spent.

        As for efficiency: no. The 500 do-almost-nothing threads in your code (mine didn't run long enough to register in top) need a resident set of 327 MB here (virtual size is slightly over 4 GB); I didn't try it with the web scraper, but I don't see how that could do any better. If I start 500 parallel requests in the POE version (well, my line here is 256 kbit on a sunny day during low tide ...), it takes 21 MB as opposed to 18 MB with five requests. Code-size-wise, all the POE modules I have installed together come to slightly over 40 kLOC including POD; I'll leave the comparison to Perl's thread code, plus pthreads or whatever it builds on on your box, to you. My user code is 36 non-empty lines (OK, I cheated you out of two lines in the if/elsif block because that matches my style); yours is 42 so far.

        ¹ 5.14.2 on a newer kernel and a Phenom II shows the same behavior; it takes only 260 MB but is even slower. I stopped it after more than 20 minutes of CPU time.
