PerlMonks  

dynamic number of threads based on CPU utilization

by mabossert (Beadle)
on Sep 26, 2012 at 14:24 UTC (#995783)
mabossert has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I am writing a program to convert a large number of XML files into RDF triples and also, based on certain fields in the XML, populate a hash with those fields' values. Once the hash has been populated, I run an algorithm that determines the relative similarity of each string to the others.

For the number of threads, I decided to find the number of CPUs in the machine so that the code could be easily ported to another server with more CPUs without any additional configuration.

The code works in both the original single-threaded and the subsequent multi-threaded versions. The issue I am running into is that when the program starts, CPU utilization hovers around 50%, but as the program runs it peters out, eventually dropping to around 3-4%, and the program chugs along, albeit very slowly.

In addition to trying to figure out why the CPU utilization is not staying constant, I would like to find a way to dynamically add threads based on CPU utilization such that more work can be done when the CPU load falls below a certain threshold (arbitrarily, let's say 25%).

I would appreciate any feedback on this code. For brevity, I have not included the subroutine code for the XML conversion, but the rest of the code is here. Also, if there are any ways to speed up the algorithm that compares the strings, I would welcome suggestions.

#!/usr/bin/perl
use strict;
use warnings;
use Carp qw(carp cluck croak confess);
use XML::Hash;
use File::Slurp;
use Date::Parse;
binmode STDOUT, ":utf8";
use threads;
use threads::shared;
use Thread::Queue;
use Sys::CPU;
use Devel::Size qw(size total_size);
use List::MoreUtils qw(uniq);
#use Data::Dumper;

local $| = 1;
print `/bin/date`."\n";

our $THREADS = Sys::CPU::cpu_count() * 2;

my $dir = '/xmlFeeds';
my ( $DIR, @files );
opendir( $DIR, $dir );
foreach ( readdir($DIR) ) {
    push @files, $_ if $_ =~ m/.*\.xml/;
}
closedir($DIR);

my $outFile = './out.nt';
my $OUTFILE;
open( $OUTFILE, '>:utf8', $outFile );

my %similar :shared;
my $recordCount :shared;
$recordCount = 1;

my $Qwork = new Thread::Queue;

## Create the pool of workers
my @pool = map { threads->create( \&worker, $Qwork ) } 1 .. $THREADS;

$Qwork->enqueue(@files);

## Tell the workers there are no more work items
$Qwork->enqueue( (undef) x $THREADS );

## Clean up the threads
$_->join for @pool;

## Process similar domain names, file names, etc.
## Create the pool of workers
@pool = ();
@pool = map { threads->create( \&worker2, $Qwork ) } 1 .. $THREADS;

foreach ( keys %similar ) {
    $Qwork->enqueue( { $_ => $similar{$_} } );
}

## Tell the workers there are no more work items
$Qwork->enqueue( (undef) x $THREADS );

## Clean up the threads
$_->join for @pool;

close($OUTFILE);
print `/bin/date`."\n";

sub worker {
    my $tid = threads->tid;
    my ( $Qwork ) = @_;
    while ( my $file = $Qwork->dequeue ) {
        my $triple = procXml($file);
        print $OUTFILE $triple if defined $triple;
    }
}

sub worker2 {
    my $tid = threads->tid;
    my ( $Qwork ) = @_;
    while ( my $file = $Qwork->dequeue ) {
        while ( my ( $key, $val ) = each(%$file) ) {
            my $triple = simAlg( $key, $val );
            print $OUTFILE $triple if defined $triple;
        }
    }
}

sub simAlg {
    my ( $dom, $type ) = @_;
    my $triple;
    chomp $dom;
    my @w1 = unpack( "(A2)*", $dom );
    @w1 = uniq(@w1);
    my %w1H = map { $_ => 1 } @w1;
    foreach my $odom ( keys %similar ) {
        chomp $odom;
        my $innerType = $similar{$odom};
        next if $odom eq $dom;
        my @w2 = unpack( "(A2)*", $odom );
        @w2 = uniq(@w2);
        my @counter = grep( $w1H{$_}, @w2 );
        my $value = ( (@counter) * 2 ) / ( @w1 + @w2 );
        if ( $value >= 0.9 ) {
            $triple .= qq|<http://cs.org/$type#$dom> <http://cs.org/p/similarName> <http://cs.org/$innerType#$odom> .\n|;
            print $triple;
        }
    }
    return $triple;
}

sub procXml { [code here] }

Re: dynamic number of threads based on CPU utilization
by daxim (Chaplain) on Sep 26, 2012 at 14:52 UTC
Re: dynamic number of threads based on CPU utilization
by BrowserUk (Pope) on Sep 26, 2012 at 15:44 UTC
    In addition to trying to figure out why the CPU utilization is not staying constant, I would like to find a way to dynamically add threads based on CPU utilization such that more work can be done when the CPU load falls below a certain threshold (arbitrarily, let's say 25%).

    Forget the idea of throwing more threads at the problem in order to increase your cpu utilisation.

    • If twice as many cpu-bound (no IO) threads as you have cores is not sufficient to fully utilise your cpu capacity, then the problem is something to do with the design of your code.

      And throwing more threads at the problem will likely slow it down rather than speed it up.

    • Whatever tool or system call you use to measure cpu utilisation, the figure you get will be an instantaneous value.

      And even on the most cpu bound system, it will not be a smooth, roughly constant figure, but will fluctuate wildly between 0% at one instant and 100% at the next. This is normal, but completely useless as a basis for decisions.

      Even if you apply some smoothing algorithm over time, the effects of feedback mean that you will simply overload your memory until no thread can run without another thread being swapped out; with the consequence that all your cpu time is spent switching threads and performing swap IO. Your system will simply grind to a halt.

    • So please, ignore any suggestions to measure cpu and employ busy loops and wait loops to control cpu utilisation.

      They might succeed in consuming 100% of your cpu, BUT THEY WILL NOT IMPROVE YOUR WORK RATE OR CUT YOUR PROGRAMS RUNTIME.

    trying to figure out why the CPU utilization is not staying constant ...

    In the first half of your program, less than 100% utilisation is to be expected because a) you are doing disk IO, reading the XML files from disk; and b) you are writing to your shared %similar hash, which involves locking -- whether you are doing it explicitly (which you do not show) or just relying on the implicit locking that perl does internally for its own protection.

    In the second half of your program, you are iterating over that same shared hash, which means that internal locking comes into play. (But that's probably not a big problem for that part of your program.)

    Much worse is that each of your second pool of threads is performing the same bi-gramming process on each of the keys of the shared hash, for each work item it receives. AND EACH OF THOSE WORK ITEMS IS ITSELF ONE OF THE KEYS OF THAT HASH!

    That means you are doing an O(n^2) process of comparing each key against each key, and re-bi-gramming both keys every time. That is monumentally wasteful.

    If you fix that algorithm -- ie. bi-gram each of the keys in the hash once, up front -- and then compare them pairwise, a single thread will probably finish much faster than your current mechanism. At which point, all the talk of dynamically spawning threads according to cpu load becomes moot.

    Once you have made your similarity algorithm implementation efficient -- single-threaded -- it might then be worth considering whether it can be sped up using threading. But for now, ditch your second thread pool, use a single-threaded loop to construct bi-gram arrays for each of your keys; then use a second loop to compare them against one another.
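    A minimal single-threaded sketch of that two-pass shape (the toy keys are hypothetical; the non-overlapping `(A2)*` bi-gramming and the 0.9 Dice-style threshold are taken from the posted code, and each unordered pair is compared only once):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);

# Hypothetical stand-in for the OP's %similar hash: name => type.
my %similar = (
    'secure-login-portal' => 'domain',
    'secure-login-portax' => 'domain',
    'unrelated'           => 'file',
);

## Pass 1: bi-gram every key exactly once, keeping the sets for reuse.
my %bigrams;
for my $dom ( keys %similar ) {
    my %set = map { $_ => 1 } uniq( unpack '(A2)*', $dom );
    $bigrams{$dom} = \%set;
}

## Pass 2: compare each unordered pair once, reusing the stored sets.
my @doms = sort keys %similar;
my @matches;
for my $i ( 0 .. $#doms - 1 ) {
    for my $j ( $i + 1 .. $#doms ) {
        my ( $d1, $d2 ) = @doms[ $i, $j ];
        my $shared = grep { exists $bigrams{$d2}{$_} } keys %{ $bigrams{$d1} };
        my $value  = 2 * $shared
                   / ( keys( %{ $bigrams{$d1} } ) + keys( %{ $bigrams{$d2} } ) );
        push @matches, "$d1 ~ $d2" if $value >= 0.9;
    }
}
print "$_\n" for @matches;
```

    In the posted O(n^2) loop every comparison re-unpacks and re-uniqs both strings; here each key is bi-grammed once, and each symmetric pair is computed only once (note the OP's version would emit the triple in both directions).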


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      Thanks for the feedback with the bi-gramming. It turns out, I can simply store the bi-gram array directly as part of the hash...I will give that a shot later today and see how it does.

        it turns out, I can simply store the bi-gram array directly as part of the hash..

        That will work, but be aware that accessing the data in a shared data structure is significantly slower than accessing non-shared memory. This is due to the need for internal locking amongst other things.

        So if you run a second pool of threads, each accessing that shared hash containing shared arrays, the lock contention between the threads will likely slow your processing to a crawl. (A simple fact of life with perl's shared data structures).

        Try the version I posted and see how you get on.



Re: dynamic number of threads based on CPU utilization
by sundialsvc4 (Monsignor) on Sep 26, 2012 at 15:47 UTC

    Remember, the work being done here is I/O-bound, not CPU bound. This means that the processor is spending nearly all of its time getting the next I/O operation started, then everyone goes to sleep again until the next I/O is complete. Therefore, CPU utilization would not be expected to be a useful bellwether of how much work is now being done, or could potentially be done. If anything is to get saturated, it's most likely to be I/O capacity. (Unless you are blowing memory with some too-big-for-its-britches hash and thus dropping into thrashing-hell; dunno.)

    Perhaps you could consider writing this program so that it simply is given a directory name as an input parameter, munches through that directory and its subdirectories, doing its thing, then writes the completed output (say...) to a shared SQLite database file. Now the job can be done, appropriate to each machine and to the changing workload, simply by launching multiple copies of the program simultaneously from the command line with different parameters. This would achieve the same goal ... of exploiting parallelism ... but with a considerable reduction of internal complexity, handing more influence back to the user. A command-line parameter to “consider only newer files,” etc., might be a useful option.
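    A sketch of that partition-by-directory shape (`job_for_dir` is a hypothetical helper name, the demo files are fabricated, and the SQLite step is left out -- each job just gets its own output file so the example stays self-contained):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# One independent single-threaded job per directory: list its *.xml files
# and derive a per-job output name, so several copies of the converter can
# run at once without sharing any state. The OP's procXml() would slot in
# where the commented loop is.
sub job_for_dir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "cannot open $dir: $!";
    my @files = sort grep { /\.xml\z/ } readdir $dh;
    closedir $dh;
    ( my $tag = $dir ) =~ s/[^\w]+/_/g;    # directory path -> file tag
    return { files => \@files, out => "out$tag.nt" };
}

# Demo on a throwaway directory holding two fake feeds and one non-XML file.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.xml b.xml notes.txt)) {
    open my $fh, '>', "$dir/$name" or die $!;
    close $fh;
}

my $job = job_for_dir($dir);
# for my $f ( @{ $job->{files} } ) { ... procXml("$dir/$f") ... }
print "would convert @{ $job->{files} } into $job->{out}\n";
```

    Because each process owns its input directory and its output file, no cross-process locking is needed at all.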

      Remember, the work being done here is I/O-bound, not CPU bound.

      Why are you doing this?

      For 90%+ of the runtime of the OP's program, IT IS CPU BOUND NOT IO BOUND.

      So just stop regurgitating your useless, pointless, irrelevant, and gratuitously incorrect home-spun wisdoms.



        My apologies...I thought that since the procXml sub worked just fine, it would not be relevant to the discussion or potential solution. Within procXml, I simply slurp the file into a hash, then operate on the hash.

        I was under the impression that because I was operating on the file contents in memory (i.e. the hash), it was a mostly CPU-bound process (minus slurping the input file and printing to the output file).

        When a CPU runs at literally billions of ops per second these days, and it is drawing its inputs from a large number of files, then ... it is I/O-bound, because that's what it is waiting on. Nanoseconds vs. milliseconds. The completion time of this program, over the course of let us say one minute, will chiefly be regulated by its ability to perform input/output, not by the speed of the processor(s). If you were to place the program onto a CPU that ran twice as fast, all other things being equal, such a program would not complete in half the time. If it truly were CPU-bound, then it would not “slow down” a-n-d drop out of CPU utilization at the same time, as it is reported to be doing.

        As you say in the (upvoted) earlier comment, this is a poorly thought-out program from the start. I would further guess that the hash might well have become enormous by that time, and that quite possibly the program has descended into “thrashing hell.” Something, and it can only be I/O, is utterly preventing the CPU from getting any work done during the second phase. Thrashing is about the only culprit that exists to explain that.

Re: dynamic number of threads based on CPU utilization
by BrowserUk (Pope) on Sep 26, 2012 at 16:23 UTC

    This compiles clean, but is of necessity untested.

    Substitute your procXml() routine and it should come close to doing the same thing(*) as the code you posted, but rather more quickly:

    *But verify carefully that I've refactored it correctly!

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Carp qw(carp cluck croak confess);
    use XML::Hash;
    use File::Slurp;
    use Date::Parse;
    binmode STDOUT, ":utf8";
    use threads;
    use threads::shared;
    use Thread::Queue;
    use Sys::CPU;
    use Devel::Size qw(size total_size);
    use List::MoreUtils qw(uniq);
    #use Data::Dumper;

    local $| = 1;
    print `/bin/date`."\n";

    our $THREADS = Sys::CPU::cpu_count() * 2;

    my $dir = '/xmlFeeds';
    my ( $DIR, @files );
    opendir( $DIR, $dir );
    foreach ( readdir($DIR) ) {
        push @files, $_ if $_ =~ m/.*\.xml/;
    }
    closedir($DIR);

    my $outFile = './out.nt';
    my $OUTFILE;
    open( $OUTFILE, '>:utf8', $outFile );

    my %similar :shared;
    my $recordCount :shared;
    $recordCount = 1;

    my $Qwork = new Thread::Queue;

    ## Create the pool of workers
    my @pool = map { threads->create( \&worker, $Qwork ) } 1 .. $THREADS;

    $Qwork->enqueue(@files);

    ## Tell the workers there are no more work items
    $Qwork->enqueue( (undef) x $THREADS );

    ## Clean up the threads
    $_->join for @pool;

    my @doms = keys %similar;    ## get keys into non-shared space for speed

    my %bigrams;
    for my $dom ( @doms ) {
        undef @{ $bigrams{ $dom } }{ uniq( unpack '(A2)*', $dom ) };
    }

    for my $dom1 ( @doms ) {
        my $type  = $similar{ $dom1 };
        my $cDom1 = keys %{ $bigrams{ $dom1 } };
        for my $dom2 ( @doms ) {
            next if $dom1 eq $dom2;
            my $innerType = $similar{ $dom2 };
            my $cDom2     = keys %{ $bigrams{ $dom2 } };
            my $counter   = grep { exists $bigrams{ $dom1 }{ $_ } } keys %{ $bigrams{ $dom2 } };
            my $value     = ( $counter * 2 ) / ( $cDom1 + $cDom2 );
            if ( $value >= 0.9 ) {
                my $triple = qq|<http://cs.org/$type#$dom1> <http://cs.org/p/similarName> <http://cs.org/$innerType#$dom2> .\n|;
                print $triple;
                print $OUTFILE $triple;
            }
        }
    }

    close($OUTFILE);
    print `/bin/date`."\n";

    sub worker {
        my $tid = threads->tid;
        my ( $Qwork ) = @_;
        while ( my $file = $Qwork->dequeue ) {
            my $triple = procXml($file);
            print $OUTFILE $triple if defined $triple;
        }
    }

    sub procXml { [code here] }


      Thanks for this! I will try this out shortly. I have to run to a meeting, but will try it out after that.

      This brings me to another question: assuming that my "similar" hash would be populated with anywhere from several hundred thousand to a million or more key/value pairs...is there a better way to tackle this? I am working on a blade server with 24 physical CPUs and more than 500 GB of RAM...I must be able to determine the similarity metric (keeping only those that are a 90% or better match) of every single key against every other key. Given those resources and requirements...what are your thoughts?

        several hundred thousand to a milion or more key value pairs

        How big are the keys and values on average? And how big are the xml files on average?

        I am working on a blade server with 24 physical CPU's and more than 500gb of RAM

        Is the blade server set up as a single SMP system? How many cores/threads per cpu?

        Given those resources and requirements...what are your thoughts?

        I'd want to see the answers to the above questions before reaching any conclusions about how I would go about tackling the problem.

        Sight (public or private) of a typical example of the XML input and the keys/value pairs derived from it would help.



Re: dynamic number of threads based on CPU utilization
by bulk88 (Priest) on Sep 27, 2012 at 03:34 UTC
    I would use Time::HiRes and start putting time counters in the worker*() subs. $OUTFILE is being used by multiple threads; I don't know how bad the performance of printing to $OUTFILE from multiple threads will be, but it doesn't look good. Regarding shared variables: they are implemented with locks and a "master" central variable in a hidden ithread/interpreter.
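    A sketch of that kind of instrumentation (the work sub here is a hypothetical stand-in for procXml(); in the real worker, one timer would wrap the dequeue and the other the processing):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical CPU-ish stand-in for procXml().
sub do_work { my $x = 0; $x += sqrt($_) for 1 .. 50_000; return $x }

my ( $wait_t, $work_t, $items ) = ( 0, 0, 0 );

for my $item ( 1 .. 5 ) {        # stands in for the while-dequeue loop
    my $t0 = [gettimeofday];
    # ... $Qwork->dequeue would block here; its cost lands in $wait_t ...
    $wait_t += tv_interval($t0);

    my $t1 = [gettimeofday];
    do_work();
    $work_t += tv_interval($t1);
    $items++;
}

# A worker that reports wait >> work is starved for input (IO-bound);
# one that reports work >> wait is CPU-bound.
printf "items=%d wait=%.4fs work=%.4fs\n", $items, $wait_t, $work_t;
```

    Comparing the two totals per thread tells you directly whether the pool is starved by IO/locking or saturated by computation, without guessing from system-wide CPU figures.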

    If you can not reduce the number of reads and writes, or rethink your algorithm/use of threads at all (other suggestions in this thread), I suggest outright *locking* %similar for the duration of the "foreach my $odom(keys %similar) {" loop in simAlg; or locking %similar, copying it to a my %local_similar, doing the loop on %local_similar, copying %local_similar back to %similar, and releasing the lock. If your algorithm's needs allow, you could copy %similar to %local_similar once and never look at %similar again for the rest of the job item.
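    A sketch of the snapshot variant (toy data; the lock is held only long enough to copy, so the long comparison loop runs on fast, unshared storage -- note this is the shallow-copy case, and nested shared structures would need a deeper copy):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;

my %similar :shared = ( 'a.com' => 'domain', 'b.com' => 'domain' );

my $thr = threads->create( sub {
    # Hold the lock only while snapshotting the shared hash into
    # ordinary, unshared, thread-local storage...
    my %local_similar;
    {
        lock(%similar);
        %local_similar = %similar;
    }
    # ...then do the heavy iteration on the local copy, lock-free.
    my $seen = 0;
    $seen++ for keys %local_similar;
    return $seen;
} );

my $count = $thr->join;
print "worker snapshotted $count keys\n";
```

    Every read of a shared hash pays the internal locking cost, so amortising it into one bulk copy per work item can be a large win when the loop body touches the hash many times.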
