PerlMonks  

Re: dynamic number of threads based on CPU utilization

by BrowserUk (Pope)
on Sep 26, 2012 at 16:23 UTC (#995809)


in reply to dynamic number of threads based on CPU utilization

This compiles clean, but is of necessity untested.

Substitute your procXml() routine and it should come close to doing the same thing(*) as the code you posted, but rather more quickly:

*But verify carefully that I've refactored it correctly!

#!/usr/bin/perl
use strict;
use warnings;
use Carp qw(carp cluck croak confess);
use XML::Hash;
use File::Slurp;
use Date::Parse;
binmode STDOUT, ":utf8";
use threads;
use threads::shared;
use Thread::Queue;
use Sys::CPU;
use Devel::Size qw(size total_size);
use List::MoreUtils qw(uniq);
#use Data::Dumper;

local $| = 1;

print `/bin/date`."\n";
our $THREADS = Sys::CPU::cpu_count()*2;

my $dir='/xmlFeeds';
my ($DIR,@files);
opendir($DIR,$dir);
foreach(readdir($DIR)) {
    push @files, $_ if $_ =~ m/.*\.xml/;
}
closedir($DIR);

my $outFile='./out.nt';
my $OUTFILE;
open($OUTFILE,'>:utf8',$outFile);

my %similar :shared;
my $recordCount :shared;
$recordCount=1;

my $Qwork = new Thread::Queue;

## Create the pool of workers
my @pool = map{
    threads->create( \&worker, $Qwork )
} 1 .. $THREADS;

$Qwork->enqueue(@files);

## Tell the workers there are no more work items
$Qwork->enqueue( (undef) x $THREADS );

## Clean up the threads
$_->join for @pool;

my @doms = keys %similar; ## get keys into non-shared space for speed

my %bigrams;
for my $dom ( @doms ) {
    undef @{ $bigrams{ $dom } }{ uniq( unpack '(A2)*', $dom ) };
}

for my $dom1 ( @doms ) {
    my $type = $similar{ $dom1 };
    my $cDom1 = keys %{ $bigrams{ $dom1 } };

    for my $dom2 ( @doms ) {
        next if $dom1 eq $dom2;
        my $innerType = $similar{ $dom2 };
        my $cDom2 = keys %{ $bigrams{ $dom2 } };

        my $counter = grep{
            exists $bigrams{ $dom1 }{ $_ }
        } keys %{ $bigrams{ $dom2 } };

        my $value = ( $counter * 2 ) / ( $cDom1 + $cDom2 );

        if( $value >= 0.9 ) {
            my $triple = qq|<http://cs.org/$type#$dom1> <http://cs.org/p/similarName> <http://cs.org/$innerType#$dom2> .\n|;
            print $triple;
            print $OUTFILE $triple;
        }
    }
}
close($OUTFILE);
print `/bin/date`."\n";

sub worker {
    my $tid = threads->tid;
    my( $Qwork ) = @_;
    while( my $file = $Qwork->dequeue ) {
        my $triple = procXml($file);
        print $OUTFILE $triple if defined $triple;
    }
}

sub procXml {
    [code here]
}
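
The similarity test in the nested loops above amounts to a Dice coefficient over sets of two-character chunks. For checking that calculation in isolation, here is a minimal sketch. The helper names chunk_set and dice are hypothetical and not part of the posted code, the sample names are made up, and the guard against a zero denominator is an added assumption. Note that unpack '(A2)*' yields consecutive, non-overlapping chunks rather than sliding bigrams, which is what the posted code does too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the set of distinct two-character chunks of a string.
# unpack '(A2)*' gives consecutive, non-overlapping chunks:
# "abcd" becomes "ab", "cd" (hypothetical helper, not in the post).
sub chunk_set {
    my( $str ) = @_;
    my %set;
    @set{ unpack '(A2)*', $str } = ();    # hash slice de-duplicates
    return \%set;
}

# Dice coefficient: 2 * (chunks in common) / (size of A + size of B).
sub dice {
    my( $setA, $setB ) = @_;
    my $total = keys( %$setA ) + keys( %$setB );
    return 0 unless $total;               # assumed guard for two empty strings
    my $common = grep { exists $setA->{ $_ } } keys %$setB;
    return 2 * $common / $total;
}

# Made-up sample names, for illustration only.
my $x = chunk_set( 'example.com' );
my $y = chunk_set( 'example.org' );
printf "%.3f\n", dice( $x, $y );
```

Run on its own, this prints 0.667 for the two sample names (4 shared chunks out of 6 + 6), which would fall below the 0.9 cut-off used in the listing.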

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

RIP Neil Armstrong


Re^2: dynamic number of threads based on CPU utilization
by mabossert (Beadle) on Sep 26, 2012 at 16:41 UTC

    Thanks for this! I will try this out shortly. I have to run to a meeting, but will try it out after that.

Re^2: dynamic number of threads based on CPU utilization
by mabossert (Beadle) on Sep 27, 2012 at 01:03 UTC

    This brings me to another question: assuming that my "similar" hash would be populated with anywhere from several hundred thousand to a million or more key-value pairs...is there a better way to tackle this? I am working on a blade server with 24 physical CPUs and more than 500 GB of RAM...I must be able to determine the similarity metric (keeping only those that are a 90% or better match) of every single key to every other key. Given those resources and requirements...what are your thoughts?

      several hundred thousand to a million or more key-value pairs

      How big are the keys and values on average? And how big are the xml files on average?

      I am working on a blade server with 24 physical CPUs and more than 500 GB of RAM

      Is the blade server set up as a single SMP system? How many cores/threads per CPU?

      Given those resources and requirements...what are your thoughts?

      I'd want to see the answers to the above questions before reaching any conclusions about how I would go about tackling the problem.

      Sight (public or private) of a typical example of the XML input and the keys/value pairs derived from it would help.



        The server is set up as a single SMP system, with each CPU being single-core and single-threaded. The values average 10 bytes, the keys are roughly 20-30 bytes on average, and the XML files vary wildly between 1 MB and 110 MB. Unfortunately, I can't share the XML files as they belong to a customer and contain proprietary information.
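
For a sense of scale, the "every single key to every other key" requirement can be worked out directly. The sketch below is only illustrative: pair_count and each_pair are hypothetical helpers, and 300,000 is an arbitrary stand-in for "several hundred thousand". It assumes the similarity measure is symmetric (as the Dice-style calculation in the posted code is), so each unordered pair would need to be scored only once, whereas the nested loops in the posted code visit every ordered pair.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Unordered pairs among $n keys: n * (n - 1) / 2.
sub pair_count {
    my( $n ) = @_;
    return $n * ( $n - 1 ) / 2;
}

# Visit each unordered pair exactly once and pass it to a callback
# (hypothetical helper, not part of the posted code).
sub each_pair {
    my( $items, $callback ) = @_;
    for my $i ( 0 .. $#$items ) {
        for my $j ( $i + 1 .. $#$items ) {
            $callback->( $items->[ $i ], $items->[ $j ] );
        }
    }
    return;
}

# Illustrative key counts; 300_000 is an assumed example value.
for my $n ( 300_000, 1_000_000 ) {
    printf "%9d keys: %.0f unordered pairs, %.0f ordered pairs\n",
        $n, pair_count( $n ), 2 * pair_count( $n );
}

# Tiny demonstration on made-up keys.
my @seen;
each_pair( [ qw( a b c d ) ], sub { push @seen, "$_[0]-$_[1]" } );
print "@seen\n";
```

At a million keys that is roughly 5 x 10^11 unordered pairs, so even when split evenly across 24 CPUs each one would face about 2 x 10^10 comparisons. If each pair is scored once, a match can still be written out in both directions to keep the output the same as the posted code.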
