PerlMonks
Re^2: dynamic number of threads based on CPU utilization

by mabossert (Sexton)
on Sep 27, 2012 at 01:03 UTC ( #995893 )


in reply to Re: dynamic number of threads based on CPU utilization
in thread dynamic number of threads based on CPU utilization

This brings me to another question: assuming that my "similar" hash would be populated with anywhere from several hundred thousand to a million or more key/value pairs, is there a better way to tackle this? I am working on a blade server with 24 physical CPUs and more than 500 GB of RAM. I must be able to determine the similarity metric (keeping only those that are a 90% or better match) of every single key to every other key. Given those resources and requirements, what are your thoughts?
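For a sense of scale (my arithmetic, not figures from the thread): an all-pairs comparison of n keys needs n(n-1)/2 similarity computations, so a million keys means roughly 5 x 10^11 pairs:

```perl
#!/usr/bin/perl
# Back-of-envelope cost of the all-pairs comparison described above.
# Illustrative arithmetic only -- not code from the thread.
use strict;
use warnings;

sub pair_count { my $n = shift; return $n * ( $n - 1 ) / 2 }

for my $n ( 100_000, 1_000_000 ) {
    printf "%9d keys => %.3e pairwise comparisons\n", $n, pair_count( $n );
}
```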


Re^3: dynamic number of threads based on CPU utilization
by BrowserUk (Pope) on Sep 27, 2012 at 01:50 UTC
    several hundred thousand to a million or more key/value pairs

    How big are the keys and values on average? And how big are the xml files on average?

    I am working on a blade server with 24 physical CPUs and more than 500 GB of RAM

    Is the blade server set up as a single SMP system? How many cores/threads per cpu?

    Given those resources and requirements...what are your thoughts?

    I'd want to see the answers to the above questions before reaching any conclusions about how I would go about tackling the problem.

    Sight (public or private) of a typical example of the XML input and the keys/value pairs derived from it would help.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      The server is set up as a single SMP system, with each CPU being single-core and single-threaded. The values average 10 bytes, the keys are roughly 20-30 bytes on average, and the XML files vary wildly between 1 MB and 110 MB. Unfortunately, I can't share the XML files as they belong to a customer and contain proprietary information.

        How many key/value pairs come out of each xml document?



        Okay. This is what I think I would try given your situation:

#! perl -slw
use strict;
use threads;
use threads::shared;
use threads::Q;               ## Self-limiting sized queues.
use Time::HiRes qw[ time ];

sub uniq{ my %h; undef @h{ @_ }; keys %h }

my $start = time;

our $Q //= 100;               ## The maximum size of the Qs
our $T //= 4;                 ## No of worker threads at both stages
our $F //= '*.txt';           ## File selector for testing

my $semFIO :shared;           ## serialise disk access

my $Qxmlin = threads::Q->new( $Q );   ## filenames to XML threads
my $Qitems = threads::Q->new( $Q );   ## extracted items from XML threads

async {                       ## "XML Parser" thread pool
    while( my $file = $Qxmlin->dq ) {
        my $xml = do{ lock $semFIO; local( @ARGV, $/ ) = $file; <> };
        $Qitems->nq( join $;, split ' : ', $_ ) for split "\n", $xml;
    }
    $Qitems->nq( undef );
}->detach for 1 .. $T;

async {                       ## Q up filenames for XML processing pool
    $Qxmlin->nq( glob "sha/$F" );
    $Qxmlin->nq( (undef) x $T );
}->detach;

my @items;                    ## Non-shared storage for extracted items
for( 1 .. $T ) {              ## Gather them all together
    push @items, $_ while defined( $_ = $Qitems->dq );
}
undef $Qxmlin; undef $Qitems; ## Done with these.
print STDERR scalar @items;   ## Sanity check of items count.

my $Qcmp = threads::Q->new( $Q );     ## Item pairs for comparison in
my $Qsim = threads::Q->new( $Q );     ## Similar rated items out

async {                       ## similarities assessment thread pool
    while( my $work = $Qcmp->dq ) {
        my( $key1, $val1, $key2, $val2 ) = split $;, $work;
        my %bigrams1; undef @bigrams1{ uniq unpack '(A2)*', $key1 };
        my @bigrams2 = uniq unpack '(A2)*', $key2;
        my $count = grep exists( $bigrams1{ $_ } ), @bigrams2;
        my $sim = $count * 2 / ( keys( %bigrams1 ) + @bigrams2 );
        $Qsim->nq( $work ) if $sim > 0.2;   ## low value required to allow my test data to generate hits
    }
    $Qsim->nq( undef );
}->detach for 1 .. $T;

async {                       ## Q up the pairs for comparison
    for my $i1 ( 0 .. $#items ) {
        my $item1 = $items[ $i1 ];
        for my $i2 ( $i1 + 1 .. $#items ) {
            my $item2 = $items[ $i2 ];
            $Qcmp->nq( "$item1$;$item2" );
        }
    }
    $Qcmp->nq( (undef) x $T );
}->detach;

## Gather together those that pass the criteria
## and do something with them.
for( 1 .. $T ) {
    while( my $sim = $Qsim->dq ) {
        print $sim;
    }
}

printf STDERR "With T:$T Q:$Q took %.3f s\n", time - $start;

        The comments in this code -- which simulates your XML files using simple flat text files of sha256_hex keys and a number from which the sha256 was derived -- are sparse; and the questions probably many. Easier to answer your actual questions than try and guess what they might be and answer them in comments.
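The similarity test inside the comparison pool is a Dice coefficient over unique, non-overlapping character bigrams. Pulled out as a standalone sub (the name bigram_sim is mine; the body matches the pool code), it can be exercised on its own:

```perl
#!/usr/bin/perl
# Dice similarity over unique character bigrams, isolated from the
# comparison pool above. The sub name is mine, not from the original.
use strict;
use warnings;

sub uniq { my %h; undef @h{ @_ }; keys %h }

sub bigram_sim {
    my( $key1, $key2 ) = @_;
    my %bigrams1; undef @bigrams1{ uniq unpack '(A2)*', $key1 };
    my @bigrams2 = uniq unpack '(A2)*', $key2;
    my $count   = grep exists $bigrams1{ $_ }, @bigrams2;
    return $count * 2 / ( keys( %bigrams1 ) + @bigrams2 );
}

printf "%.3f\n", bigram_sim( 'abcdef', 'abcdxy' );
```

Note that unpack '(A2)*' produces non-overlapping bigrams, so 'abcdef' yields 'ab', 'cd', 'ef'; 'abcdxy' shares two of its three, giving a similarity of 2*2/(3+3).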

        There are two tunable parameters -- the size of the queues, which should be at least double the size of the pools, and the size of the pools (threads). I've used the same numbers for both halves of the program, but you might want to try using different values and tuning them separately.

        Take a look -- ask whatever questions arise :)


        It uses my own threads::Q implementation of a self-limiting (size) queue:

        package threads::Q;

        use strict;
        use warnings;
        use threads::shared;

        our $VERSION = '1.1.5';

        sub new {
            my( $package, $size ) = @_;
            my @Q :shared;
            return bless [ \@Q, $size ], $package;
        }

        sub nq {
            my( $Q, $size ) = @{ shift() };
            lock @$Q;
            for( @_ ) {
                cond_wait @$Q while @$Q > $size;
                push @$Q, $_;
                cond_signal @$Q;
            }
            return;
        }

        sub dq {
            my( $Q, $size ) = @{ shift() };
            lock @$Q;
            cond_wait @$Q until @$Q;
            cond_signal @$Q;
            shift @$Q;
        }

        sub cq {
            my( $Q, $size ) = @{ shift() };
            scalar @$Q;
        }

        sub dq_nb {
            my( $Q, $size ) = @{ shift() };
            return unless @$Q;
            lock @$Q;
            shift @$Q;
        }

        1;    ## a module must return a true value

        That is pretty well exercised -- except dq_nb(), which I've never had occasion to use for real -- but is undocumented, hence not on CPAN. new() takes a single argument: the maximum number of items the Q can hold at any given time. dq() is dequeue(); nq() is enqueue(); cq() ("see-queue") is pending().

        Its benefit is that it controls the unrestricted growth that can occur when the producer runs faster than the consumer. To its possible debit is that it doesn't accept anything other than scalars or references to pre-shared structures. I consider this a positive as in my experiments it is faster and far less memory hungry to pass compound data joined into a scalar and split it in the consumer, than to copy data in to a shared structure and copy it back again.
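As a minimal sketch of that pack-into-a-scalar approach (the field values here are hypothetical), a compound record is joined with $; on the producer side and split back apart on the consumer side:

```perl
#!/usr/bin/perl
# Passing compound data through a queue as one joined scalar instead of
# a shared structure. $; is Perl's subscript separator, "\034" by
# default, so it is unlikely to collide with ordinary data.
use strict;
use warnings;

my( $key, $val ) = ( 'deadbeef', 42 );   ## hypothetical record

## Producer side: join the fields into one scalar before nq().
my $packed = join $;, $key, $val;

## Consumer side: split it back apart after dq().
my( $k, $v ) = split $;, $packed;

print "$k => $v\n";
```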


