 
PerlMonks  

Re^5: Thread sharing a bit vector??

by traceyfreitas (Sexton)
on Apr 13, 2012 at 01:04 UTC ( [id://964847] )


in reply to Re^4: Thread sharing a bit vector??
in thread Thread sharing a bit vector??

Thank you both for your feedback.

My program really is a re-vamp of something that was heavily string-processing based and is well entrenched in Perl. Moving to something like C++ will definitely take some time and is not likely to happen for a while. I have thought about a relational database as a potentially viable option, so that may be revisited. I initially was thinking MySQL, but after some consulting, PostgreSQL was the choice. I'm glad to see that's your recommendation as well.

The use of the Bit::Vector library was primarily to reduce the memory requirement of holding these strings in memory and to provide a faster way to process these comparisons. Using some data-juggling methods and heavy use of threads (w/o inter-thread communication), the first incarnation of my program was able to run to completion on a 1TB, 40-core machine in about 10 days, essentially using a whole bunch of strings, string ops, and hashes. Without the data-juggling methods, I would exceed the 1TB of RAM. Using bit-vector encoding, I can reduce my memory consumption by ~32 fold on a 64-bit machine. And the bit vector operations are fast and do a lot of what I need them to do.
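For concreteness, here's where the ~32-fold figure comes from: a bit vector stores one membership flag per bit, so 64 flags fit in the space of a single 64-bit integer. A minimal sketch using core Perl's vec() and stringwise AND -- Bit::Vector wraps the same idea in C, so treat this as an analogy rather than its actual API:

```perl
use strict;
use warnings;

# Pack element IDs into a bit string: bit $i is on iff element $i is present.
sub to_bitvec {
    my ($n_bits, @ids) = @_;
    my $bv = "\0" x (($n_bits + 7) >> 3);   # pre-size to full width
    vec($bv, $_, 1) = 1 for @ids;
    return $bv;
}

my $a = to_bitvec(50, 0, 3, 7, 49);
my $b = to_bitvec(50, 3, 5, 49);

# Set intersection collapses to a single bitwise AND over the packed strings.
my $both   = $a & $b;
my @common = grep { vec($both, $_, 1) } 0 .. 49;
print "@common\n";    # 3 49
```

Bit::Vector's Intersection() does the AND in C over arbitrary widths, which is why it stays fast even at millions of vectors.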

It is unfortunate that the bit vectors cannot be shared; however, I do have work-arounds for this, although they are not as efficient as updating a shared variable. If updating a globally shared variable is too expensive (or, as in this case, not possible), I basically build up a thread-local variable and then dump it to disk with store() before joining the main thread. Kind of like the Parallel::ForkManager library does.
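A minimal sketch of that work-around, using core Storable and a made-up per-thread filename scheme (the real naming is whatever suits your pipeline):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Inside a worker: accumulate into a plain (unshared) thread-local hash...
my $tid = 7;    # would be threads->tid() in a real worker
my %results_local = (setA => [1, 2], setB => [3]);

# ...then dump it once, just before joining, instead of updating a
# shared structure on every iteration.
store(\%results_local, "$dir/results.$tid");

# The main thread collects the per-thread dumps after join()ing.
my $merged = retrieve("$dir/results.$tid");
print scalar(keys %$merged), " keys recovered\n";    # 2 keys recovered
```

One store() per thread lifetime keeps the serialization cost off the hot path entirely.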

It is true that I have had to work through some (what I think were) contention issues when updating globally shared variables; however, for the most part, I've solved that. Most of my repeated ops on globally shared variables were reads anyway. Hopefully the repeated read attempts don't cause some sort of passive contention as Perl attempts to prevent concurrent access...

Currently, I set up each thread to load a thread-specific portion of the entire dataset (part of the OUTER loop). Then, one by one, each thread loads up portions of the rest of the dataset from disk (part of the INNER loop) and does its comparisons. I have to store these results in a thread-local data structure until that thread has completed all its work, *then* update the globally shared variable just before the thread joins the main thread. This reduced lock contention significantly and allowed me to fully utilize all the (hyperthreaded) cores on my machine.

Here's a mix of code w/pseudocode that I hope will explain some of the specifics of this set_intersection() component:

    my @data_entire    :shared = ($aref1, $aref2, ... $aref12);
    my %completed      :shared = ();   # Records completed datasets
    my %results_global :shared = ();   # Global; holds intersections
    my @jobs           :shared = ();   # Holds thread objects

    # Since I know what to call the different subsets up front, I'll pre-
    # share them. In my experience, this is *much* faster than calling
    # shared_clone({}) for hash values that are deeply branching HREFs.
    foreach my $outer (@data_entire) {
        $results_global{$outer} = &share({});
        foreach my $inner (@data_entire) {
            $results_global{$outer}->{$inner} = &share([]);  # Holds AREFs
        }
    }

    # Partition data across 3 threads (4 AREFs/ea) and spawn thread jobs.
    # Each subset is an AREF of indices into @data_entire.
    my @partitions :shared = ($subset1, $subset2, $subset3);
    foreach (@partitions) {
        push(@jobs, threads->new(\&XYZ, $_));   # Create and store thread
    }
    $_->join() foreach (@jobs);

    # ==================== In a thread-local sub() ====================
    sub XYZ {
        my ($subset)      = @_;                         # This thread's AREF of indices
        my @data_subset   = @data_entire[@{ $subset }]; # This thread's chunk
        my %results_local = ();

        OUTER: foreach my $outer (@data_subset) {
            retrieve $outer;
            INNER: foreach my $inner (@data_entire) {
                next INNER if ($inner == $outer);           # Skip self
                next INNER if (exists $completed{$inner});  # Already done
                retrieve $inner;                            # EXPENSIVE

                # Returns two AREFs
                my ($result_oi, $result_io) = set_intersection($outer, $inner);

                # Store to thread-local variable
                $results_local{$outer}->{$inner} = $result_oi;
                $results_local{$inner}->{$outer} = $result_io;
            }
            {
                lock(%completed);             # Update global
                $completed{$outer} = undef;   # Mark dataset as done
            }
        } # OUTER

        # By updating the globally shared hash with the local results all
        # at once, just before this spawned thread returns to the main
        # thread, the write contention on %results_global is reduced
        # significantly. In practice, this made a HUGE difference and
        # allowed me to use 80 hyperthreaded cores @ 100% (40 physical),
        # whereas updating %results_global after each set_intersection()
        # reduced it to maybe 13 cores @ 100% usage (this is on a single-
        # memory machine).
        {
            lock(%results_global);
            foreach my $outer (keys %results_local) {
                $results_global{$outer}->{$_} = $results_local{$outer}->{$_}
                    foreach (keys %{ $results_local{$outer} });
            }
        }
        return;
    }
    # =================================================================

The main reason why I wanted to keep these bit vectors in memory is that it would obviate Storable's retrieve() calls. It is fast, but not fast enough when I have (as BrowserUK pointed out) >6M intersections to perform. SIDE NOTE: I can convert an array of 4 million 50-bit bit vectors to HEX strings with ->to_Hex(), store() them, retrieve() them, and reconstruct the bit vectors from the HEX strings with ->new_Hex() in much less time than it takes to simply store() the bit vector data structure itself.
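A sketch of that hex round-trip, with small plain integers standing in for the Bit::Vector objects (to_Hex()/new_Hex() do the equivalent for arbitrary widths); the point is that compact hex strings serialize far more cheaply than blessed objects:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Stand-ins for 50-bit vectors (a 50-bit value fits a native IV on 64-bit perls).
my @vectors = (0x0, 0x2A, 0xDEADBEEF, (1 << 49) | 5);

my @hex    = map { sprintf "%x", $_ } @vectors;   # ->to_Hex() analogue
my $frozen = freeze(\@hex);                       # store()/retrieve() use the same codec
my @back   = map { hex $_ } @{ thaw($frozen) };   # ->new_Hex() analogue

print(($back[3] == $vectors[3]) ? "round-trip ok\n" : "mismatch\n");
```

Serializing an array of short strings avoids Storable having to walk and rebuild millions of C-backed objects.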

Just to clarify, each of these 3500 arrays contains up to 25,000,000 bit vectors apiece. The worst-case scenario for my (serial) set_intersection() op is to perform 2*(25,000,000 + 25,000,000) - 1 =~ 100,000,000 bit-vector comparisons, which took 14 sec on a single 1.73 GHz Intel Core i7 Q820.

    (6,125,000 array pairs) * (14 sec/pair) = 85,750,000 sec

Assuming the per-core performance of set_intersection() is the same on an 80-core single-memory machine as on one core (though there will probably be lots of CPU cache misses when each of the 80 cores is loading up a large datafile):

(85,750,000 sec) / (80 cores) = 1,071,875 sec/core ~= 298 hrs ~= 12.4 days <--- Walltime for this op.

This is in line with past benchmarks. Obviously the worst-case scenario won't hold across the entire dataset, as I do have some arrays that contain only ~40,000 bit vectors, but I like to see an upper limit to know what I'm up against.

This estimation does not include the additional time needed to retrieve() the partitions from disk. So while sharing the data across multiple threads may not decrease the set_intersection() times themselves, it would eliminate the disk load times -- which are considerable, and which effectively limit how many threads I can use before disk thrashing becomes an issue. Without multithreading or some other parallelization, though, this 85,750,000 sec is equivalent to ~2.7 years on a single core, not to mention the disk load times. @_@
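For anyone checking the arithmetic, the whole estimate in one place (note the 14 sec figure is per worst-case array pair, i.e. per ~100M bit-vector comparisons, not per single comparison):

```perl
use strict;
use warnings;

my $pairs        = 6_125_000;   # ~3500*3499/2 array pairs to intersect
my $sec_per_pair = 14;          # worst-case set_intersection() per pair
my $cores        = 80;

my $total_sec    = $pairs * $sec_per_pair;   # serial walltime
my $per_core_sec = $total_sec / $cores;      # ideal 80-way parallel walltime

printf "serial: %d sec (~%.2f years)\n", $total_sec, $total_sec / (3600 * 24 * 365);
printf "on %d cores: %d sec = %.0f hrs = %.1f days\n",
    $cores, $per_core_sec, $per_core_sec / 3600, $per_core_sec / (3600 * 24);
```

The ideal-scaling assumption is optimistic, of course; cache misses and disk loads will eat into it.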

I have contacted the developer and am waiting to hear a response. Hopefully, there will be enthusiasm to work on this, because C++ is a heartbeat away.

In any case, thanks for the continued feedback.

Replies are listed 'Best First'.
Re^6: Thread sharing a bit vector??
by BrowserUk (Patriarch) on Apr 13, 2012 at 01:41 UTC

    Do you need to do this once or many times?

    Either way, set this going with your current solution & data NOW! Once it is running, then explore further possibilities.

    If: a) you are going to do this many times in the future; b) you can tolerate a delay (>1 month) before getting a good solution; c) you have sufficient C knowledge to help me port and extend my XS bit-vector code to whatever platform you are running on; then email me. (See my home node.)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      I will be running this only as often as practical. Meaning, if it takes 2 weeks to produce the data I need, then it would be run every 2 months or longer. If it could produce it in, say, just a few days, then I would re-run it with updated data at least every month. The dataset that this program chews on grows every month, and I foresee this program being used for years to come.

      I work with x86_64 machines (linux) and it'll be about a week before I can get the program completely re-tooled to use Bit::Vector. I shall email you!
