PerlMonks
Re^5: Thread sharing a bit vector?? by traceyfreitas (Sexton)
on Apr 13, 2012 at 01:04 UTC ( [id://964847] )
Thank you both for your feedback. My program really is a revamp of something that was heavily string-processing based and is well entrenched in Perl. Moving to something like C++ will definitely take some time and is not likely to happen for a while. I have thought about a relational database as a potentially viable option, so that may be revisited. I initially was thinking MySQL, but after some consulting, PostgreSQL was the choice; I'm glad to see that's your recommendation as well.

The use of the Bit::Vector library was primarily to reduce the memory required to hold these strings and to speed up the comparisons. Using some data-juggling methods and heavy use of threads (without inter-thread communication), the first incarnation of my program ran to completion on a 1 TB, 40-core machine in about 10 days, essentially using a whole bunch of strings, string ops, and hashes. Without the data juggling, I would exceed the 1 TB of RAM. With bit-vector encoding, I can reduce my memory consumption roughly 32-fold on a 64-bit machine, and the bit-vector operations are fast and do a lot of what I need.

It is unfortunate that the bit vectors cannot be shared, but I do have work-arounds for this, although they are not as efficient as updating a shared variable. When updating a globally shared variable is too expensive (or, as in this case, not possible), I basically build up a thread-local variable and then dump it to disk with store() before joining the main thread, much like the Parallel::ForkManager library does. It is true that I have had to work through some (what I think were) contention issues when updating globally shared variables, but for the most part I've solved that. Most of my repeated ops on globally shared variables were reads anyway. Hopefully the repeated read attempts don't cause some sort of passive contention as Perl attempts to prevent concurrent access...
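As an aside on the ~32-fold saving: the post uses Bit::Vector, but the same packing idea can be illustrated in core Perl with vec() and bitwise string ops. This is a minimal sketch (not the post's actual code): one bit per flag instead of one Perl scalar per flag, with intersection as a single string AND.

```perl
use strict;
use warnings;

# Pack boolean flags into a bit string with vec(): 1 bit per flag
# instead of one multi-byte Perl scalar per flag.
my ($a_bits, $b_bits) = ('', '');
vec($a_bits, $_, 1) = 1 for (0, 3, 7, 42);   # set bits in vector A
vec($b_bits, $_, 1) = 1 for (3, 7, 49);      # set bits in vector B

# Intersection is a single bitwise AND over the packed strings.
my $common = $a_bits & $b_bits;

# Count the surviving set bits; '%32b*' sums the bits in the string.
my $count = unpack '%32b*', $common;
print "bits in common: $count\n";   # bits 3 and 7 survive
```

Bit::Vector does the same thing in C with a richer API (Intersection(), Norm(), etc.), which is where the speed comes from.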
Currently, I set up each thread to load a thread-specific portion of the entire dataset (part of the OUTER loop). Then, one by one, each thread loads portions of the rest of the dataset from disk (part of the INNER loop) and does its comparisons. I have to store these results in a thread-local data structure until that thread has completed all its work, *then* update the globally shared variable just before the thread joins the main thread. This reduced lock contention significantly and allowed me to fully utilize all the (hyperthreaded) cores on my machine. Here's a mix of code and pseudocode that I hope will explain some of the specifics of this set_intersection() component:
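A single-process sketch of the loop structure described above (this is not the poster's code; `load_partition()` and `intersect()` are hypothetical stand-ins, and in the real program each OUTER iteration would run inside `threads->create` with the final merge guarded by `lock`):

```perl
use strict;
use warnings;

# Toy dataset: each "partition" is an array of packed bit strings.
my @partitions = ( ["\x0F", "\xF0"], ["\xFF"], ["\x0F"] );

sub load_partition { return $partitions[ $_[0] ] }    # stand-in for retrieve()

sub intersect {                # count bits shared by two packed vectors
    my ($x, $y) = @_;
    return unpack '%32b*', ($x & $y);
}

my %global_results;            # would be a :shared hash under ithreads

for my $outer (0 .. $#partitions) {     # each OUTER index would be one thread
    my $mine = load_partition($outer);
    my %local;                 # thread-local accumulator: no locking needed here

    for my $inner ($outer + 1 .. $#partitions) {    # INNER loop over the rest
        my $theirs = load_partition($inner);
        for my $x (@$mine) {
            for my $y (@$theirs) {
                $local{"$outer:$inner"} += intersect($x, $y);
            }
        }
    }

    # One merge per thread, just before join; with ithreads this block
    # would be wrapped in lock(%global_results).
    $global_results{$_} = $local{$_} for keys %local;
}

print "$_ => $global_results{$_}\n" for sort keys %global_results;
```

The point of the pattern is that each worker touches the shared structure exactly once, so lock contention is one acquisition per thread rather than one per comparison.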
The main reason I wanted to keep these bit vectors in memory is that it would obviate Storable's retrieve() calls. retrieve() is fast, but not fast enough when I have (as BrowserUK had pointed out) >6M intersections to perform. SIDE NOTE: I can convert an array of 4 million 50-bit bit vectors to hex strings with ->to_Hex(), store() them, retrieve() them, and reconstruct the bit vectors from the hex strings with ->new_Hex() in much less time than it takes to simply store() the bit-vector data structure itself.

Just to clarify, each of these 3500 arrays contains up to 25,000,000 bit vectors apiece. The worst case for my (serial) set_intersection() op is 2*(25,000,000 + 25,000,000) - 1 =~ 100,000,000 bit-vector comparisons, which took 14 sec on a single core of a 1.73 GHz Intel Core i7 Q820. That gives (6,125,000 intersections) * (14 sec/intersection) = 85,750,000 sec, assuming the performance of set_intersection() on one core is the same as on an 80-core single-memory machine (though there will probably be lots of CPU cache misses when each of the 80 cores is loading up a large data file).
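The hex round-trip described in the side note can be sketched in core Perl as well. This is only an analogue of Bit::Vector's ->to_Hex()/->new_Hex() pair, not the poster's code: a packed bit string is serialized to a flat hex scalar (which Storable handles much faster than a deep object graph) and reconstructed losslessly.

```perl
use strict;
use warnings;

# A packed 50-bit vector (vec()-style bit string).
my $bits = '';
vec($bits, $_, 1) = 1 for (0, 17, 49);

# Serialize to a hex string (analogous to Bit::Vector's ->to_Hex()) ...
my $hex = unpack 'H*', $bits;

# ... and reconstruct the packed vector (analogous to ->new_Hex()).
my $restored = pack 'H*', $hex;

print "round-trip ok\n" if $restored eq $bits;
```

Millions of such hex scalars in a flat array give Storable simple, uniform data to walk, which is plausibly why the convert-then-store() path beats store()ing the object structure directly.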
This is in line with past benchmarks. Obviously the worst-case scenario won't hold throughout the entire dataset, since some arrays contain only ~40,000 bit vectors, but I like to see an upper limit to know what I'm up against. This estimate does not include the additional time needed to retrieve() the partitions from disk. So while sharing the data across multiple threads may not decrease the set_intersection() times themselves, it would eliminate the disk load times, which are considerable and which effectively limit how many threads I can use before disk thrashing becomes an issue. Without multithreading or some other parallelization, though, this 85,750,000 sec is equivalent to ~2.7 years on a single core, not counting the disk load times. @_@ I have contacted the developer and am waiting to hear back. Hopefully there will be enthusiasm to work on this, because C++ is a heartbeat away. In any case, thanks for the continued feedback.
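The upper-bound arithmetic above can be checked directly (assuming, as in the post, that per-core throughput on the 80-core machine matches the single-core measurement):

```perl
use strict;
use warnings;

my $intersections = 6_125_000;   # ~3500*3500/2 pairwise set intersections
my $sec_each      = 14;          # measured worst case per intersection

my $serial_sec = $intersections * $sec_each;        # 85_750_000 s serial
my $years      = $serial_sec / (365 * 24 * 3600);   # ~2.7 years on one core
my $days_80    = $serial_sec / 80 / 86_400;         # ~12.4 days on 80 cores

printf "serial: %d s (%.1f years); 80 cores: %.1f days\n",
       $serial_sec, $years, $days_80;
```

The ~12-day figure under perfect 80-way scaling is roughly consistent with the 10-day run the first incarnation achieved on 40 cores, which suggests the per-intersection worst case overstates the typical cost.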
In Section: Seekers of Perl Wisdom