|Perl: the Markov chain saw|
Threaded Code Not Faster Than Non-Threaded -- Why?by Tommy (Chaplain)
|on Jan 05, 2014 at 02:08 UTC||Need Help??|
Tommy has asked for the
wisdom of the Perl Monks concerning the following question:
So I put together the reference code for the dfw.pm.org hackathon that's going on. One version uses threading, and another version does not. The purpose of the code is to detect (and optionally delete) duplicate files in a given directory tree. You can watch the code in action in a small video here.
You'd be right to ask why I'd want to make this multi-threaded when it has to do with IO. The answer is quite simply that after you've eliminated obvious non-dupes, you have to start comparing the files with a real means of differentiation, namely calculating file digests--a cpu-intensive task that comes after you're done with the IO. I use xxhash for this and I was hoping that the threading would help me concurrently do the heavy lifting in that area, spreading the number crunching across 8 cpu cores. Surely that would make it faster, right?
The code is on github here. The code isn't glorious, and in a few ways I'm unhappy with it but time constraints keep me from moving what should be modular over into a module and refactoring the code to call out to that. If I can't sleep tonight maybe I'll do just that, but...
As expected, the threaded version uses more RAM, however what is unexpected is that it gains barely any advantage over the non-threaded code in speed. In short executions, it's actually slower (I guess that's the thread management overhead). In very long executions of the code, I'd expect to see more of a boost, but the boost just isn't there. By using concurrency I thought I'd gain more of an advantage in speed, but this just isn't the case. I'm asking folks who know more about threading if they could tell me what I'm doing wrong.
I've considered the possibility that something must be up with the underlying disk storage. IO blocking for example. Well I can confirm via sar/iostat that both versions of the code push the raid10 array to maximum expected performance levels. I'd like to believe that that's all there is to it, but I get the same lack of "boost" when I run the code on a ramdisk. Seek time and IO blocking become irrelevant in such a scenario. And yet, still no performance gains with threads.
The code is too big to put into a post without breaking all manner of web etiquette laws, but it isn't hard to grok if you open it in vim and fold on the subs. You can see it all laid out. There's the sub that creates the thread pool, a "worker" sub, and a sub to wait on and end the threads in the pool.
So while I start nytprof'ing the code, could you take a peek and let me know if it looks like I'm making any _obvious_ mistakes? Any insight is appreciated, and I thank you in advance for your suggestions.
A mistake can be valuable or costly, depending on how faithfully you pursue correction