So I put together the reference code for the hackathon that's going on. One version uses threading, and another version does not. The purpose of the code is to detect (and optionally delete) duplicate files in a given directory tree. You can watch the code in action in a small video here.

You'd be right to ask why I'd want to make this multi-threaded when it has to do with IO. The answer is quite simply that after you've eliminated obvious non-dupes, you have to start comparing the files with a real means of differentiation, namely calculating file digests--a CPU-intensive task that comes after you're done with the IO. I use xxhash for this, and I was hoping that threading would let me do the heavy lifting in that area concurrently, spreading the number crunching across 8 CPU cores. Surely that would make it faster, right?
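
The two-stage approach described above can be sketched roughly as follows. This is a minimal illustration and not the code from the repository: the sub name `find_duplicates` is made up for the example, file size is assumed to be the "obvious non-dupe" test, and the core Digest::MD5 module stands in for xxhash so the sketch runs without any CPAN installs.

```perl
use strict;
use warnings;
use File::Find ();
use Digest::MD5 ();

# Hypothetical helper: returns a list of array refs, each holding the
# (sorted) paths of files under $root whose contents digest alike.
sub find_duplicates {
    my ($root) = @_;

    # Pass 1: bucket regular files by size. A file with a unique size
    # cannot have a duplicate, so it never needs to be read at all.
    my %by_size;
    File::Find::find(
        {
            no_chdir => 1,    # keep $_ as the full path
            wanted   => sub {
                return if -l $_;         # skip symlinks
                return unless -f $_;     # regular files only
                my $size = (-s _) || 0;  # reuse the stat from -f
                push @{ $by_size{$size} }, $_;
            },
        },
        $root,
    );

    # Pass 2: digest only the files that share a size with another file.
    my %by_digest;
    for my $size (keys %by_size) {
        my @paths = @{ $by_size{$size} };
        next if @paths < 2;
        for my $path (@paths) {
            open my $fh, '<:raw', $path or next;    # skip unreadable files
            # Digest::MD5 is a stand-in for xxhash (assumption).
            my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            push @{ $by_digest{"$size:$digest"} }, $path;
        }
    }

    return map { [ sort @$_ ] } grep { @$_ > 1 } values %by_digest;
}

# Print each group of duplicates when a directory is given as an argument.
if (@ARGV) {
    for my $group (find_duplicates($ARGV[0])) {
        print join("\n", @$group), "\n\n";
    }
}
```

Only pass 2 is CPU-bound, so that inner loop is the part a thread pool would take over.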

The code is on GitHub here. The code isn't glorious, and in a few ways I'm unhappy with it, but time constraints keep me from moving what should be modular into a module and refactoring the code to call out to it. If I can't sleep tonight maybe I'll do just that, but...

As expected, the threaded version uses more RAM. What is unexpected is that it gains barely any speed advantage over the non-threaded code. In short runs it's actually slower (I guess that's the thread management overhead). In very long runs I'd expect to see more of a boost, but the boost just isn't there. I'm asking folks who know more about threading if they could tell me what I'm doing wrong.

I've considered the possibility that something is up with the underlying disk storage, IO blocking for example. Well, I can confirm via sar/iostat that both versions of the code push the RAID10 array to its maximum expected performance levels. I'd like to believe that's all there is to it, but I get the same lack of "boost" when I run the code on a ramdisk, where seek time and IO blocking become irrelevant. And yet, still no performance gains with threads.

The code is too big to put into a post without breaking all manner of web etiquette laws, but it isn't hard to grok if you open it in vim and fold on the subs. You can see it all laid out. There's the sub that creates the thread pool, a "worker" sub, and a sub to wait on and end the threads in the pool.
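
The three pieces named above (pool creation, a worker sub, and a step that waits on and ends the threads) typically look something like the sketch below when built on the core threads and Thread::Queue modules. It is an illustration under assumptions rather than the repository's code: `digest_in_parallel` is a hypothetical name, Digest::MD5 stands in for xxhash, and it needs a perl built with ithreads support.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Digest::MD5 ();

# Hypothetical helper: digest every path in @$paths using $n_workers
# threads. Returns a hash ref of path => hex digest; unreadable files
# are simply left out.
sub digest_in_parallel {
    my ($paths, $n_workers) = @_;
    my $work    = Thread::Queue->new;    # paths waiting to be digested
    my $results = Thread::Queue->new;    # [path, digest] pairs coming back

    # Worker sub: pull paths until an undef sentinel arrives.
    my $worker = sub {
        while (defined(my $path = $work->dequeue)) {
            open my $fh, '<:raw', $path or next;
            # Digest::MD5 is a stand-in for xxhash (assumption).
            my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
            close $fh;
            $results->enqueue([ $path, $hex ]);
        }
        return;
    };

    # Create the pool.
    my @pool = map { threads->create($worker) } 1 .. $n_workers;

    # Hand out the work, then one sentinel per worker to end its loop.
    $work->enqueue(@$paths);
    $work->enqueue((undef) x $n_workers);

    # Wait on and end the threads.
    $_->join for @pool;

    # Drain the results; every worker has finished by now.
    my %digest_for;
    while (defined(my $pair = $results->dequeue_nb)) {
        $digest_for{ $pair->[0] } = $pair->[1];
    }
    return \%digest_for;
}

# Print "digest  path" lines when paths are given as arguments.
if (@ARGV) {
    my $digests = digest_in_parallel([@ARGV], 4);
    print "$digests->{$_}  $_\n" for sort keys %$digests;
}
```

In this sketch each worker opens and digests its own files and sends back only a short [path, digest] pair, so very little data is copied between threads.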

So while I start nytprof'ing the code, could you take a peek and let me know if it looks like I'm making any _obvious_ mistakes? Any insight is appreciated, and I thank you in advance for your suggestions.

A mistake can be valuable or costly, depending on how faithfully you pursue correction

In reply to Threaded Code Not Faster Than Non-Threaded -- Why? by Tommy
