"While the multithreaded code works, it's about 50% slower than the first one."

The problem comes in two parts:

  1. As Eily already pointed out: if the two files reside on the same (physical) drive, then interleaving reads to the two files means that, on average, the heads will need to do twice as many track-to-track seeks as when you read them sequentially.

    As track-to-track seek time is the slowest part of reading from a disk, it dominates the overall time, with the net result that you actually slow things down.

    If the files are (or can be arranged to be, without needing to move one of them) on different physical devices, much of the additional overhead is negated. If those drives are connected via separate interfaces, so much the better.

  2. The second part is more insidious. When you return a hashref from a subroutine as you are doing:
    return \%hash;

    Normally, without threads, that is a very efficient operation, requiring just the copying of a reference.

    But under the threads memory model, only data that is explicitly shared can be transferred between threads. Normally this is a good thing, preventing unintended shared accesses and all the problems that can arise from them; but in this case it means that the entire hash is effectively duplicated when it crosses the thread boundary, and that is a relatively slow process for large hashes (see the sketch below this list).

    This is especially annoying when the transfer is the last act of the originating thread: once the original hash has been duplicated, it is discarded, so there would have been no risk in just transferring the reference.

    I did once look into whether threads could be patched to avoid the duplication for this special case, but the entire memory management of Perl, especially under threading, is so complex and opaque that I quickly gave up.
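
To make part 2 concrete, here is a minimal sketch of the pattern under discussion (the count_words worker and the file names are hypothetical stand-ins, not the original code). The hash is built inside the thread, but the reference returned by the worker cannot simply be handed over; the whole hash is cloned when the parent joins:

    use strict;
    use warnings;
    use threads;

    sub count_words {
        my( $file ) = @_;
        open my $fh, '<', $file or die "open '$file': $!";
        my %hash;
        ++$hash{ $_ } for map { split ' ', $_ } <$fh>;
        return \%hash;   # not a cheap copy under threads: the entire
                         # hash is cloned back to the joining thread
    }

    my @workers = map { threads->create( \&count_words, $_ ) }
                      'file1.txt', 'file2.txt';

    my( $counts1, $counts2 ) = map { $_->join } @workers;   # duplication happens here

For a large hash, that final clone can easily cost more than the work the thread did, which is why the threaded version times worse rather than better.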

There is a nascent, unproven and unpublished possible solution to the second part of the problem. I've recently written several hash-like data structures using Inline::C that bypass Perl's memory management entirely and allocate their memory from the CRT heap. As all that Perl sees of these structures is an opaque RV pointing to a UV, it should be possible to pass one of those references between threads without Perl interfering and needing to duplicate them.
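
To illustrate the idea, here is a minimal sketch of that technique, with a trivial struct standing in for the real hash-like structure (every name here is hypothetical; this is not the abandoned code). For brevity the handle is returned as a bare UV rather than wrapped in an RV, but the principle is the same: Perl only ever holds an integer, so there is nothing for the threads cloning machinery to walk:

    use strict;
    use warnings;
    use Inline C => q{
        #include <stdlib.h>

        /* trivial stand-in for the real hash-like structure */
        typedef struct { int value; } MyStruct;

        UV ms_new( int v ) {
            MyStruct *p = malloc( sizeof( MyStruct ) );
            p->value = v;
            return PTR2UV( p );   /* hand Perl an opaque integer handle */
        }

        int ms_get( UV handle ) {
            return INT2PTR( MyStruct*, handle )->value;
        }

        void ms_free( UV handle ) {
            free( INT2PTR( MyStruct*, handle ) );
        }
    };

    my $handle = ms_new( 42 );        # the memory lives on the CRT heap, not in Perl
    print ms_get( $handle ), "\n";    # prints 42
    ms_free( $handle );               # freeing it is our responsibility; Perl won't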

But the ones that would be applicable to your use were only developed as far as needed to prove that they weren't useful for my purposes, and then abandoned whilst I developed my solution; that solution, whilst hash-like, is very specialised for my requirements and not useful as a general-purpose hash (no iterators or deletions), and I don't have the time to finish any of the more general ones.

If you have C/XS skills and your need is pressing enough, I could give you what I have for you to finish.

Of course, that would only help if you can arrange for your two files to be on different disks in order to solve or mitigate part 1 of the problem.


