anli_ has asked for the wisdom of the Perl Monks concerning the following question:

Hey. I'm trying to get the feel for how multithreading works with perl in an effort to try to speed up my scripts. My test case is that I want to parse two relatively large files and place all the lines from each file into separate hashes. Normally I would just go over them one by one with a while loop and populate my hashes that way. Like so.
open IN, '<', $ARGV[0];
my %hash1;
while (<IN>) {
    next unless ( $_ =~ m/^\@HWI/ );
    my ($header) = split(/ /, $_);
    $hash1{$header} = 1;
}
close IN;

open IN2, '<', $ARGV[1];
my %hash2;
while (<IN2>) {
    next unless ( $_ =~ m/^\@HWI/ );
    my ($header) = split(/ /, $_);
    $hash2{$header} = 1;
}
close IN2;
Instead I thought I'd try doing both files at once. Like so:
use threads;

my @threads = ("1", "2");

# Loop through the array:
foreach (@threads) {
    # Tell each thread to perform our 'parseLines()' subroutine.
    $_ = threads->create(\&parseLines, shift(@ARGV));
}

# Try to join the running threads.
# Some check implemented to avoid quitting the loop before everything is joined.
my @running  = threads->list(threads::running);   # Array of running threads
my @joinable = threads->list(threads::joinable);  # Array of joinable threads
my @catcher;
while (scalar(@running) != 0 || scalar(@joinable) > 0) {  # While there are running or joinable threads
    @running = threads->list(threads::running);   # Repopulate running, not sure if needed.
    foreach (@threads) {
        if ($_->is_joinable()) {
            push(@catcher, $_->join());            # Puts parsed file as hash-ref into array
        }
    }
    @joinable = threads->list(threads::joinable);
    @running  = threads->list(threads::running);
}

sub parseLines {
    open IN, '<', $_[0];
    my %hash;
    while (<IN>) {
        next unless ( $_ =~ m/^\@HWI/ );
        my ($header) = split(/ /, $_);
        $hash{$header} = 1;
    }
    close IN;
    return \%hash;
}
While the multithreaded code works, it's about 50% slower than the first one, so nothing is really accomplished there. It seems to me that maybe the problem is that the threads take a long time passing the created hashes back to the main script. So it's shuffling things around in memory, but I'm on really thin ice here I must admit. Any input is appreciated.
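A rough way to check that suspicion (an untested sketch; the parseLines_timed name and the extra timestamp exist only for the measurement) would be to have each worker record when it finished parsing, and compare that with the clock once join() has copied the result back:

    use threads;
    use Time::HiRes qw(time);

    sub parseLines_timed {
        my %hash;
        open my $in, '<', $_[0] or die "Can't open $_[0]: $!";
        while (<$in>) {
            next unless /^\@HWI/;
            my ($header) = split(/ /, $_);
            $hash{$header} = 1;
        }
        close $in;
        return [ \%hash, time ];                 # timestamp taken when parsing finished
    }

    my @thr = map { threads->create(\&parseLines_timed, $_) } @ARGV[0, 1];
    for my $t (@thr) {
        my ($ref, $parsed_at) = @{ $t->join };   # join() clones the whole structure back
        printf "copying %d keys back took roughly %.3fs\n",
               scalar keys %$ref, time() - $parsed_at;
    }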

Re: Using threads to process multiple files
by BrowserUk (Pope) on Jan 30, 2015 at 18:56 UTC
    While the multithreaded code works, it's about 50% slower than the first one

    The problem comes in two parts:

    1. As Eily already pointed out: if the two files reside on the same (physical) drive, then interleaving reads to the two files means that, on average, the heads will need to do twice as many track-to-track seeks as when you read them sequentially.

      As track-to-track seek time is the slowest part of reading a disk; that means that it dominates; with the net result that you actually slow things down.

      If the files are (or can be arranged to be, without necessitating moving one of them) on different physical devices, much of the additional overhead is negated. If those drives are connected by separate interfaces, so much the better.

    2. The second part is more insidious. When you return a hashref from a subroutine as you are doing:
      return \%hash;

      Normally, without threads, that is a very efficient operation necessitating just the copying of a reference.

      But with the threads memory model, only data that is explicitly shared can be transferred between threads. Normally, this is a good thing, preventing unintended shared accesses and all the problems that can arise from them; but in this case, it means that the entire hash gets effectively duplicated; and that is a relatively slow process for large hashes. (One way of making the hash explicitly shared from the start is sketched just after this list.)

      This is especially annoying when the transfer is the last act of the originating thread; as once the original hash has been duplicated, it is then discarded; so there would be no risk from just transferring the reference.

      I did once look into whether threads could be patched to avoid the duplication for this special case; but the entire memory management of perl; and especially under threading; is so complex and opaque that I quickly gave up.
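    For illustration only, a minimal sketch of the explicit-sharing route (untested; %results, parse_into_shared and the other names are inventions for this sketch, not code from any post here): build each hash as shared from the start, so that join() has nothing left to duplicate. Be aware that every store into a shared hash goes through a tied, locking implementation, so this trades the one-off duplication for a per-key cost and is by no means guaranteed to be a win:

        use strict;
        use warnings;
        use threads;
        use threads::shared;

        my %results :shared;            # one entry per input file, filled by the workers

        sub parse_into_shared {
            my ($file, $slot) = @_;
            my %hash :shared;           # shared from the start; nothing to copy at join() time
            open my $in, '<', $file or die "Can't open $file: $!";
            while (<$in>) {
                next unless /^\@HWI/;
                my ($header) = split(/ /, $_);
                $hash{$header} = 1;     # each store pays the shared-hash (tie/lock) cost
            }
            close $in;
            lock %results;
            $results{$slot} = \%hash;   # a ref to a shared hash may be stored in a shared hash
            return;                     # return nothing, so join() has nothing to clone
        }

        my @workers = map { threads->create(\&parse_into_shared, $ARGV[$_], $_) } 0 .. $#ARGV;
        $_->join for @workers;
        # $results{0}, $results{1}, ... now hold the parsed hashes, one per input file.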

    There is a possible, but nascent, unproven and unpublished, solution to the latter part of the problem. I've recently written several hash-like data-structures using Inline::C that bypass Perl's memory management entirely and allocate their memory from the CRT heap. As all that perl sees of these structures is an opaque RV pointing to a UV, it should be possible to pass one of these references between threads without Perl interfering and needing to duplicate them.

    But, the ones that would be applicable to your use were only developed as far as needed to prove that they weren't useful for my purposes, and then abandoned whilst I developed my solution; which, whilst hash-like, is very specialised for my requirements and not useful as a general-purpose hash (no iterators or deletions); and I don't have the time to finish any of the more general ones.

    If you have C/XS skills and your need is pressing enough, I could give you what I have for you to finish.

    Of course, that would only help if you can arrange for your two files to be on different disks in order to solve or mitigate part 1 of the problem.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      To expand here (as I was about to post much of BrowserUk's point 1), I expect you'll get much better performance if you swap to slurp mode, specifically because you won't interleave reads. It also means that whichever thread reads first will almost assuredly finish its processing before thread 2 finishes its read, so your wall time will be closer to n reads + 1 process + 1 hash transfer. Perhaps something like (untested):
      sub parseLines {
          my $content = do {
              open my $in, '<', $_[0] or die "Open failed $_[0]: $!";
              local $/;
              <$in>;
          };
          my %hash;
          for (split /(?<=\n)/, $content) {
              next unless /^\@HWI/;
              my ($header) = split(/ /, $_);
              $hash{$header} = 1;
          }
          return \%hash;
      }
      Note the indirect filehandle (it automatically closes when it goes out of scope) and the failure test on the open. If your file is very large, I believe (I could be wrong) that the for(split) construct will be abusive of memory. In that case, you could craft it as a streaming parser as:
      while ($content =~ /(.*\n?)/g) {
          my $line = $1;
          next unless $line =~ /^\@HWI/;
          my ($header) = split(/ /, $line);
          $hash{$header} = 1;
      }
      It's plausible that slurp mode alone wouldn't be enough to force sequential reads, in which case it might make sense to add a shared read lock.
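      Something along these lines might do it (untested; $disk_token is just a name made up for this sketch): a shared variable used purely as a lock, so that only one thread slurps from the disk at a time while the parsing still overlaps:

          use threads::shared;

          my $disk_token :shared;        # used only as a lock; never assigned to

          sub slurp_serialised {
              my ($file) = @_;
              lock $disk_token;          # held until this sub returns
              open my $in, '<', $file or die "Open failed $file: $!";
              local $/;                  # slurp mode
              return scalar <$in>;       # whole file as one string
          }

      parseLines() above would then start with my $content = slurp_serialised($_[0]); in place of the do block, so the two threads take turns at the drive but still parse in parallel.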

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        Maybe ;-)

        I usually find that there isn't a large difference between one disk read and many reads. The disk-read code path is complex and has multiple layers of caching and read-ahead, so it's difficult to generalize: the OS and/or file system will do caching and read-ahead, and so will the hardware layers. (Unfortunately the hardware vendors are a secretive bunch and won't tell us what the disks and controllers get up to!) And that's before you start considering the different file systems, RAID setups and controller hardware that you might be using.

        tldr; Benchmark to be sure and see what's relevant in your particular case.

      Hi. Thanks for the reply. I was thinking along these lines as well. Although grateful, I think your Inline::C solution might be over my head. But I will try to look for some way of speeding it up. The harddrive I/O for sure is an issue, but it shouldn't cause the script to be slower. If anything, at least it should be able to perform as well. But like you say, the copying of the hash is probably the issue.
        The harddrive I/O for sure is an issue, but it shouldn't cause the script to be slower. If anything, at least it should be able to perform as well.

        Sorry, but that simply is not the case.

        The program below reads two large files:

        02/02/2015  15:42    10,737,418,241 big.csv
        02/02/2015  15:47    12,300,000,001 big.tsv

        First concurrently and then sequentially. The timings after the __END__ token show that reading them concurrently takes 5 times longer than reading them sequentially.

        #! perl -slw
        use strict;
        use threads;
        use Time::HiRes qw[ sleep time ];

        sub worker {
            my( $file, $start ) = @_;
            open my $in, '<', $file or die $!;
            sleep 0.0001 while time() < $start;
            my $count = 0;
            ++$count while <$in>;
            my $stop = time;
            return sprintf "$file:[%u] %.9f", $count, $stop - $start;
        }

        my $start = time + 1;
        my @workers = map threads->create( \&worker, $_, $start ), @ARGV;
        print $_->join for @workers;

        for my $file (@ARGV) {
            open my $in, '<', $file or die $!;
            my( $start, $count ) = ( time(), 0 );
            ++$count while <$in>;
            printf "$file:[%u] %.9f\n", $count, time()-$start;
        }
        __END__
        [15:49:22.32] E:\test>c:piotest.pl big.csv big.tsv
        big.csv:[167772161] 407.047676086
        big.tsv:[100000001] 417.717574120
        big.csv:[167772161] 82.103285074
        big.tsv:[100000001] 81.984734058


        Oh. And the final piece of the discussion about IO.

        I moved one of the two big files over to my old, slower, fragmented drive and re-ran the same test. Now, despite the fact that reading from the older drive is slower, the times taken to read both files concurrently and sequentially are almost identical, even though both drives are connected via the same interface:

        [15:49:22.32] E:\test>c:piotest.pl big.csv big.tsv
        big.csv:[167772161] 407.047676086
        big.tsv:[100000001] 417.717574120
        big.csv:[167772161] 82.103285074
        big.tsv:[100000001] 81.984734058

        [16:31:59.04] E:\test>c:piotest.pl c:big.csv big.tsv
        c:big.csv:[167772161] 138.239881039
        big.tsv:[100000001] 85.378695965
        c:big.csv:[167772161] 141.292586088
        big.tsv:[100000001] 83.687027931

Re: Using threads to process multiple files
by Eily (Monsignor) on Jan 30, 2015 at 17:10 UTC

    Actually, threads won't be of any use in your case. You gain time when a thread that doesn't use the CPU (i.e. it is waiting for the GPU or the hard drive to answer, which is really slow) makes the unused CPU time available to a heavier computational part, instead of just waiting until the action is done before going on to the next step. But here, you don't do much except read from two files. Which means that often, when a thread releases the CPU because it is waiting for the hard drive, the other thread won't be able to do anything either (the hard drive won't be ready yet) and will have to wait before it can even try to read its file.

    Besides, I suppose that if you are trying to make your program faster, your files are pretty big, which may mean several hard drive accesses even with buffering; and by alternating between the two threads, you will keep jumping from one place on the HD to another, instead of reading a whole file and then jumping to the other. Even with fragmentation, reading one whole file and then the second is probably less expensive time-wise than going back and forth between the two.

Re: Using threads to process multiple files
by AnomalousMonk (Bishop) on Jan 30, 2015 at 17:40 UTC

    Look for the posts of BrowserUk pertaining to multi-thread (and multi-processor) processing. It's his meat and potatoes.


    Give a man a fish:  <%-(-(-(-<

      Hey. Thanks for the tip. I'll do that.