
Re: Using threads to process multiple files

by BrowserUk (Patriarch)
on Jan 30, 2015 at 18:56 UTC ( [id://1115113] )


in reply to Using threads to process multiple files

While the multithreaded code works, it's about 50% slower than the first one

The problem comes in two parts:

  1. As Eily already pointed out: if the two files reside on the same (physical) drive, then interleaving reads to the two files means that, on average, the heads will need to do twice as many track-to-track seeks as when you read them sequentially.

    As track-to-track seek time is the slowest part of reading a disk, it dominates, with the net result that you actually slow things down.

    If the files are on different physical devices, or can be arranged to be, much of that additional overhead is negated. If those drives are connected via separate interfaces, so much the better.

  2. The second part is more insidious. When you return a hashref from a subroutine as you are doing:
    return \%hash;

    Normally, without threads, that is a very efficient operation necessitating just the copying of a reference.

    But with the threads memory model, only data that is explicitly shared can be transferred between threads. Normally this is a good thing, preventing unintended shared accesses and all the problems that can arise from them; but in this case it means that the entire hash gets effectively duplicated, which is a relatively slow process for large hashes.

    This is especially annoying when the transfer is the last act of the originating thread: once the original hash has been duplicated, it is discarded, so there would be no risk in just transferring the reference. (A minimal sketch of this pattern follows the list.)

    I did once look into whether threads could be patched to avoid the duplication for this special case, but the entire memory management of perl, and especially under threading, is so complex and opaque that I quickly gave up.
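For reference, here is a minimal sketch of the pattern in question (the worker sub, file handling and key extraction are illustrative stand-ins, not the OP's actual code). The expensive duplication described above happens at join() time, when the returned hash is cloned back into the joining thread:

    use strict;
    use warnings;
    use threads;

    sub parse_file {                           # illustrative worker; not the OP's sub
        my( $file ) = @_;
        open my $in, '<', $file or die "open '$file': $!";
        my %hash;
        while( <$in> ) {
            my( $key ) = split ' ';            # stand-in for whatever key the real code extracts
            $hash{$key} = 1;
        }
        return \%hash;                         # the entire hash is cloned when the thread is joined
    }

    my @workers = map threads->create( \&parse_file, $_ ), @ARGV;
    my @results = map $_->join, @workers;      # <-- the duplication happens here, once per thread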

There is a nascent, unproven and unpublished possible solution to the latter part of the problem. I've recently written several hash-like data structures using Inline::C that bypass Perl's memory management entirely and allocate their memory from the CRT heap. As all that perl sees of these structures is an opaque RV pointing to a UV, it should be possible to pass one of these references between threads without Perl interfering and needing to duplicate them.
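To illustrate the principle only; this is an untested sketch, not any of the code mentioned below, and the names are invented. The C side allocates from the CRT heap with malloc() and hands Perl back nothing but an integer holding the pointer, so there is nothing for Perl to walk or clone when that value crosses a thread boundary. The real structures hand back an RV pointing to such a UV rather than the bare number, but the idea is the same:

use strict;
use warnings;
use Inline C => <<'END_C';
#include <stdlib.h>

typedef struct { IV count; } mystruct;    /* stand-in for the real structure */

UV ms_new() {
    mystruct *p = malloc( sizeof( mystruct ) );
    p->count = 0;
    return PTR2UV( p );                   /* Perl only ever sees this integer */
}

void ms_incr( UV handle ) {
    ++( INT2PTR( mystruct*, handle )->count );
}

IV ms_count( UV handle ) {
    return INT2PTR( mystruct*, handle )->count;
}

void ms_free( UV handle ) {
    free( INT2PTR( mystruct*, handle ) );
}
END_C

my $h = ms_new();             # $h is just a number; copying it between threads is cheap
ms_incr( $h ) for 1 .. 3;
print ms_count( $h ), "\n";   # prints 3
ms_free( $h );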

But the ones that would be applicable to your use were only developed as far as needed to prove that they weren't useful for my purposes, and then abandoned whilst I developed my own solution; which, whilst hash-like, is very specialised for my requirements and not useful as a general-purpose hash (no iterators or deletions); and I don't have the time to finish any of the more general ones.

If you have C/XS skills and your need is pressing enough, I could give you what I have for you to finish.

Of course, that would only help if you can arrange for your two files to be on different disks in order to solve or mitigate part 1 of the problem.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re^2: Using threads to process multiple files
by kennethk (Abbot) on Jan 30, 2015 at 19:27 UTC
    To expand here (as I was about to post much of BrowserUk's point 1), I expect you'll get much better performance if you swap to slurp mode, specifically because you won't interleave reads. It also means that whichever thread reads first will almost assuredly finish its processing before thread 2 finishes its read, so your wall time will be closer to n reads, 1 process, 1 hash transfer. Perhaps something like (untested):
    sub parseLines {
        my $content = do {
            open my $in, '<', $_[0] or die "Open failed $_[0]: $!";
            local $/;
            <$in>;
        };
        my %hash;
        for ( split /(?<=\n)/, $content ) {
            next unless /^\@HWI/;
            my ($header) = split( / /, $_ );
            $hash{$header} = 1;
        }
        return \%hash;
    }
    Note the indirect filehandle (it automatically closes when it goes out of scope) and the failure test on the open. If your file is very large, I believe (I could be wrong) that the for(split) construct will be abusive of memory. In that case, you could craft it as a streaming parser as:
    while ( $content =~ /(.*\n?)/g ) {
        my $line = $1;
        next unless $line =~ /^\@HWI/;
        my ($header) = split( / /, $line );
        $hash{$header} = 1;
    }
    It's plausible that slurp mode alone wouldn't be enough to force sequential reads, in which case it might make sense to add a shared read lock.
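    For what it's worth, here is one way that might look (an untested sketch; the variable and sub names are mine, not the OP's). A single shared variable serializes the slurps, while the parsing in each thread can still overlap:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $read_lock :shared;             # all reader threads contend on this one variable

    sub slurp_serialized {             # call this from each worker instead of slurping directly
        my( $file ) = @_;
        lock $read_lock;               # only one thread may be reading at a time
        open my $in, '<', $file or die "Open failed $file: $!";
        local $/;
        return scalar <$in>;           # lock is released when the sub returns
    }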

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Maybe ;-)

      I usually find that there isn't a large difference between one disk read and many reads. The disk read code path is complex and has multiple layers of caching and read-ahead, so it's difficult to generalize. The OS and/or file system will do caching and read-ahead, and so will the hardware layers. (Unfortunately the hardware vendors are a secretive bunch and won't tell us what the disks and controllers get up to!) And that's before you start considering the different file systems, RAID setups and controller hardware that you might be using.

      tl;dr: Benchmark to be sure and see what's relevant in your particular case.

Re^2: Using threads to process multiple files
by anli_ (Novice) on Feb 02, 2015 at 09:00 UTC
    Hi, thanks for the reply. I was thinking along these lines as well. I'm grateful, but I think your Inline::C solution might be over my head. I will try to look for some way of speeding it up, though. The harddrive I/O for sure is an issue, but it shouldn't cause the script to be slower. If anything, at least it should be able to perform as well. But like you say, the copying of the hash is probably the issue.
      The harddrive I/O for sure is an issue, but it shouldn't cause the script to be slower. If anything, at least it should be able to perform as well.

      Sorry, but that simply is not the case.

      The program below reads two large files:

      02/02/2015  15:42    10,737,418,241 big.csv
      02/02/2015  15:47    12,300,000,001 big.tsv

      First concurrently and then sequentially. The timings after the __END__ token show that reading them concurrently takes 5 times longer than reading them sequentially.

#! perl -slw
use strict;
use threads;
use Time::HiRes qw[ sleep time ];

sub worker {
    my( $file, $start ) = @_;
    open my $in, '<', $file or die $!;
    sleep 0.0001 while time() < $start;
    my $count = 0;
    ++$count while <$in>;
    my $stop = time;
    return sprintf "$file:[%u] %.9f", $count, $stop - $start;
}

my $start = time + 1;
my @workers = map threads->create( \&worker, $_, $start ), @ARGV;
print $_->join for @workers;

for my $file (@ARGV) {
    open my $in, '<', $file or die $!;
    my( $start, $count ) = ( time(), 0 );
    ++$count while <$in>;
    printf "$file:[%u] %.9f\n", $count, time()-$start;
}
__END__
[15:49:22.32] E:\test>c:piotest.pl big.csv big.tsv
big.csv:[167772161] 407.047676086
big.tsv:[100000001] 417.717574120
big.csv:[167772161] 82.103285074
big.tsv:[100000001] 81.984734058


      Oh. And the final piece of the discussion about IO.

      I moved one of the two big files over to my old, slower, fragmented drive and re-ran the same test. Now, even though reading from the older drive is slower, the times taken to read both files concurrently and sequentially are almost identical, despite both drives being connected via the same interface:

      [15:49:22.32] E:\test>c:piotest.pl big.csv big.tsv
      big.csv:[167772161] 407.047676086
      big.tsv:[100000001] 417.717574120
      big.csv:[167772161] 82.103285074
      big.tsv:[100000001] 81.984734058

      [16:31:59.04] E:\test>c:piotest.pl c:big.csv big.tsv
      c:big.csv:[167772161] 138.239881039
      big.tsv:[100000001] 85.378695965
      c:big.csv:[167772161] 141.292586088
      big.tsv:[100000001] 83.687027931

