http://www.perlmonks.org?node_id=1069569

cganote has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I am approaching a problem where I've downloaded several hundred files that are about 20GB each. I need to checksum each file and compare it to the provided value to make sure each file is correct. Md5sum takes a while for files that large, and I thought I could speed this up if I ran it in parallel.
I added Parallel::ForkManager to my repertoire for the download itself. I went ahead and just added it blindly, curious to see if the single file it was writing to would end up misformatted - and it was =D

I attempted to solve it like so:
#!/usr/bin/perl -w
# testlock.pl
use strict;
use Parallel::ForkManager;
use Fcntl qw(:flock SEEK_END);

my @timenow = localtime;
open(my $out, ">", "output_" . $timenow[1] . "_" . $timenow[0] . ".txt")
    || die "Could not open output: $!\n";

# turn on autoflush for the shared output handle
my $stdout = select($out);
$| = 1;
select($stdout);

my @files = (1 .. 100);
my $fork  = Parallel::ForkManager->new(8);

foreach my $file (@files) {
    $fork->start and next;
    # placeholder; the real script runs md5sum here
    my $checksum = "md5sum $file";
    flock($out, LOCK_EX)    or die "Cannot lock filehandle - $!\n";
    seek($out, 0, SEEK_END) or die "Cannot seek - $!\n";
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    flock($out, LOCK_UN)    or die "Cannot unlock filehandle - $!\n";
    $fork->finish;
}
$fork->wait_all_children;
close $out;

However, after running this script a hundred times, I noticed that a significant number of the output files came out at different sizes. Here is my understanding of the situation (please correct me kindly if I'm off base):
The filehandle that I open before the loop is shared across the forked processes (per perlfunc), and the seek pointer is maintained in a shared fashion. A problem can occur when two processes write at the same position before either one updates the seek pointer, so effectively one overwrites the other.
I thought the flock call would prevent this by requiring each process to acquire and respect the lock before writing. I also thought the writes might be buffered, which could be contributing to the issue.
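(For reference, perlfunc's flock entry notes that Perl flushes a handle before locking or unlocking it, but the explicit pattern is worth spelling out. A minimal sketch of the lock-seek-write-flush-unlock sequence; the IO::Handle flush() call is an addition here, not something the test script above uses:)

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);
use IO::Handle;    # provides $fh->flush

# Append a record through a shared, inherited write handle.
sub locked_append {
    my ($out, $text) = @_;
    flock($out, LOCK_EX)    or die "Cannot lock filehandle - $!\n";
    seek($out, 0, SEEK_END) or die "Cannot seek - $!\n";
    print {$out} $text;
    $out->flush;            # push PerlIO's buffer to the OS before unlocking
    flock($out, LOCK_UN)    or die "Cannot unlock filehandle - $!\n";
}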

I went back and tried this without sharing a filehandle:
#!/usr/bin/perl -w
# test.pl
use strict;
use local::lib;
use LWP::Simple;
use Cwd;
use Parallel::ForkManager;

my @timenow = localtime;
my @files   = (1 .. 100);
my $fork    = Parallel::ForkManager->new(8);

foreach my $file (@files) {
    $fork->start and next;
    # each child opens (and appends to) its own filehandle
    open(my $out, ">>", "output_newfh_" . $timenow[1] . "_" . $timenow[0] . ".txt")
        || die "Could not open output: $!\n";
    # placeholder; the real script runs md5sum here
    my $checksum = "md5sum $file";
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    close $out;
    $fork->finish;
}
$fork->wait_all_children;

This works as expected - the file sizes are the same across many trials. My questions are: why didn't the first strategy work? Is something happening when a separate filehandle is opened in each child (is there automatic blocking somewhere?) that prevents overwrites, and can I guarantee that this tactic will always be correct? Would it make more sense to try this with ithreads instead?

The order of the output is not important, but it must all be there. I'm running this on Red Hat 6 with Perl 5.10. The system has flock(2) and fork. The files are all genomic data in BAM format. The underlying filesystem is Lustre, which I'm hoping will play nicely with the heavy I/O of the md5 calls in this program. In the example programs above, I simplified the code as much as possible.
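(For completeness, one way to sidestep shared or per-child output handles entirely is to have each child hand its result back and let only the parent write. A minimal sketch, assuming a Parallel::ForkManager new enough to support the data-returning form of finish() with run_on_finish(), and a hypothetical %expected hash holding the provided checksums; Digest::MD5 is used here instead of shelling out to md5sum:)

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;
use Parallel::ForkManager;

# %expected maps filename => provided md5 hex digest; how it gets filled in
# depends on how the checksums were supplied, so it is left empty here.
my %expected = ();
my @files    = sort keys %expected;

my $fork = Parallel::ForkManager->new(8);

# Only the parent prints; each child passes a result hash back via finish().
$fork->run_on_finish(sub {
    my ($pid, $exit, $ident, $signal, $core, $data) = @_;
    print $data->{report} if $data && $data->{report};
});

for my $file (@files) {
    $fork->start and next;

    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;

    my $status = $digest eq ($expected{$file} // '') ? 'OK' : 'MISMATCH';
    $fork->finish(0, { report => "$file  $digest  $status\n" });
}
$fork->wait_all_children;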

Replies are listed 'Best First'.
Re: Multi-threaded behavior, file handle dups and writing to a file
by oiskuu (Hermit) on Jan 06, 2014 at 19:28 UTC

    Could you satisfy our curiosity regarding the hardware being used? MD5 speed ought to be in the ballpark of 0.5 GB/s for a modern core (single-threaded). What bandwidth does your disk subsystem sport?

    Regarding the problem:

    $ find . -name '*.md5' -print0 | xargs -0 -n1 -P4 md5sum -c
    
    This should run four md5sum processes in parallel (provided a separate .md5 checksum file exists for each image file).
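    If the provided values are not already sitting in per-file .md5 files, a few lines of Perl can generate them for md5sum -c. A sketch, assuming a hypothetical checksums.txt manifest with one "digest  filename" pair per line:

    #!/usr/bin/perl
    # make_md5_files.pl -- write one <file>.md5 per data file so the
    # find/xargs pipeline above can verify them with `md5sum -c`.
    use strict;
    use warnings;

    open my $manifest, '<', 'checksums.txt' or die "Cannot open manifest: $!\n";
    while (my $line = <$manifest>) {
        chomp $line;
        my ($digest, $file) = split ' ', $line, 2;
        next unless defined $file and length $file;
        open my $md5, '>', "$file.md5" or die "Cannot write $file.md5: $!\n";
        print {$md5} "$digest  $file\n";    # the two-space format md5sum expects
        close $md5;
    }
    close $manifest;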

      I'm running on a compute node of a cluster; the file system is mounted on a separate machine. It looks like I'm getting approximately 0.27 GB/s in real time using:

      $ time md5sum filename
      real 0m34.537s
      user 0m31.455s
      sys 0m3.047s

      ...on a 9.3 GB file. I did this a few times.

      Thanks for the tip on xargs -P. I'll see if I can work that in instead - though I am still curious about perl forks.

        What throughput do you get if you do a simple
        $ time cat filename >/dev/null
        on the same file?

        That's your device/network IO baseline. (Assuming your null device is reasonably efficient.)


        the file system is mounted on a separate machine

        Update - NVM, I didn't read the original question carefully enough: are you perhaps limited by the network-based I/O here?

        --MidLifeXis

Re: Multi-threaded behavior, file handle dups and writing to a file
by sundialsvc4 (Abbot) on Jan 06, 2014 at 21:24 UTC

    You can get so much useful work done with xargs -Pn, as shown above, if your version of Linux/Unix supports it. Perhaps most usefully, you can very quickly see whether or not parallelism will actually be beneficial to you, without having to “write a complicated [Perl ...] program” in order to find out.

    The only determinant of the runtime of this particular task will be how fast the disk drives, channel subsystems and so forth can move the requisite amount of data past md5sum’s nose. The CPU processing time pales against the I/O time, and many filesystems handle parallelism internally, on behalf of all comers, very well on their own. Lustre might give you faster and/or more scalable throughput on this particular task . . . or not. Certainly you should fiddle around very extensively with the xargs approach to find out how your particular hardware configuration will (or won’t) respond favorably: where the “sweet-spot” number of processes is, whether it’s actually greater than 1, and so on. Then decide for yourself whether a more elaborate approach is justified. (Likely it won’t be, and in any case a Perl script designed simply to be run this way via xargs is much easier to bang out than something that implements its own multithread controller, and it just might work as well or even better. If you possibly can, “Jest get ’er done.”)
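    Along those lines, the per-file worker handed to xargs can itself be a tiny Perl script. A sketch (check_one.pl and the find pattern in the comment are placeholders, and Digest::MD5 stands in for shelling out to md5sum):

    #!/usr/bin/perl
    # check_one.pl -- print "digest  filename" for a single file; meant to be
    # driven by xargs, e.g.:
    #   find . -name '*.bam' -print0 | xargs -0 -n1 -P4 perl check_one.pl
    use strict;
    use warnings;
    use Digest::MD5;

    my $file = shift @ARGV or die "Usage: $0 <file>\n";

    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    binmode $fh;
    print Digest::MD5->new->addfile($fh)->hexdigest, "  $file\n";
    close $fh;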

      I really appreciate your response. I've only used xargs before for tricky pipes and to get the -n 1 feature. I see what you mean about CPU time on this issue, now that I've run it a few times.

      In cases where I find it does pay off to run the script with multiple tasks, is there a good, detailed overview of how I/O is handled on duped filehandles or between processes? What I've read so far still doesn't explain the 'why it do dat' of my program, and I'd like to fill in the holes in my understanding for future cases.

      >If you possibly can, “Jest get ’er done.”
      I wish I'd asked sooner. Sometimes it's helpful to be reminded of the goal and not get caught up in the implementation details!