http://www.perlmonks.org?node_id=1069569

cganote has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I am working on a problem where I've downloaded several hundred files that are about 20GB each. I need to checksum each file and compare it to the provided value to make sure each download is correct. md5sum takes a while on files that large, and I thought I could speed things up by running it in parallel.
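Roughly what I'm after is something like this - an untested sketch, with the file names and the expected-checksum hash made up for illustration (the real list of provided md5s comes from the download site):

#!/usr/bin/perl -w
# checksum_sketch.pl - hypothetical outline of the end goal, not the real script
use strict;
use Digest::MD5;
use Parallel::ForkManager;

# Made-up example data: filename => expected md5 from the provider
my %expected = (
    'sample1.bam' => 'd41d8cd98f00b204e9800998ecf8427e',
    'sample2.bam' => '9e107d9d372bb6826bd81d3542a419d6',
);

my $fork = Parallel::ForkManager->new(8);
foreach my $file (keys %expected) {
    $fork->start and next;
    open(my $fh, "<", $file) or die "Cannot open $file: $!\n";
    binmode($fh);                                      # checksum the raw bytes
    my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    my $status = ($md5 eq $expected{$file}) ? "OK" : "MISMATCH";
    warn "$file\t$md5\t$status\n";                     # each child reports on STDERR for now
    $fork->finish;
}
$fork->wait_all_children;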
I added Parallel::ForkManager to my repertoire for the download itself. I went ahead and added it blindly at first, curious to see whether the single file it was writing to would come out misformatted - and it was =D

I attempted to solve it like so:
#!/usr/bin/perl -w
# testlock.pl
use strict;
use Parallel::ForkManager;
use Fcntl qw(:flock SEEK_END);

my @timenow = localtime;
open(my $out, ">", "output_" . $timenow[1] . "_" . $timenow[0] . ".txt")
    || die "Could not open output: $!\n";
my $stdout = select($out);
$| = 1;
select($stdout);

my @files = (1 .. 100);
my $fork  = new Parallel::ForkManager(8);
foreach my $file (@files) {
    $fork->start and next;
    my $checksum = "md5sum $file";
    flock($out, LOCK_EX)    or die "Cannot lock filehandle - $!\n";
    seek($out, 0, SEEK_END) or die "Cannot seek - $!\n";
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    flock($out, LOCK_UN)    or die "Cannot unlock filehandle - $!\n";
    $fork->finish;
}
$fork->wait_all_children;
close $out;

However, after running this script a hundred times, I noticed that a significant number of the output files came out with different sizes. Here is my understanding of the situation (please correct me kindly if I'm off base):
The filehandle that I open before the loop is shared across the child processes (as described in perlfunc), and the file offset is shared along with it. A problem can occur when two processes write simultaneously before either updates the offset, so effectively one overwrites the other at that position.
I thought the flock call would prevent this, by requiring each process to request and respect a lock before writing. I also suspected that buffered writes might be part of the issue.
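For reference, this is the locking pattern I was aiming for, with an explicit flush added before the unlock - that flush is my guess at a missing step, not something I've verified (IO::Handle supplies flush on lexical filehandles):

#!/usr/bin/perl -w
# testlock_flush.pl - same idea as testlock.pl, but each child flushes
# before releasing the lock; a sketch, not a verified fix
use strict;
use IO::Handle;                     # provides $out->flush
use Fcntl qw(:flock SEEK_END);
use Parallel::ForkManager;

open(my $out, ">", "output_flush.txt") or die "Could not open output: $!\n";

my @files = (1 .. 100);
my $fork  = Parallel::ForkManager->new(8);
foreach my $file (@files) {
    $fork->start and next;
    my $checksum = "md5sum $file";             # placeholder, as in the test script
    flock($out, LOCK_EX)    or die "Cannot lock filehandle - $!\n";
    seek($out, 0, SEEK_END) or die "Cannot seek - $!\n";
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    $out->flush             or die "Cannot flush - $!\n";   # drain the buffer while we still hold the lock
    flock($out, LOCK_UN)    or die "Cannot unlock filehandle - $!\n";
    $fork->finish;
}
$fork->wait_all_children;
close $out;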

I went back and tried this without sharing a filehandle:
#!/usr/bin/perl -w
# test.pl
use strict;
use local::lib;
use LWP::Simple;
use Cwd;
use Parallel::ForkManager;

my @timenow = localtime;
my @files   = (1 .. 100);
my $fork    = new Parallel::ForkManager(8);
foreach my $file (@files) {
    $fork->start and next;
    open(my $out, ">>", "output_newfh_" . $timenow[1] . "_" . $timenow[0] . ".txt")
        || die "Could not open output: $!\n";
    my $checksum = "md5sum $file";
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    close $out;
    $fork->finish;
}
$fork->wait_all_children;

This works as expected - the file sizes are the same across many trials. My questions are: why didn't the first strategy work? Is something happening when a separate filehandle is opened in each child (some automatic locking somewhere?) that prevents overwrites, and can I count on this tactic being correct? Would it make more sense to try this with ithreads instead?
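One alternative I've been wondering about is skipping the shared filehandle entirely and having each child pass its result back to the parent, which the Parallel::ForkManager docs support via a run_on_finish callback and a data reference handed to finish() (available in 0.7.6 and later). An untested sketch of that, where only the parent writes the output file:

#!/usr/bin/perl -w
# collect_results.pl - sketch: children pass results back, parent does all the writing
use strict;
use Parallel::ForkManager;

open(my $out, ">", "output_collected.txt") or die "Could not open output: $!\n";

my @files = (1 .. 100);
my $fork  = Parallel::ForkManager->new(8);

# Runs in the parent each time a child exits; the last argument is the
# reference that the child passed to finish()
$fork->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data_ref) = @_;
    print $out $$data_ref if defined $data_ref;
});

foreach my $file (@files) {
    $fork->start and next;
    my $checksum = "md5sum $file";    # placeholder, as in the test scripts
    my $line = "Analysis for file $file\n\tchecksum $checksum\n";
    $fork->finish(0, \$line);         # ship the result back to the parent
}
$fork->wait_all_children;
close $out;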

The order of the output is not important, but it must all be there. I'm running this on Red Hat 6 with Perl 5.10. The system has flock(2) and fork. The files are all genomic data in BAM format. The underlying filesystem is Lustre, which I'm hoping will play nicely with the heavy I/O of the md5 calls in this program. In the example programs above, I simplified the code as much as possible.