cganote has asked for the wisdom of the Perl Monks concerning the following question:
I am approaching a problem where I've downloaded several hundred files that are about 20GB each. I need to checksum each file and compare it to the provided value to make sure each file is correct. Running md5sum takes a while for files that large, and I thought I could speed this up by running it in parallel.
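For reference, the per-file check itself is just shelling out to md5sum and comparing against the value the provider published - something like this (the path and expected checksum below are made-up placeholders):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch only: placeholder path and expected value.
my $file     = 'sample_data.bam';
my $expected = 'd41d8cd98f00b204e9800998ecf8427e';

chomp(my $line = `md5sum $file`);    # md5sum prints "<digest>  <filename>"
my ($got) = split /\s+/, $line;
print $got eq $expected ? "$file OK\n" : "$file MISMATCH\n";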
I added Parallel::ForkManager to my repertoire for the download itself. I went ahead and just added it blindly, curious to see whether the single file it was writing to would come out mangled - and it did =D
#!/usr/bin/perl -w
#testlock.pl
use strict;
use Parallel::ForkManager;
use Fcntl qw(:flock SEEK_END);

my @timenow = localtime;
open(my $out, ">", "output_" . $timenow[1] . "_" . $timenow[0] . ".txt")
    || die "Could not open output: $!\n";

# turn on autoflush for the shared output handle
my $stdout = select($out);
$| = 1;
select($stdout);

my @files = (1 .. 100);
my $fork = new Parallel::ForkManager(8);

foreach my $file (@files) {
    $fork->start and next;
    my $checksum = "md5sum $file";    # placeholder string; the real script would run md5sum here
    flock($out, LOCK_EX) or die "Cannot lock filehandle - $!\n";
    seek($out, 0, SEEK_END) or die "Cannot seek - $!\n";
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    flock($out, LOCK_UN) or die "Cannot unlock filehandle - $!\n";
    $fork->finish;
}
$fork->wait_all_children;
close $out;
However, running this script a hundred times, I noticed that a significant number of the output files came out different sizes.
Here is my understanding of the situation (please correct me kindly if I'm off base):
The filehandle that I open before the loop is duplicated across the forked processes (per perlfunc), and the dups share a single file position. A problem can occur when two processes write at the same time before either has updated that position, so effectively one overwrites the other at the same offset.
I thought the flock call would prevent this by requiring each process to take and respect a lock before writing. I also suspected that buffered writes might be contributing to the problem.
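For what it's worth, the pattern I've seen recommended elsewhere for writing through one shared handle is to flush explicitly before releasing the lock, so the buffered data hits the file while the lock is still held. This is just a sketch of that idiom, not what my script above actually does:

use strict;
use warnings;
use Fcntl qw(:flock SEEK_END);
use IO::Handle;    # for $fh->flush

# Sketch: lock, seek to the end, write, flush, and only then unlock.
sub locked_append {
    my ($fh, $text) = @_;
    flock($fh, LOCK_EX)    or die "Cannot lock filehandle - $!\n";
    seek($fh, 0, SEEK_END) or die "Cannot seek - $!\n";
    print {$fh} $text      or die "Cannot write - $!\n";
    $fh->flush             or die "Cannot flush - $!\n";    # empty the PerlIO buffer before the lock is released
    flock($fh, LOCK_UN)    or die "Cannot unlock filehandle - $!\n";
}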
#!/usr/bin/perl -w
#test.pl
use strict;
use local::lib;
use LWP::Simple;
use Cwd;
use Parallel::ForkManager;

my @timenow = localtime;
my @files = (1 .. 100);
my $fork = new Parallel::ForkManager(8);

foreach my $file (@files) {
    $fork->start and next;
    # each child opens its own append-mode handle instead of sharing one
    open(my $out, ">>", "output_newfh_" . $timenow[1] . "_" . $timenow[0] . ".txt")
        || die "Could not open output: $!\n";
    my $checksum = "md5sum $file";    # placeholder, as above
    print $out "Analysis for file $file\n\tchecksum $checksum\n";
    close $out;
    $fork->finish;
}
$fork->wait_all_children;
This works as expected - the file sizes come out the same across many trials. My questions are: why didn't the first strategy work; is something happening when each child opens its own filehandle (automatic locking somewhere?) that prevents overwrites; and can I count on this tactic being correct? Would it make more sense to try this using ithreads instead?
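Another option I've been wondering about (sketched below and untested on the real data) is to avoid writing from the children at all: Parallel::ForkManager can hand each child's result back to the parent via finish() and run_on_finish(), so only the parent touches the output file:

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my @files = (1 .. 100);    # placeholder list, as in the tests above
my %result;

my $fork = Parallel::ForkManager->new(8);

# Runs in the parent every time a child exits; $data is whatever the
# child passed to finish().
$fork->run_on_finish(sub {
    my ($pid, $exit, $ident, $signal, $core, $data) = @_;
    $result{$ident} = $$data if defined $data;
});

foreach my $file (@files) {
    $fork->start($file) and next;     # $file doubles as the child's $ident
    my $checksum = "md5sum $file";    # placeholder, as in the tests above
    $fork->finish(0, \$checksum);     # result is serialized back to the parent
}
$fork->wait_all_children;

# Only the parent writes, so no locking or append-mode tricks are needed.
open(my $out, ">", "output_parent.txt") || die "Could not open output: $!\n";
print $out "Analysis for file $_\n\tchecksum $result{$_}\n"
    for sort { $a <=> $b } keys %result;
close $out;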
Replies are listed 'Best First'.
Re: Multi-threaded behavior, file handle dups and writing to a file
by oiskuu (Hermit) on Jan 06, 2014 at 19:28 UTC
    by cganote (Initiate) on Jan 06, 2014 at 21:37 UTC
    by BrowserUk (Patriarch) on Jan 07, 2014 at 14:53 UTC
    by cganote (Initiate) on Jan 08, 2014 at 05:52 UTC
    by MidLifeXis (Monsignor) on Jan 07, 2014 at 14:48 UTC
Re: Multi-threaded behavior, file handle dups and writing to a file
by sundialsvc4 (Abbot) on Jan 06, 2014 at 21:24 UTC
    by cganote (Initiate) on Jan 08, 2014 at 06:04 UTC