The OP mentioned a large number of text files (thousands to millions at a time, up to a couple of MB each). I think parallelization is better done at the file level: build a list of input files and chunk the list instead. Since the list may range from thousands to millions of entries, go with a chunk_size of 1 or 2.
Notice that the workers are spawned early, before the large array is created. Then create the array and pass its reference to MCE so that no extra copy is made. This is how to tackle a big job while keeping overhead low. And then, fasten your seat belt and enjoy the parallelization in top or htop.
use strict;
use warnings;

use MCE;
use Time::HiRes 'time';

sub process_file {
    my ($file) = @_;
    # real per-file work goes here; left empty to measure overhead only
}

# Spawn the workers first, before the large array exists.
my $mce = MCE->new(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => 2,
    user_func   => sub {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        process_file($_) for @{ $chunk_ref };
    }
)->spawn;

my @file_list = (1 .. 1_000_000);  # simulate a list of 1 million files

my $start = time;

$mce->process(\@file_list);  # pass a reference, not a copy

printf "%0.3f seconds\n", time - $start;

$mce->shutdown;  # reap workers
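The demo above fakes the file list with a range of numbers. For a real run, the list could be gathered with the core File::Find module. This is only a minimal sketch: build_file_list is a hypothetical helper name, and it assumes every regular file under one directory tree should be processed.

```perl
use strict;
use warnings;
use File::Find ();

# Hypothetical helper: collect every regular file under $dir.
# Returns an array reference so the list is never copied.
sub build_file_list {
    my ($dir) = @_;
    my @files;
    File::Find::find({
        # with no_chdir, $_ holds the full path, same as $File::Find::name
        wanted   => sub { push @files, $File::Find::name if -f $_ },
        no_chdir => 1,
    }, $dir);
    return \@files;
}

my $dir       = @ARGV ? $ARGV[0] : '.';
my $file_list = build_file_list($dir);

printf "%d files found under %s\n", scalar @{ $file_list }, $dir;
```

Since the helper already returns a reference, it can be handed to $mce->process in place of \@file_list.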
Let's find out the IPC overhead; I was curious myself. With process_file left empty, the timings below measure chunking and IPC only.
chunk_size  1    3.773 seconds    1 million chunks
chunk_size  2    1.930 seconds    500 thousand chunks
chunk_size 10    0.423 seconds    100 thousand chunks
chunk_size 20    0.234 seconds    50 thousand chunks
It is mind-boggling nonetheless: just a fraction of a second for 50 thousand chunks. Moreover, 2 seconds of overhead will not be felt when processing 500 thousand files, nor will 4 seconds when handling 1 million files.
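To put the timings above in per-chunk terms, divide each elapsed time by its chunk count. A small sketch of that arithmetic, using only the numbers from the table (usec_per_chunk is a name made up for this illustration):

```perl
use strict;
use warnings;

# Timings copied from the table above: [chunk_size, seconds, chunks].
my @runs = (
    [  1, 3.773, 1_000_000 ],
    [  2, 1.930,   500_000 ],
    [ 10, 0.423,   100_000 ],
    [ 20, 0.234,    50_000 ],
);

# Hypothetical helper: average overhead per chunk, in microseconds.
sub usec_per_chunk {
    my ($seconds, $chunks) = @_;
    return $seconds / $chunks * 1e6;
}

for my $run (@runs) {
    my ($chunk_size, $seconds, $chunks) = @{ $run };
    printf "chunk_size %2d  %.2f microseconds per chunk\n",
        $chunk_size, usec_per_chunk($seconds, $chunks);
}
```

On these numbers the overhead works out to roughly 4 to 5 microseconds per chunk, which is small next to the time needed to read and process a file of a couple of MB.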