http://www.perlmonks.org?node_id=1158983


I've tried the below code on a couple different Linux boxes. They all seem to get data corruption reading the index. The below works fine if you un-rem the 'next' (disabling Parallel::ForkManager) or gunzip the index beforehand and remove the ':gzip' (disabling PerlIO::gzip). Number of concurrent threads does not appear to matter. The corruption appears to always start at about the same line number for each index, but at different line numbers for different indexes. Running the same thing multiple times will sometimes yield the same exact corruption and sometimes not. A couple indexes (100K each) you can test with: 2015-27 2015-48
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
use PerlIO::gzip;

my $pm = Parallel::ForkManager->new(12);

open(IN, '<:gzip', 'wat.paths.gz') or die "can't open index";
while (my $file = <IN>) {
    print length($file) . ":$file\n" if length($file) > 142;
    #next;                 # uncomment to disable forking
    next if $pm->start;    # parent continues the loop
    $pm->finish;           # child exits immediately
}
close IN;
$pm->wait_all_children;
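
[Editor's aside, not part of the original post: the symptoms are consistent with a known fork-plus-buffered-read hazard. When a forked child exits, its implicit close of the inherited handle can seek the shared file offset back to the child's logical read position, scrambling the stream the parent is still reading. One hedged workaround sketch is to read through a pipe from an external gzip process, since a pipe has no seekable offset to rewind:]

#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(12);

# A pipe has no seekable offset, so a child's implicit close at exit
# cannot rewind the parent's read position the way it can on a file.
open(my $in, '-|', 'gzip', '-dc', 'wat.paths.gz')
    or die "can't open index: $!";

while (my $file = <$in>) {
    print length($file) . ":$file\n" if length($file) > 142;
    next if $pm->start;    # parent keeps reading; child falls through
    $pm->finish;           # child exits without touching $in
}
close $in;
$pm->wait_all_children;

[The trade-off is an extra gzip process, but the decompressed stream then behaves like any other pipe under fork.]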

Replies are listed 'Best First'.
Re: PerlIO::gzip and Parallel::ForkManager do not play nice
by marioroy (Prior) on Mar 29, 2016 at 02:46 UTC

    Hi Pascal666,

    The following are three parallel demonstrations using MCE::Flow, MCE::Hobo, and threads. The shared input and output handles are managed by MCE::Shared.

    MCE::Flow and MCE::Shared

    Many-Core Engine provides chunking abilities, not used here; a sketch of chunking follows the example below.

    use strict;
    use warnings;
    use MCE::Flow;
    use MCE::Shared;
    use PerlIO::gzip;

    my $IN  = MCE::Shared->handle( '<:gzip', 'wat.paths.gz' );
    my $OUT = MCE::Shared->handle( '>', \*STDOUT );

    mce_flow { max_workers => 12 }, sub {
        while (my $file = <$IN>) {
            print $OUT length($file) . ":$file" if length($file) > 142;
        }
    };
    close $IN;
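
    [Editor's sketch of the chunking mentioned above, not part of the original reply: mce_flow_f reads a file in chunks and hands each worker an array of records, with MCE->print serializing output through the manager. It assumes 'wat.paths' is an uncompressed copy of the index, since this interface reads a plain file.]

    use strict;
    use warnings;
    use MCE::Flow;

    # 'wat.paths' is a hypothetical uncompressed copy of the index.
    mce_flow_f {
        max_workers => 12,
        chunk_size  => 500,    # values <= 8192 mean records (lines) per chunk
    }, sub {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        for my $file (@{ $chunk_ref }) {
            MCE->print(length($file) . ":$file") if length($file) > 142;
        }
    }, 'wat.paths';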

    MCE::Hobo and MCE::Shared

    A Hobo is a migratory worker inside the machine that carries the asynchronous gene. Hobos are equipped with threads-like capability for running code asynchronously. Unlike threads, each hobo is a unique process to the underlying OS. The IPC is managed by MCE::Shared, which runs on all the major platforms including Cygwin.

    use strict;
    use warnings;
    use MCE::Hobo;
    use MCE::Shared;
    use PerlIO::gzip;

    my $IN  = MCE::Shared->handle( '<:gzip', 'wat.paths.gz' );
    my $OUT = MCE::Shared->handle( '>', \*STDOUT );

    sub task {
        while (my $file = <$IN>) {
            print $OUT length($file) . ":$file" if length($file) > 142;
        }
    }

    MCE::Hobo->create('task') for 1 .. 12;

    # do other stuff if desired

    $_->join for MCE::Hobo->list;
    close $IN;
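
    [Editor's extension, not marioroy's: like threads, MCE::Hobo hands a task's return value back through join, so each hobo can tally its own matches and the parent can aggregate. A minimal sketch:]

    use strict;
    use warnings;
    use MCE::Hobo;
    use MCE::Shared;
    use PerlIO::gzip;

    my $IN = MCE::Shared->handle( '<:gzip', 'wat.paths.gz' );

    sub count_long {
        my $long = 0;
        while (my $file = <$IN>) {
            $long++ if length($file) > 142;
        }
        return $long;    # handed back to the parent by join
    }

    MCE::Hobo->create('count_long') for 1 .. 12;

    my $total = 0;
    $total += $_->join for MCE::Hobo->list;
    close $IN;
    print "long lines: $total\n";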

    threads and MCE::Shared

    The code for MCE::Hobo and threads is very similar.

    use strict;
    use warnings;
    use threads;
    use MCE::Shared;
    use PerlIO::gzip;

    my $IN  = MCE::Shared->handle( '<:gzip', 'wat.paths.gz' );
    my $OUT = MCE::Shared->handle( '>', \*STDOUT );

    sub task {
        while (my $file = <$IN>) {
            print $OUT length($file) . ":$file" if length($file) > 142;
        }
    }

    threads->create('task') for 1 .. 12;

    # do other stuff if desired

    $_->join for threads->list;
    close $IN;

    All three examples work across the board, including Perl on Windows and Cygwin, provided the gzip binary and the PerlIO::gzip module are installed.

    Update:

    The upcoming MCE::Shared 1.002 release will support the following construction, allowing the main or worker process to handle the error. I've wanted the shared open call to feel like the native open call.

    use MCE::Shared 1.002;

    mce_open my $IN,  "<:gzip", "wat.paths.gz" or die "open error: $!";
    mce_open my $OUT, ">", \*STDOUT            or die "open error: $!";
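
    [Editor's sketch, not part of the original update: how the new construction might slot into the MCE::Flow example above. The logic is unchanged; only the opens differ, assuming the 1.002 behavior described in the update.]

    use strict;
    use warnings;
    use MCE::Flow;
    use MCE::Shared 1.002;
    use PerlIO::gzip;

    # mce_open fails like native open, so the usual "or die" idiom applies.
    mce_open my $IN,  "<:gzip", "wat.paths.gz" or die "open error: $!";
    mce_open my $OUT, ">", \*STDOUT            or die "open error: $!";

    mce_flow { max_workers => 12 }, sub {
        while (my $file = <$IN>) {
            print $OUT length($file) . ":$file" if length($file) > 142;
        }
    };
    close $IN;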

    Kind regards, Mario