http://www.perlmonks.org?node_id=1221371


in reply to About text file parsing

Greetings, dideod.yang,

The regular expressions in your code present an opportunity for running in parallel. With parallel cores among us (our friends), let us take Perl for a spin. Please find below the serial and parallel demonstrations.

Serial

use strict;
use warnings;

open my $input_fh,  "<", "test.txt"   or die "open error: $!";
open my $sample_fh, ">", "sample.txt" or die "open error: $!";
open my $good_fh,   ">", "good.txt"   or die "open error: $!";

while (<$input_fh>) {
    if (/^sample\s+(\S+)/) {
        print $sample_fh $1, "\n";
    }
    elsif (/^good\s+(\S+)/) {
        print $good_fh $1, "\n";
    }
}

close $input_fh;
close $sample_fh;
close $good_fh;

Parallel

use strict;
use warnings;

use MCE;

open my $sample_fh, ">", "sample.txt" or die "open error: $!";
open my $good_fh,   ">", "good.txt"   or die "open error: $!";

# worker function
sub task {
    my ( $mce, $slurp_ref, $chunk_id ) = @_;
    my ( $sample_buf, $good_buf ) = ( '', '' );

    # open file handle to scalar ref
    open my $input_fh, "<", $slurp_ref;

    # append to buffers inside the loop
    while (<$input_fh>) {
        if (/^sample\s+(\S+)/) {
            $sample_buf .= $1 . "\n";
        }
        elsif (/^good\s+(\S+)/) {
            $good_buf .= $1 . "\n";
        }
    }

    close $input_fh;

    # Send buffers to the manager process to print accordingly.
    # This prevents parallel workers from garbling output handles.
    MCE->print( $sample_fh, $sample_buf );
    MCE->print( $good_fh,   $good_buf   );
}

# spawn workers early, optionally
my $mce = MCE->new(
    chunk_size  => '2m',    # 2 megabytes
    max_workers => 4,
    use_slurpio => 1,
    user_func   => \&task,
)->spawn;

# process input file(s)
$mce->process({ input_data => "test.txt" });

# shutdown workers
$mce->shutdown;

# close output handles
close $sample_fh;
close $good_fh;
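A note on the design: each MCE->print call is serviced by the manager process, so a worker's buffered results are written out in one piece and lines never interleave mid-record. The order of chunks in sample.txt and good.txt is not guaranteed, though. If ordering matters, one could gather results keyed by $chunk_id and have the manager write them in sequence.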

50 million line test

The tests were timed on a system with an NVMe SSD. Notice the user times. MCE has low overhead.

$ time perl test_serial.pl

real    0m22.225s
user    0m22.018s
sys     0m0.171s

$ time perl test_parallel.pl

real    0m5.887s
user    0m22.925s
sys     0m0.293s
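For anyone wanting to reproduce a similar run, below is a minimal sketch that writes a large test.txt. The line count, prefixes, and token format are assumptions for illustration, not the exact input used for the timings above.

use strict;
use warnings;

# Hypothetical generator: writes lines beginning with "sample", "good",
# or "other", each followed by a token, so the parsers above have
# something to match. The 50 million line count is an assumption.
open my $out_fh, ">", "test.txt" or die "open error: $!";

for my $i ( 1 .. 50_000_000 ) {
    my $r = $i % 3;
    if    ( $r == 0 ) { print $out_fh "sample token$i\n"; }
    elsif ( $r == 1 ) { print $out_fh "good token$i\n";   }
    else              { print $out_fh "other token$i\n";  }
}

close $out_fh;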

Regards, Mario