PerlMonks  

Re: About text file parsing

by marioroy (Vicar)
on Aug 30, 2018 at 12:48 UTC


in reply to About text file parsing

Greetings, dideod.yang,

The regular expressions in your code present an opportunity for running in parallel. With parallel cores among us (our friends), let us take Perl for a spin. Please find below the serial and parallel demonstrations.

Serial

use strict;
use warnings;

open my $input_fh, "<", "test.txt" or die "open error: $!";
open my $sample_fh, ">", "sample.txt" or die "open error: $!";
open my $good_fh, ">", "good.txt" or die "open error: $!";

while (<$input_fh>) {
    if (/^sample\s+(\S+)/) {
        print $sample_fh $1, "\n";
    }
    elsif (/^good\s+(\S+)/) {
        print $good_fh $1, "\n";
    }
}

close $input_fh;
close $sample_fh;
close $good_fh;

Parallel

use strict;
use warnings;

use MCE;

open my $sample_fh, ">", "sample.txt" or die "open error: $!";
open my $good_fh, ">", "good.txt" or die "open error: $!";

# worker function
sub task {
    my ( $mce, $slurp_ref, $chunk_id ) = @_;
    my ( $sample_buf, $good_buf ) = ('', '');

    # open file handle to scalar ref
    open my $input_fh, "<", $slurp_ref;

    # append to buffers inside the loop
    while (<$input_fh>) {
        if (/^sample\s+(\S+)/) {
            $sample_buf .= $1 . "\n";
        }
        elsif (/^good\s+(\S+)/) {
            $good_buf .= $1 . "\n";
        }
    }

    close $input_fh;

    # Send buffers to the manager process to print accordingly.
    # This prevents parallel workers from garbling output handles.
    MCE->print($sample_fh, $sample_buf);
    MCE->print($good_fh, $good_buf);
}

# spawn workers early, optionally
my $mce = MCE->new(
    chunk_size  => '2m',    # 2 megabytes
    max_workers => 4,
    use_slurpio => 1,
    user_func   => \&task,
)->spawn;

# process input file(s)
$mce->process({ input_data => "test.txt" });

# shutdown workers
$mce->shutdown;

# close output handles
close $sample_fh;
close $good_fh;
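
The worker above reads its chunk through a file handle opened on a scalar reference. As a minimal sketch of that one technique in isolation, using only core Perl, the following runs the same two regular expressions over an in-memory string. The chunk contents are hypothetical; the thread does not show test.txt, so the line layout is assumed from the regular expressions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one chunk that a worker receives as a
# scalar reference when use_slurpio => 1 is set.
my $chunk     = "sample alpha 1\ngood beta 2\nother gamma 3\nsample delta 4\n";
my $slurp_ref = \$chunk;

my ( $sample_buf, $good_buf ) = ( '', '' );

# Open a read handle on the scalar itself (an in-memory file).
open my $input_fh, "<", $slurp_ref or die "open error: $!";

# Collect the first token after each keyword, one per line.
while (<$input_fh>) {
    if    (/^sample\s+(\S+)/) { $sample_buf .= "$1\n"; }
    elsif (/^good\s+(\S+)/)   { $good_buf   .= "$1\n"; }
}

close $input_fh;

print "sample buffer:\n$sample_buf";
print "good buffer:\n$good_buf";
```

Because the chunk is already in memory, the loop looks the same as in the serial version; only the source of the handle changes.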

50 million test

The tests were timed on a system with an NVMe SSD. Notice the user times: 22.0s serial versus 22.9s parallel. MCE has low overhead, while the wall-clock time drops from 22.2s to 5.9s.

$ time perl test_serial.pl
real 0m22.225s
user 0m22.018s
sys 0m0.171s

$ time perl test_parallel.pl
real 0m5.887s
user 0m22.925s
sys 0m0.293s
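
For readers who want the ratios spelled out, here is a small sketch that derives speedup, per-worker efficiency, and added CPU time from the timings reported above. The variable names are illustrative, and the worker count of 4 is taken from max_workers in the parallel script.

```perl
use strict;
use warnings;

# Timings copied from the two runs above, in seconds.
my %serial   = ( real => 22.225, user => 22.018 );
my %parallel = ( real => 5.887,  user => 22.925 );
my $workers  = 4;    # max_workers in the parallel script

# Wall-clock speedup and how much of the ideal 4x was reached.
my $speedup    = $serial{real} / $parallel{real};
my $efficiency = $speedup / $workers;

# Extra CPU time spent by the parallel run relative to the serial run.
my $cpu_extra = $parallel{user} / $serial{user} - 1;

printf "speedup:        %.2fx\n",  $speedup;
printf "efficiency:     %.0f%%\n", 100 * $efficiency;
printf "extra CPU time: %.1f%%\n", 100 * $cpu_extra;
```

With these numbers the speedup comes out at roughly 3.8x on 4 workers, for about 4% more user time.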

Regards, Mario

Re^2: About text file parsing
by marioroy (Vicar) on Aug 30, 2018 at 13:41 UTC

    Hi again,

    One may want to have the manager process receive and loop through @sample and @good. That will occupy an additional CPU core for the manager process itself.

    use strict;
    use warnings;

    use MCE;

    open my $sample_fh, ">", "sample.txt" or die "open error: $!";
    open my $good_fh, ">", "good.txt" or die "open error: $!";

    # worker function
    sub task {
        my ( $mce, $slurp_ref, $chunk_id ) = @_;
        my ( @sample, @good );

        # open file handle to scalar ref
        open my $input_fh, "<", $slurp_ref;

        # append to arrays inside the loop
        while (<$input_fh>) {
            if (/^sample\s+(\S+)/) {
                push @sample, $1;
            }
            elsif (/^good\s+(\S+)/) {
                push @good, $1;
            }
        }

        close $input_fh;

        # send arrays to the manager-process
        MCE->gather(\@sample, \@good);
    }

    # manager function
    sub gather {
        my ( $sample, $good ) = @_;

        # process sample
        for ( @{ $sample } ) {
            ;
        }

        # process good
        for ( @{ $good } ) {
            ;
        }
    }

    # spawn workers early, optionally
    my $mce = MCE->new(
        chunk_size  => '1m',    # 1 megabyte
        max_workers => 4,
        use_slurpio => 1,
        user_func   => \&task,
        gather      => \&gather,
    )->spawn;

    # process input file(s)
    $mce->process({ input_data => "test.txt" });

    # shutdown workers
    $mce->shutdown;

    # close output handles
    close $sample_fh;
    close $good_fh;

    The extra time comes from workers appending to local arrays and from the manager process receiving and looping through those arrays. There are 4 workers and the manager process running simultaneously on a machine with 4 real cores.

    $ time perl test_demo.pl
    real 0m9.932s
    user 0m43.956s
    sys 0m0.452s
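
The two loops in the manager function are left empty in the demonstration. As one possible way to fill them in, assuming the manager should write each item on its own line as the earlier scripts do, the sketch below shows a callback body in core Perl. It is not taken from the thread: the in-memory output handles and the simulated calls are stand-ins so that the example runs without MCE or disk files.

```perl
use strict;
use warnings;

# In-memory output targets so the sketch runs without touching disk;
# the script above opens sample.txt and good.txt instead.
my ( $sample_out, $good_out ) = ( '', '' );
open my $sample_fh, ">", \$sample_out or die "open error: $!";
open my $good_fh,   ">", \$good_out   or die "open error: $!";

# Manager-side callback: receives the two array refs sent by a worker.
sub gather {
    my ( $sample, $good ) = @_;
    print {$sample_fh} "$_\n" for @{ $sample };
    print {$good_fh}   "$_\n" for @{ $good };
}

# Simulate two workers reporting in (hypothetical data).
gather( [ 'alpha', 'delta' ], ['beta'] );
gather( ['eta'], [] );

close $sample_fh;
close $good_fh;

print "sample.txt would contain:\n$sample_out";
print "good.txt would contain:\n$good_out";
```

In a real run the callback fires as each worker finishes a chunk, so the order of lines in the output files may not match the order in the input file.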

    Update:

    Interestingly, Perl v5.20 and higher take about 2x longer to run this script. I'm not sure why. Yikes, possibly from the regular expressions? Checking this is on my TODO list. The timing above was captured with Perl v5.18.2 on the same machine.

    $ time /opt/perl-5.20.3/bin/perl test_demo.pl
    real 0m20.858s
    user 1m20.164s
    sys 0m8.488s

    Regards, Mario

      Once again, hi :)

      Using a simplified demonstration, regular expression matching appears to be about 3x slower in Perl v5.20 and higher. I'm not sure why.

      use strict;
      use warnings;

      use MCE;

      sub task {
          my ( $mce, $slurp_ref, $chunk_id ) = @_;

          # open file handle to scalar ref
          open my $input_fh, "<", $slurp_ref;

          while (<$input_fh>) {
              if (/^sample\s+(\S+)/) {
                  ;
              }
              elsif (/^good\s+(\S+)/) {
                  ;
              }
          }

          close $input_fh;
      }

      MCE->new(
          chunk_size  => '1m',
          max_workers => 4,
          use_slurpio => 1,
          user_func   => \&task
      );

      MCE->process({ input_data => "test.txt" });
      MCE->shutdown;

      Results

      $ time /opt/perl-5.8.9/bin/perl -I. test_demo.pl
      real 0m3.826s
      user 0m14.352s
      sys 0m0.133s

      $ time /opt/perl-5.10.1/bin/perl -I. test_demo.pl
      real 0m4.369s
      user 0m16.935s
      sys 0m0.126s

      $ time /opt/perl-5.12.5/bin/perl -I. test_demo.pl
      real 0m4.889s
      user 0m18.944s
      sys 0m0.134s

      $ time /opt/perl-5.14.4/bin/perl -I. test_demo.pl
      real 0m4.860s
      user 0m18.865s
      sys 0m0.127s

      $ time /opt/perl-5.16.3/bin/perl -I. test_demo.pl
      real 0m4.815s
      user 0m18.724s
      sys 0m0.129s

      $ time /opt/perl-5.18.4/bin/perl -I. test_demo.pl
      real 0m4.668s
      user 0m18.356s
      sys 0m0.116s

      $ time /opt/perl-5.20.3/bin/perl -I. test_demo.pl
      real 0m14.195s
      user 0m49.155s
      sys 0m7.282s

      $ time /opt/perl-5.22.4/bin/perl -I. test_demo.pl
      real 0m14.316s
      user 0m49.586s
      sys 0m7.041s

      $ time /opt/perl-5.24.3/bin/perl -I. test_demo.pl
      real 0m14.612s
      user 0m50.251s
      sys 0m7.531s

      $ time /opt/perl-5.26.1/bin/perl -I. test_demo.pl
      real 0m14.212s
      user 0m49.418s
      sys 0m6.999s

      $ time /opt/perl-5.28.0/bin/perl -I. test_demo.pl
      real 0m14.308s
      user 0m49.476s
      sys 0m7.137s
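
Since the cause is still open, one way to narrow it down might be to time the two ingredients of the worker loop separately and without MCE: reading lines from an in-memory handle, and matching the regular expressions against lines already held in an array. The sketch below is an assumption about how such a check could look, not something from the thread. The synthetic data and the line count are hypothetical, as test.txt is not shown.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical synthetic data; adjust the count to taste.
my $n_lines = 200_000;
my $data    = join '',
    map { ( $_ % 2 ? "sample" : "good" ) . " item$_ extra\n" } 1 .. $n_lines;

# Step 1: read lines from an in-memory handle, no matching.
my $t0 = time;
open my $fh, "<", \$data or die "open error: $!";
my @lines = <$fh>;
close $fh;
my $read_secs = time - $t0;

# Step 2: run the two regexes over lines already in memory.
my ( $sample_count, $good_count ) = ( 0, 0 );
$t0 = time;
for (@lines) {
    if    (/^sample\s+(\S+)/) { $sample_count++; }
    elsif (/^good\s+(\S+)/)   { $good_count++; }
}
my $match_secs = time - $t0;

printf "perl %vd: read %.3fs, match %.3fs (%d sample, %d good)\n",
    $^V, $read_secs, $match_secs, $sample_count, $good_count;
```

Running the same file under each perl binary listed above would show which step, if either, accounts for the difference. If neither step alone reproduces it, the combination of reading and matching inside one loop would be the next thing to test.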

      Regards, Mario
