http://www.perlmonks.org?node_id=1221371


in reply to About text file parsing

Greetings, dideod.yang,

The regular expressions in your code present an opportunity for running in parallel. With parallel cores among us (our friends), let us take Perl for a spin. Please find below the serial and parallel demonstrations.

Serial

use strict;
use warnings;

open my $input_fh,  "<", "test.txt"   or die "open error: $!";
open my $sample_fh, ">", "sample.txt" or die "open error: $!";
open my $good_fh,   ">", "good.txt"   or die "open error: $!";

while (<$input_fh>) {
    if (/^sample\s+(\S+)/) {
        print $sample_fh $1, "\n";
    }
    elsif (/^good\s+(\S+)/) {
        print $good_fh $1, "\n";
    }
}

close $input_fh;
close $sample_fh;
close $good_fh;

Parallel

use strict;
use warnings;

use MCE;

open my $sample_fh, ">", "sample.txt" or die "open error: $!";
open my $good_fh,   ">", "good.txt"   or die "open error: $!";

# worker function
sub task {
    my ( $mce, $slurp_ref, $chunk_id ) = @_;
    my ( $sample_buf, $good_buf ) = ( '', '' );

    # open file handle to scalar ref
    open my $input_fh, "<", $slurp_ref;

    # append to buffers inside the loop
    while (<$input_fh>) {
        if (/^sample\s+(\S+)/) {
            $sample_buf .= $1 . "\n";
        }
        elsif (/^good\s+(\S+)/) {
            $good_buf .= $1 . "\n";
        }
    }

    close $input_fh;

    # Send buffers to the manager process to print accordingly.
    # This prevents parallel workers from garbling output handles.
    MCE->print( $sample_fh, $sample_buf );
    MCE->print( $good_fh,   $good_buf   );
}

# spawn workers early, optionally
my $mce = MCE->new(
    chunk_size  => '2m',    # 2 megabytes
    max_workers => 4,
    use_slurpio => 1,
    user_func   => \&task,
)->spawn;

# process input file(s)
$mce->process({ input_data => "test.txt" });

# shutdown workers
$mce->shutdown;

# close output handles
close $sample_fh;
close $good_fh;
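A note on the design: each MCE->print call is serviced by the manager process, so a worker's buffered results are written out in one piece and lines never interleave mid-record. The order of chunks in sample.txt and good.txt is not guaranteed, though. If ordering matters, one could gather results keyed by $chunk_id and have the manager write them in sequence.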

50 million line test

The tests were timed on a system with an NVMe SSD. Notice the user times. MCE has low overhead.

$ time perl test_serial.pl

real    0m22.225s
user    0m22.018s
sys     0m0.171s

$ time perl test_parallel.pl

real    0m5.887s
user    0m22.925s
sys     0m0.293s
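For anyone wanting to reproduce a similar run, below is a minimal sketch that writes a large test.txt. The line count, prefixes, and token format are assumptions for illustration, not the exact input used for the timings above.

use strict;
use warnings;

# Hypothetical generator: writes lines beginning with "sample", "good",
# or "other", each followed by a token, so the parsers above have
# something to match. The 50 million line count is an assumption.
open my $out_fh, ">", "test.txt" or die "open error: $!";

for my $i ( 1 .. 50_000_000 ) {
    my $r = $i % 3;
    if    ( $r == 0 ) { print $out_fh "sample token$i\n"; }
    elsif ( $r == 1 ) { print $out_fh "good token$i\n";   }
    else              { print $out_fh "other token$i\n";  }
}

close $out_fh;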

Regards, Mario