Re: About text file parsing

by tybalt89 (Prior)
on Aug 30, 2018 at 18:58 UTC ( #1221387=note )


in reply to About text file parsing

See if it is faster to read big chunks at a time, as in this simple test case (of course, modify it for your file).
This runs the regexes only once per chunk, instead of once per line.

#!/usr/bin/perl
# https://perlmonks.org/?node_id=1221282

open my $fh, '<', \<<END;
###### test.txt########
sample AA
sample BB
Not sample CC
good boy
good yyy
bad aaa
END

local $/ = \1e6;        # or bigger chunk depending on your memory size
while( <$fh> )          # read big chunk
  {
  $_ .= do { local $/ = "\n"; <$fh> // '' };   # read any partial line
  push @sample, /^sample\s+(\S+)/gm;
  push @good,   /^good\s+(\S+)/gm;
  }
close($fh);

print "sample = @sample\n good = @good\n";

Outputs:

sample = AA BB
 good = boy yyy
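
A note on the `$_ .= do { local $/ = "\n"; <$fh> // '' }` line: with `$/` set to a fixed byte count, a chunk will usually end in the middle of a line, so that extra read pulls in the rest of the partial line before the regexes run. Here is a minimal sketch, not part of the post above; the 10-byte record size and inline data are made up purely to force a split mid-line:

#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', \"good boy\ngood yyy\nbad aaa\n";

local $/ = \10;     # deliberately tiny chunk so a line gets split
while( <$fh> )
  {
  $_ .= do { local $/ = "\n"; <$fh> // '' };   # finish the partial line
  print "chunk: [$_]";
  }
close($fh);

Without the extra read, the first chunk would end at "good boy\ng", and a line split across chunks could be matched only partially or not at all by the anchored regexes.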

Replies are listed 'Best First'.
Re^2: About text file parsing
by marioroy (Parson) on Aug 30, 2018 at 20:52 UTC

    That's cool, tybalt89. Each day, I learn something new about Perl.

    I ran it serially and in parallel with "test.txt" containing 50 million lines. There is no slowness using Perl v5.20 or higher.
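
    For anyone wanting to reproduce a similar timing, here is a minimal sketch for generating a large input file. It is not from the post; the line count, prefixes, and field values are assumptions, and the exact 50-million-line file used above is not shown in the thread.

    use strict;
    use warnings;

    # hypothetical generator: writes 50 million lines mixing the three prefixes
    open my $out, '>', 'test.txt' or die "open error: $!";

    for my $i ( 1 .. 50_000_000 ) {
        if    ( $i % 3 == 0 ) { print $out "sample AA$i\n" }
        elsif ( $i % 3 == 1 ) { print $out "good boy$i\n"  }
        else                  { print $out "bad aaa$i\n"   }
    }

    close $out;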

    Serial

    use strict;
    use warnings;

    open my $input_fh,  '<', 'test.txt'   or die "open error: $!";
    open my $sample_fh, '>', 'sample.txt' or die "open error: $!";
    open my $good_fh,   '>', 'good.txt'   or die "open error: $!";

    # tybalt89's technique running serially
    # see https://www.perlmonks.org/?node_id=1221387

    local $/ = \2e6;        # or bigger chunk depending on your memory size

    while (<$input_fh>) {   # read big chunk
        $_ .= do { local $/ = "\n"; <$input_fh> // '' };   # read any partial line
        print $sample_fh join("\n", /^sample\s+(\S+)/gm), "\n";
        print $good_fh   join("\n", /^good\s+(\S+)/gm  ), "\n";
    }

    close $input_fh;
    close $sample_fh;
    close $good_fh;

    Parallel

    use strict;
    use warnings;

    use MCE;

    open my $sample_fh, '>', 'sample.txt' or die "open error: $!";
    open my $good_fh,   '>', 'good.txt'   or die "open error: $!";

    # tybalt89's technique running parallel
    # see https://www.perlmonks.org/?node_id=1221387

    MCE->new(
        chunk_size  => '1m',
        max_workers => 4,
        use_slurpio => 1,
        input_data  => 'test.txt',
        user_func   => sub {
            my ( $mce, $slurp_ref, $chunk_id ) = @_;
            local $_ = ${ $slurp_ref };
            MCE->print($sample_fh, join("\n", /^sample\s+(\S+)/gm), "\n");
            MCE->print($good_fh,   join("\n", /^good\s+(\S+)/gm  ), "\n");
        }
    )->run;

    close $sample_fh;
    close $good_fh;

    Demo

    $ time /opt/perl-5.26.1/bin/perl demo_serial.pl

    real    0m15.662s
    user    0m15.025s
    sys     0m0.607s

    $ time /opt/perl-5.26.1/bin/perl demo_parallel.pl

    real    0m4.042s
    user    0m15.617s
    sys     0m0.345s

    Regards, Mario
