Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re^5: Faster and more efficient way to read a file vertically

by johngg (Canon)
on Nov 06, 2017 at 17:32 UTC ( #1202865=note: print w/replies, xml ) Need Help??


in reply to Re^4: Faster and more efficient way to read a file vertically
in thread Faster and more efficient way to read a file vertically

An excellent suggestion!!

Adding this method to the tests and benchmark ...

unpackM => sub { # Multi-line unpack suggested by LanX seek $inFH, 0, 0; my $buffer = <$inFH>; my $lineLen = length $buffer; my $nLines = 500; my $chunkSize = $lineLen * $nLines; seek $inFH, 0, 0; my $retStr; my $fmt = qq{(x${offset}ax@{ [ $lineLen - $offset - 1 ] })*}; while ( my $bytesRead = read $inFH, $buffer, $chunkSize ) { $retStr .= join q{}, unpack $fmt, $buffer; } return \ $retStr; },

... produced a new clear winner.

ok 1 - brutish ok 2 - pushAoA ok 3 - regex ok 4 - rsubstr ok 5 - seek ok 6 - split ok 7 - substr ok 8 - unpack ok 9 - unpackM Rate pushAoA brutish split seek regex unpack substr rsubs +tr unpackM pushAoA 1.14/s -- -32% -60% -61% -90% -97% -98% -9 +8% -98% brutish 1.68/s 47% -- -41% -43% -86% -95% -96% -9 +7% -97% split 2.83/s 148% 69% -- -3% -76% -92% -94% -9 +4% -95% seek 2.93/s 157% 75% 4% -- -76% -92% -94% -9 +4% -95% regex 12.0/s 952% 618% 325% 310% -- -66% -75% -7 +5% -80% unpack 35.1/s 2970% 1993% 1140% 1097% 192% -- -27% -2 +8% -43% substr 47.8/s 4081% 2751% 1588% 1530% 297% 36% -- - +3% -22% rsubstr 49.1/s 4193% 2827% 1634% 1574% 308% 40% 3% +-- -20% unpackM 61.1/s 5247% 3546% 2059% 1985% 408% 74% 28% 2 +5% -- 1..9

However, your idea of reading and processing larger chunks of the file led me to consider whether using a mask ANDed with a larger buffer would produce any improvement. Initial attempts using a regex to pull out non-NULL characters after ANDing were not encouraging but using tr instead was much better. This routine ...

ANDmask => sub { # Multi-line AND mask by johngg seek $inFH, 0, 0; my $buffer = <$inFH>; my $lineLen = length $buffer; my $nLines = 500; my $chunkSize = $lineLen * $nLines; seek $inFH, 0, 0; my $retStr; my $mask = qq{\x00} x ${offset} . qq{\xff} . qq{\x00} x ( $lineLen - $offset - 1 ); $mask x= $nLines; while ( my $bytesRead = read $inFH, $buffer, $chunkSize ) { ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d; $retStr .= $anded; } return \ $retStr; },

... seems to produce the best result so far.

ok 1 - ANDmask ok 2 - brutish ok 3 - pushAoA ok 4 - regex ok 5 - rsubstr ok 6 - seek ok 7 - split ok 8 - substr ok 9 - unpack ok 10 - unpackM Rate pushAoA brutish split seek regex unpack substr rsubstr + unpackM ANDmask pushAoA 1.11/s -- -35% -61% -62% -91% -97% -98% -98% + -98% -99% brutish 1.71/s 55% -- -39% -41% -86% -95% -96% -96% + -97% -98% split 2.82/s 155% 65% -- -3% -77% -92% -94% -94% + -95% -97% seek 2.91/s 163% 70% 3% -- -76% -92% -94% -94% + -95% -97% regex 12.3/s 1010% 617% 336% 322% -- -65% -74% -75% + -79% -88% unpack 35.0/s 3060% 1943% 1141% 1102% 185% -- -25% -27% + -40% -67% substr 46.9/s 4137% 2638% 1564% 1512% 282% 34% -- -3% + -20% -55% rsubstr 48.2/s 4254% 2714% 1610% 1556% 292% 38% 3% -- + -18% -54% unpackM 58.7/s 5194% 3321% 1979% 1914% 377% 68% 25% 22% + -- -44% ANDmask 105/s 9407% 6045% 3634% 3517% 757% 201% 124% 118% + 80% -- 1..10

I would be interested to know if any Monk can spot flaws in the benchmark?

Cheers,

JohnGG

Replies are listed 'Best First'.
Re^6: Faster and more efficient way to read a file vertically
by vr (Hermit) on Nov 07, 2017 at 01:40 UTC

    Not "flaws", but it's measuring performance per size of read buffer... (something LanX was saying needs tuning)

    ~$ perl vert4.pl Benchmark: timing 3 iterations of 1, 10, 100, 1000, 10000, 100000, 100 +0000... 1: 1 wallclock secs ( 0.94 usr + 0.01 sys = 0.95 CPU) @ 3 +.16/s (n=3) (warning: too few iterations for a reliable count) 10: 1 wallclock secs ( 0.23 usr + 0.02 sys = 0.25 CPU) @ 12 +.00/s (n=3) (warning: too few iterations for a reliable count) 100: 0 wallclock secs ( 0.17 usr + 0.01 sys = 0.18 CPU) @ 16 +.67/s (n=3) (warning: too few iterations for a reliable count) 1000: 0 wallclock secs ( 0.16 usr + 0.03 sys = 0.19 CPU) @ 15 +.79/s (n=3) (warning: too few iterations for a reliable count) 10000: 0 wallclock secs ( 0.18 usr + 0.00 sys = 0.18 CPU) @ 16 +.67/s (n=3) (warning: too few iterations for a reliable count) 100000: 0 wallclock secs ( 0.20 usr + 0.03 sys = 0.23 CPU) @ 13 +.04/s (n=3) (warning: too few iterations for a reliable count) 1000000: 1 wallclock secs ( 0.23 usr + 0.03 sys = 0.26 CPU) @ 11 +.54/s (n=3) (warning: too few iterations for a reliable count)

      I guess that the tuning parameters will vary depending on the specification of the target system and the line length of the data file. On your system the best performance (without narrowing it down further) looks to be with a 10,000 line buffer. On my rather elderly, vintage 2008 IIRC, Core 2 Duo laptop the sweet spot is around 1,000 lines for both the unpack and mask methods. Working on a 2,500,000 line file with 51 byte (inc. line terminator) lines I get the following ...

      ok 1 - ANDmask ok 2 - unpackM Rate u950 u1050 u900 u1100 u1000 A950 A1050 A1100 A900 A10 +00 u950 1.28/s -- -0% -0% -1% -1% -39% -39% -39% -39% -4 +1% u1050 1.28/s 0% -- -0% -0% -1% -39% -39% -39% -39% -4 +1% u900 1.28/s 0% 0% -- -0% -1% -39% -39% -39% -39% -4 +1% u1100 1.28/s 1% 0% 0% -- -1% -39% -39% -39% -39% -4 +1% u1000 1.29/s 1% 1% 1% 1% -- -38% -38% -39% -39% -4 +0% A950 2.10/s 65% 64% 64% 64% 62% -- -0% -0% -0% - +3% A1050 2.10/s 65% 64% 64% 64% 63% 0% -- -0% -0% - +3% A1100 2.10/s 65% 64% 64% 64% 63% 0% 0% -- -0% - +3% A900 2.11/s 65% 65% 65% 64% 63% 0% 0% 0% -- - +2% A1000 2.16/s 69% 69% 69% 68% 67% 3% 3% 3% 2% +-- 1..2

      ... with this code.

      Cheers,

      JohnGG

        Hi, johngg, I rather tried to communicate that all contestants should be placed in equal conditions, i.e. let them all read in chunks, rather than in single lines, as it was for some of them. But all the same your "ANDmask" is fastest.

        Which is weird. Simple act of extracting some part of data, instead of just indexing, requires, to be fast, modification of unrelated data. Sure, it's because of speed of "anding" and transliteration, but still... Therefore, here is something completely different (sorry I keep adjusting your setup to suite my "dna.txt"):

        use strict; use warnings; use Benchmark qw{ cmpthese timethese }; use Test::More qw{ no_plan }; use String::Random 'random_regex'; my $fn = 'dna.txt'; unless ( -e $fn ) { open my $fh, '>', $fn; print $fh random_regex( '[ACTG]{42}' ), "\n" for 1 .. 1e6; } open my $inFH, q{<}, $fn or die $!; binmode $inFH; my $buffer = <$inFH>; my $lineLen = length $buffer; my $nLines = 500; my $chunkSize = $lineLen * $nLines; my $offset = 9; # Column 10 if numbering from 1 my %methods = ( ANDmask => sub { # Multi-line AND mask by johngg seek $inFH, 0, 0; my $retStr; my $mask = qq{\x00} x ${offset} . qq{\xff} . qq{\x00} x ( $lineLen - $offset - 1 ); $mask x= $nLines; while ( my $bytesRead = read $inFH, $buffer, $chunkSize ) { ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d; $retStr .= $anded; } return \ $retStr; }, pdl => sub { seek $inFH, 0, 0; use PDL; my $retStr; my $chunkPDL = zeroes( byte, $lineLen, $nLines ); my $bufRef = $chunkPDL-> get_dataref; while ( my $bytesRead = read $inFH, $$bufRef, $chunkSize ) { my $lastLine = $bytesRead / $lineLen - 1; $retStr .= ${ $chunkPDL-> slice( "$offset,0:$lastLine" ) -> get_dataref } } return \ $retStr; }, ); ok ${ $methods{ ANDmask }-> ()} eq ${ $methods{ pdl }-> ()}; cmpthese( -10, { map { $_ => $methods{ $_ } } keys %methods });

        >perl vert5.pl ok 1 Rate ANDmask pdl ANDmask 7.86/s -- -55% pdl 17.3/s 120% -- 1..1

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1202865]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (4)
As of 2018-11-18 22:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My code is most likely broken because:
















    Results (206 votes). Check out past polls.

    Notices?