PerlMonks  

Re^3: Faster and more efficient way to read a file vertically

by johngg (Canon)
on Nov 06, 2017 at 00:36 UTC ( #1202791 )


in reply to Re^2: Faster and more efficient way to read a file vertically
in thread Faster and more efficient way to read a file vertically

> Actually unpack was suggested (and not implemented) by me first. ;)

Ah! Sorry, I missed that :-/

Cheers,

JohnGG


Replies are listed 'Best First'.
Re^4: Faster and more efficient way to read a file vertically
by LanX (Archbishop) on Nov 06, 2017 at 01:13 UTC
    > Ah! Sorry, I missed that :-/

    No problem at all... ;-)

    I just wanted to point you to the possibility that unpack can process many lines at the same time, which essentially means "reading vertically".

    use strict;
    use warnings;

    my $line_count = 10;
    my $line       = join( q{}, 'a' .. 'z', 'A' .. 'Z' ) . "\n";
    my $file       = $line x $line_count;
    my $offset     = 2;
    my $rest       = length( $line ) - $offset - 1;
    my $fmt        = qq{(x${offset}ax${rest})${line_count}};
    my @results    = unpack $fmt, $file;
    print @results;

    This will print cccccccccc, i.e. the character at offset 2 from each of the ten lines.
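    The same format trick can be checked on a tiny string in isolation. A minimal sketch (names and values here are illustrative, not taken from the benchmark above):

    ```perl
    use strict;
    use warnings;

    # Grab the character at offset 2 from each of three 7-byte "lines"
    # in a single unpack call: skip 2 bytes, take 1, skip the rest,
    # repeated 3 times.
    my $line   = "abcdef\n";
    my $data   = $line x 3;
    my $offset = 2;
    my $rest   = length( $line ) - $offset - 1;    # bytes after the target
    my $fmt    = qq{(x${offset}ax${rest})3};       # "(x2ax4)3"
    print join( q{}, unpack $fmt, $data ), "\n";   # prints "ccc"
    ```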

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      An excellent suggestion!!

      Adding this method to the tests and benchmark ...

      unpackM => sub {    # Multi-line unpack suggested by LanX
          seek $inFH, 0, 0;
          my $buffer    = <$inFH>;
          my $lineLen   = length $buffer;
          my $nLines    = 500;
          my $chunkSize = $lineLen * $nLines;
          seek $inFH, 0, 0;
          my $retStr;
          my $fmt = qq{(x${offset}ax@{ [ $lineLen - $offset - 1 ] })*};
          while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
          {
              $retStr .= join q{}, unpack $fmt, $buffer;
          }
          return \$retStr;
      },

      ... produced a new clear winner.

      ok 1 - brutish
      ok 2 - pushAoA
      ok 3 - regex
      ok 4 - rsubstr
      ok 5 - seek
      ok 6 - split
      ok 7 - substr
      ok 8 - unpack
      ok 9 - unpackM
                 Rate pushAoA brutish split  seek regex unpack substr rsubstr unpackM
      pushAoA  1.14/s     --    -32%  -60%  -61%  -90%   -97%   -98%    -98%    -98%
      brutish  1.68/s    47%      --  -41%  -43%  -86%   -95%   -96%    -97%    -97%
      split    2.83/s   148%     69%    --   -3%  -76%   -92%   -94%    -94%    -95%
      seek     2.93/s   157%     75%    4%    --  -76%   -92%   -94%    -94%    -95%
      regex    12.0/s   952%    618%  325%  310%    --   -66%   -75%    -75%    -80%
      unpack   35.1/s  2970%   1993% 1140% 1097%  192%     --   -27%    -28%    -43%
      substr   47.8/s  4081%   2751% 1588% 1530%  297%    36%     --     -3%    -22%
      rsubstr  49.1/s  4193%   2827% 1634% 1574%  308%    40%     3%      --    -20%
      unpackM  61.1/s  5247%   3546% 2059% 1985%  408%    74%    28%     25%      --
      1..9

      However, your idea of reading and processing larger chunks of the file led me to consider whether ANDing a mask against a larger buffer would produce any improvement. Initial attempts using a regex to pull out the non-NUL characters after ANDing were not encouraging, but using tr to delete the NULs instead was much better. This routine ...

      ANDmask => sub {    # Multi-line AND mask by johngg
          seek $inFH, 0, 0;
          my $buffer    = <$inFH>;
          my $lineLen   = length $buffer;
          my $nLines    = 500;
          my $chunkSize = $lineLen * $nLines;
          seek $inFH, 0, 0;
          my $retStr;
          my $mask = qq{\x00} x $offset
                   . qq{\xff}
                   . qq{\x00} x ( $lineLen - $offset - 1 );
          $mask x= $nLines;
          while ( my $bytesRead = read $inFH, $buffer, $chunkSize )
          {
              ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
              $retStr .= $anded;
          }
          return \$retStr;
      },

      ... seems to produce the best result so far.

      ok 1 - ANDmask
      ok 2 - brutish
      ok 3 - pushAoA
      ok 4 - regex
      ok 5 - rsubstr
      ok 6 - seek
      ok 7 - split
      ok 8 - substr
      ok 9 - unpack
      ok 10 - unpackM
                 Rate pushAoA brutish split  seek regex unpack substr rsubstr unpackM ANDmask
      pushAoA  1.11/s     --    -35%  -61%  -62%  -91%   -97%   -98%    -98%    -98%    -99%
      brutish  1.71/s    55%      --  -39%  -41%  -86%   -95%   -96%    -96%    -97%    -98%
      split    2.82/s   155%     65%    --   -3%  -77%   -92%   -94%    -94%    -95%    -97%
      seek     2.91/s   163%     70%    3%    --  -76%   -92%   -94%    -94%    -95%    -97%
      regex    12.3/s  1010%    617%  336%  322%    --   -65%   -74%    -75%    -79%    -88%
      unpack   35.0/s  3060%   1943% 1141% 1102%  185%     --   -25%    -27%    -40%    -67%
      substr   46.9/s  4137%   2638% 1564% 1512%  282%    34%     --     -3%    -20%    -55%
      rsubstr  48.2/s  4254%   2714% 1610% 1556%  292%    38%     3%      --    -18%    -54%
      unpackM  58.7/s  5194%   3321% 1979% 1914%  377%    68%    25%     22%      --    -44%
      ANDmask   105/s  9407%   6045% 3634% 3517%  757%   201%   124%    118%     80%      --
      1..10
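      For anyone wanting to see why the mask works, the idea can be verified on a tiny string in isolation. A minimal sketch, separate from the benchmark harness (the string and offset are made up for illustration):

      ```perl
      use strict;
      use warnings;

      # ANDing a byte with "\xff" keeps it; ANDing with "\x00" zeroes it.
      # So a mask of NULs with a single "\xff" at $offset, repeated once
      # per line, keeps only the wanted column; tr then deletes the NULs.
      my $offset = 2;
      my $line   = "abcdef\n";
      my $data   = $line x 3;
      my $mask   = "\x00" x $offset
                 . "\xff"
                 . "\x00" x ( length( $line ) - $offset - 1 );
      $mask x= 3;
      ( my $anded = $data & $mask ) =~ tr{\x00}{}d;
      print "$anded\n";    # prints "ccc"
      ```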

      I would be interested to know if any Monk can spot flaws in the benchmark.

      Cheers,

      JohnGG

        Not "flaws" exactly, but here is performance measured against the size of the read buffer (the tuning LanX said was needed) ...

        ~$ perl vert4.pl
        Benchmark: timing 3 iterations of 1, 10, 100, 1000, 10000, 100000, 1000000...
              1:  1 wallclock secs ( 0.94 usr + 0.01 sys = 0.95 CPU) @  3.16/s (n=3)
                    (warning: too few iterations for a reliable count)
             10:  1 wallclock secs ( 0.23 usr + 0.02 sys = 0.25 CPU) @ 12.00/s (n=3)
                    (warning: too few iterations for a reliable count)
            100:  0 wallclock secs ( 0.17 usr + 0.01 sys = 0.18 CPU) @ 16.67/s (n=3)
                    (warning: too few iterations for a reliable count)
           1000:  0 wallclock secs ( 0.16 usr + 0.03 sys = 0.19 CPU) @ 15.79/s (n=3)
                    (warning: too few iterations for a reliable count)
          10000:  0 wallclock secs ( 0.18 usr + 0.00 sys = 0.18 CPU) @ 16.67/s (n=3)
                    (warning: too few iterations for a reliable count)
         100000:  0 wallclock secs ( 0.20 usr + 0.03 sys = 0.23 CPU) @ 13.04/s (n=3)
                    (warning: too few iterations for a reliable count)
        1000000:  1 wallclock secs ( 0.23 usr + 0.03 sys = 0.26 CPU) @ 11.54/s (n=3)
                    (warning: too few iterations for a reliable count)
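        vert4.pl itself is not shown in the thread; a sweep of that kind might be sketched as below. This is illustrative only: it works on an in-memory string rather than a file handle, and the line count and chunk sizes are assumptions, not the actual script's.

        ```perl
        use strict;
        use warnings;
        use Benchmark qw{ timethese };

        # Time the AND-mask column extraction at several chunk sizes
        # (expressed as lines per read) to see where throughput peaks.
        my $offset  = 2;
        my $line    = join( q{}, 'a' .. 'z', 'A' .. 'Z' ) . "\n";
        my $lineLen = length $line;
        my $data    = $line x 10_000;    # stand-in for the file contents

        my %tests;
        for my $nLines ( 1, 10, 100, 1000, 10000 ) {
            my $chunkSize = $lineLen * $nLines;
            my $mask      = "\x00" x $offset
                          . "\xff"
                          . "\x00" x ( $lineLen - $offset - 1 );
            $mask x= $nLines;
            $tests{$nLines} = sub {
                my $retStr;
                my $pos = 0;
                while ( my $buffer = substr $data, $pos, $chunkSize ) {
                    ( my $anded = $buffer & $mask ) =~ tr{\x00}{}d;
                    $retStr .= $anded;
                    $pos    += $chunkSize;
                }
                return \$retStr;
            };
        }
        timethese( 3, \%tests );
        ```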
