RE on lines read from in-memory scalar is very slow

by Danny (Pilgrim)
on Jan 22, 2024 at 21:06 UTC ( [id://11157166] )

Danny has asked for the wisdom of the Perl Monks concerning the following question:

If I run the code below on a file with size=35048455 and num_lines=769114 I get the following output:

0.160829 read lines from disk and do RE.
48.796606 read lines from in-memory file and do RE.

So the regular expression (RE) is taking about 300 times longer on the lines from the in-memory file. I'm assuming this is because a "line" from the in-memory file is actually some reference into the scalar, and that is somehow causing the RE to do a bunch of extra work, but I'd be interested in an actual explanation. Note that when I just loop through the lines without doing the RE, the disk and in-memory files take about the same amount of time, so it seems it is the RE that is causing the problem.

#!/usr/bin/env perl
use warnings;
use strict;
use Time::HiRes qw( time );

my $file = shift @ARGV;
my ($fh, $time);

open $fh, "<", $file;
$time = time;
while (<$fh>) { /^ ?Query/; }
printf "%f read lines from disk and do RE.\n", time - $time;

seek $fh, 0, 0;
my $s = "";
while (<$fh>) { $s .= $_; }

open $fh, "<", \$s;
$time = time;
while (<$fh>) { /^ ?Query/; }
printf "%f read lines from in-memory file and do RE.\n", time - $time;
EDIT: If you want to use my original file for consistency, you can grab it at tmp.txt

Replies are listed 'Best First'.
Re: RE on lines read from in-memory scalar is very slow
by choroba (Cardinal) on Jan 22, 2024 at 21:42 UTC
    I ran the code for a file that contains my e-mail Trash folder from 2018. Its size is 998_621_862 bytes and 14_484_494 lines.

    2.474057 read lines from disk and do RE.
    2.446446 read lines from in-memory file and do RE.

    The numbers are quite similar for both 5.26.1 and 5.39.6 (on Linux).

    How large is your machine's memory?

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      32 gigs of RAM. The file on disk is only 34M. My perl is 5.36.3 on cygwin.

      I just ran it on another system (Linux) on the same file and got:
      0.457707 read lines from disk and do RE.
      0.675398 read lines from in-memory file and do RE.
      So it seems to be specific to my system. It doesn't seem to be memory: the script only uses around 34M of RAM when running on that file, and my memory is nowhere near exhausted.

        > My perl is 5.36.3 on cygwin ... I just ran it on another system (Linux) on the same file ...
        > and got 0.675398 read lines from in-memory file and do RE (vs 48.796606 on cygwin)

        Wow, that is such a massive difference that I suspect something's seriously wrong with your cygwin perl v5.36.3 environment, so I suggest you run your test program on cygwin under various profiling tools to try to understand why on earth it's taking so long.

        Other suggestions welcome. I suspect a system-level tool (such as VTune) is the best bet in this case; hopefully it will reveal something glaringly wrong (as it did for me here).

        For future reference, posting a test program using Benchmark that anyone can run with zero effort is the ideal way to get help with these sorts of performance problems (see Fastest way to lookup a point in a set (update: and Re: Confused by RegEx count by choroba) for examples).
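        For instance, a self-contained test along these lines (a sketch only; the scratch file name and line counts are invented for illustration, not taken from the OP):

        #!/usr/bin/env perl
        # Sketch of a self-contained Benchmark test for this problem.
        use strict;
        use warnings;
        use Benchmark qw( cmpthese );

        my $file = 'bench_data.txt';    # hypothetical scratch file
        open my $ofh, '>', $file or die "open: $!";
        print {$ofh} ( $_ % 6 ? "filler line $_\n" : "Query line $_\n" )
            for 1 .. 100_000;
        close $ofh;

        open my $dfh, '<', $file or die "open: $!";
        my $mem = do { local $/; <$dfh> };    # slurp a copy for the in-memory case
        close $dfh;

        cmpthese( 5, {
            disk   => sub {
                open my $fh, '<', $file or die "open: $!";
                while (<$fh>) { /^ ?Query/ }
            },
            memory => sub {
                open my $fh, '<', \$mem or die "open: $!";
                while (<$fh>) { /^ ?Query/ }
            },
        } );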

        See also: Code Optimization and Performance References (and Memory Tools References)

        👁️🍾👍🦟
Re: RE on lines read from in-memory scalar is very slow
by kcott (Archbishop) on Jan 23, 2024 at 09:37 UTC

    G'day Danny,

    I can confirm and reproduce the types of results you've presented. I have Cygwin running on Win10 with the latest updates as of a few days ago.

    $ uname -a
    CYGWIN_NT-10.0-19045 titan 3.4.10-1.x86_64 2023-11-29 12:12 UTC x86_64 Cygwin

    I have a variety of Perl versions installed using Perlbrew:

    $ perlbrew list
    * perl-5.39.3
      perl-5.38.0
      perl-5.36.0
      perl-5.34.0
      perl-5.33.5
      perl-5.32.0
      perl-5.30.0

    I used perl-5.39.3, perl-5.36.0 and perl-5.30.0 for the tests below.

    I didn't know too much about your test data. I created these:

    • test_data
      $ perl -E 'my $test_rec = "X" x 45; open my $fh, ">", "test_data"; $fh->say($test_rec) for 1..769114;'
      $ ls -l test_data
      -rw-r--r-- 1 ken None 35379244 Jan 23 17:09 test_data
      $ wc test_data
      769114 769114 35379244 test_data
      $ head -1 test_data
      XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    • test_data_Q
      $ perl -E 'my $test_rec = "Query" . "X" x 40; open my $fh, ">", "test_data_Q"; $fh->say($test_rec) for 1..769114;'
      $ ls -l test_data_Q
      -rw-r--r-- 1 ken None 35379244 Jan 23 17:23 test_data_Q
      $ wc test_data_Q
      769114 769114 35379244 test_data_Q
      $ head -1 test_data_Q
      QueryXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    I made a verbatim copy of your posted code: "op_code.pl". Here's the test results:

    v5.39.3

    $ perl -v | head -2 | tail -1
    This is perl 5, version 39, subversion 3 (v5.39.3) built for cygwin-thread-multi
    $ ./op_code.pl test_data
    0.111368 read lines from disk and do RE.
    0.096441 read lines from in-memory file and do RE.
    $ ./op_code.pl test_data_Q
    0.356852 read lines from disk and do RE.
    398.759730 read lines from in-memory file and do RE.

    v5.36.0

    $ perl -v | head -2 | tail -1
    This is perl 5, version 36, subversion 0 (v5.36.0) built for cygwin-thread-multi
    $ ./op_code.pl test_data
    0.118364 read lines from disk and do RE.
    0.094379 read lines from in-memory file and do RE.
    $ ./op_code.pl test_data_Q
    0.362871 read lines from disk and do RE.
    399.115542 read lines from in-memory file and do RE.

    v5.30.0

    $ perl -v | head -2 | tail -1
    This is perl 5, version 30, subversion 0 (v5.30.0) built for cygwin-thread-multi
    $ ./op_code.pl test_data
    0.112352 read lines from disk and do RE.
    0.096379 read lines from in-memory file and do RE.
    $ ./op_code.pl test_data_Q
    0.327187 read lines from disk and do RE.
    398.730810 read lines from in-memory file and do RE.

    I have ~32GB memory available; at no time during the running of these tests was more than 6.5GB used.

    I see throughout this thread that neither you nor others have encountered this anomaly on other operating systems; I can't readily discern a reason for it.

    — Ken

          > I can confirm and reproduce the types of results you've presented

      awesome! thanks for checking.

Re: RE on lines read from in-memory scalar is very slow
by hv (Prior) on Jan 23, 2024 at 00:43 UTC

    Note that when I just loop through the lines without doing the RE, the disk and in-memory files take about the same amount of time, so it seems it is the RE that is causing the problem.

    That suggests to me that the data seen in the first and second loops are not the same. If you put a counter in the loops, do they both see the same number of lines?
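    A minimal sketch of that check, reusing $file and $s from the posted program:

    # Count lines seen from disk vs. from the in-memory handle.
    my $n_disk = 0;
    open my $dfh, '<', $file or die "open: $!";
    $n_disk++ while <$dfh>;

    my $n_mem = 0;
    open my $mfh, '<', \$s or die "open: $!";
    $n_mem++ while <$mfh>;

    print "disk: $n_disk lines, in-memory: $n_mem lines\n";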

    In particular if you are running this on Windows, I believe it may do some automatic conversion of CRLF line endings on read, in which case building up $s in this way may result in something that would read the whole in-memory file as a single line.

    You might try the following approach to set up the in-memory file, which would be more efficient and may be more certain to give the same content:

    seek $fh, 0, 0;
    my $s = do { local $/; <$fh>; };

    This reads the full content of the file in one go, rather than reading it line by line and then appending. (See also what's faster than .= for more detail on why building up a string piecewise can be really inefficient.)

      Or use File::Slurper. Nice clean interface, and I believe the author actually looked into what's the fastest approach.
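      A sketch (read_binary being the closest match to the OP's byte-for-byte slurp; read_text would additionally decode and can translate CRLF):

      use File::Slurper qw( read_binary );

      my $s = read_binary($file);   # whole file in one go, raw bytes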

      This code doesn't make any difference to the relative times on the Windows 11/Cygwin system. If I count the lines in the original code, they are the same for both. The efficiency of populating a scalar from a file is important in general, but not really relevant to the problem here.

      my $s = do { local $/; <$fh>; };

      A couple of additional pieces of info. I did 100k random seeks on the disk file and the memory file, and the times were about the same. When the code is running the regex loop, it appears to be CPU-bound.
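      (The seek test was roughly along these lines; a sketch, details assumed, with $fh on the disk file and $mfh on the in-memory scalar:)

      use Time::HiRes qw( time );

      my $len = -s $file;
      for my $pair ( [ disk => $fh ], [ memory => $mfh ] ) {
          my ( $name, $h ) = @$pair;
          my $t = time;
          for ( 1 .. 100_000 ) {
              seek $h, int( rand $len ), 0;   # jump to a random offset
              read $h, my $buf, 1;            # and pull a single byte
          }
          printf "%f for 100k random 1-byte reads (%s)\n", time - $t, $name;
      }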

      If this problem is only reproducible on Windows + Cygwin, then "OS confusion" like ...

      > CRLF line endings

      ... seems to be a plausible theory. (I seem to remember a similar discrepancy discussed here not too long ago... (or was it WSL?))

      Now I'd turn the test around.

      I'd try to autogenerate the string with various forms of line endings and try to see what happens.

      Those variants could also be written to disk and tested again.

      Unfortunately I have no windows at disposal right now.

      My bet is that a string generated with plain "\n" behaves normally.

      If not, we would at least have taken the filesystem out of the equation.

      And with generated text we could test whether the time consumption is linear in the size.
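      A sketch of such a test, entirely in memory (sizes invented):

      use Time::HiRes qw( time );

      for my $eol ( "\n", "\r\n" ) {
          # same content, different line endings, no filesystem involved
          my $str = join '', map { "Query line $_$eol" } 1 .. 500_000;
          open my $fh, '<', \$str or die "open: $!";
          my $t = time;
          while (<$fh>) { /^ ?Query/ }
          printf "%f for in-memory RE with %s endings\n",
              time - $t, $eol eq "\n" ? 'LF' : 'CRLF';
      }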

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery

        "My bet is that a string generated with plain "\n" behaves normal."

        I went back and did an additional check on the data I used for my earlier tests. Each record in that test ends with just LF. For a specific test, I created two tiny files: check_lf which just contains "qwerty<LF>"; and check_crlf in which I forced a CRLF ending, its contents are "qwerty<CR><LF>".

        $ for i in test_data test_data_Q check_lf check_crlf; do head -1 $i | cat -vet; done
        XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX$
        QueryXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX$
        qwerty$
        qwerty^M$

        For anyone unfamiliar with "cat -vet", a newline is shown as a "$" and a carriage return is shown as "^M".
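        For reference, the two probe files can be recreated portably along these lines (a sketch; the :raw layer keeps any I/O layer from translating the endings):

        open my $lf, '>:raw', 'check_lf' or die "open: $!";
        print {$lf} "qwerty\n";
        close $lf;

        open my $crlf, '>:raw', 'check_crlf' or die "open: $!";
        print {$crlf} "qwerty\r\n";
        close $crlf;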

        Edit: I changed several instances of NL to LF. This was for consistency with other parts of my post as well as LF being a generally recognised de facto standard (CRLF is more usual than CRNL).

        — Ken

Re: RE on lines read from in-memory scalar is very slow
by tonyc (Friar) on Feb 19, 2024 at 21:54 UTC

    Just a summary: this hasn't been fixed yet.

    There are three things contributing to this problem:

    1. when reading from a file, sv_gets(), which is used to implement readline(), preallocates the SV to the size of the file handle's buffer; for an in-memory file, the scalar itself is the buffer, so 16M in this case.
    2. on a match, the regular expression engine makes a copy of the input string, and it does so here even in cases where perl normally wouldn't make a copy-on-write copy of the input string.
    3. Cygwin is very slow in updating the final byte of the large SV buffer (the PV), which holds the CoW reference count. This may be due to the way VirtualAlloc() is used to implement malloc on Cygwin.
    Point 1 means we get that huge allocation for the SV, point 2 means we try to make it a CoW SV, and point 3 produces the horrible performance. As an incidental effect, because point 2 makes the SV CoW, if you save a copy of the SV:
    my @save;
    while (<$fh>) {
        /successful (match with captures)/;  # mark the SV as CoW
        push @save, $_;  # extend the lifetime of the SV's PV until removed from @save
    }

    the memory use of the program explodes (my machine with 32G ran out of memory testing this).
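    For anyone hitting this in the meantime, forcing a plain copy before saving breaks the CoW link, and Devel::Peek makes the over-allocation visible (a sketch, not part of the official fix):

    use Devel::Peek qw( Dump );

    open my $mfh, '<', \$s;
    my @save;
    while (<$mfh>) {
        /^ ?Query/;
        # Dump $_;            # LEN in the output shows the oversized buffer
        push @save, $_ . '';  # fresh, right-sized copy instead of a CoW alias
    }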

    I have a WIP fix for 2, but it unfortunately leaks in ASAN tests, so I haven't made a PR for it (it does prevent the performance problem).

      The ideal internal behavior I would hope for is that the underlying scalar from which $fh was created would get copy-on-write substrings made from it so that the 16M allocation only exists once and everything else is a reference to it. Modifying one of those substrings would hopefully only copy the segment into a new buffer the size of the substring.

      How many obstacles are standing in the way of that behavior?

      (I've noticed on other occasions that making one of my APIs handle-friendly by wrapping optional plain scalars with a file handle and then writing the body of the code to read from handles has introduced a performance hit vs. writing the code to split /\n/ and process the strings directly. It would be really cool if they performed equally)

        The problem with what I think you're suggesting is that readline/sv_gets() doesn't know about the SV behind a file handle created from a reference; to it, it's just another file, one where the buffer isn't the normal size.

        As to modifications of the SV the file handle refers to: making such a file handle doesn't copy the SV. Changes to the SV are reflected in the values returned by reads from the handle, and writes to the handle are (obviously) reflected when reading from the SV. (Changes to the SV have led to some bugs.)
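        A quick illustration of that aliasing (a sketch):

        my $s = "one\ntwo\n";
        open my $fh, '<', \$s or die "open: $!";
        my $first = <$fh>;           # "one\n"

        substr( $s, 0, 3, 'ONE' );   # mutate the underlying scalar
        seek $fh, 0, 0;
        my $again = <$fh>;           # "ONE\n", the handle sees the change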

        Note that in the match case, on success perl makes a CoW copy of the whole matched SV, and the magic for $1, $& etc. copies the appropriate text from that CoW SV. That CoW copy is part of the problem here.

        As to the multiple 16M allocations, that's the other part of the bug here: sv_gets() is preallocating the target SV to way too large a size.

        For your performance issue, you could post here (on perlmonks, not necessarily this thread) with the code and performance numbers for discussion, and/or make a ticket against perl itself on GitHub. The main issue I can think of when comparing readline() with split is that readline goes through more layers (pp_readline, sv_gets, PerlIO, PerlIO::scalar versus pp_split, regexp), which is going to slow things down. A test case can be investigated with a profiler (gprof, perf record, valgrind callgrind, the cygwin profiler, the Visual Studio profiler) to see if something is consuming more than the expected CPU time.
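        For the readline-vs-split comparison specifically, a Benchmark sketch might look like this (all names and sizes invented):

        use Benchmark qw( cmpthese );

        my $s = join '', map { "Query line $_\n" } 1 .. 100_000;

        cmpthese( -3, {
            readline => sub {
                open my $fh, '<', \$s or die "open: $!";
                my $n = 0;
                while (<$fh>) { $n++ if /^ ?Query/ }
            },
            split_nl => sub {
                my $n = 0;
                $n++ for grep { /^ ?Query/ } split /\n/, $s;
            },
        } );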

        "(I've noticed on other occasions that making one of my APIs handle-friendly by wrapping optional plain scalars with a file handle and then writing the body of the code to read from handles has introduced a performance hit vs. writing the code to split /\n/ and process the strings directly. It would be really cool if they performed equally)"
        Can you make a new thread explaining this? I don't understand.
Re: RE on lines read from in-memory scalar is very slow
by swl (Parson) on Jan 23, 2024 at 22:26 UTC

    Given this can be reproduced on other machines with Cygwin and MSYS2, and that there is a (smaller) performance shift for Strawberry Perl from 5.20, this is probably worth reporting to p5p.

    There was another regex-related issue posted here late last year in 11153747, which was reported and fixed under https://github.com/Perl/perl5/issues/21360. It manifested in 5.28, so a different cause is likely, but the GH issue includes some debugging steps that could be followed if people have the tools, time and inclination.

      I just reported it here.
Re: RE on lines read from in-memory scalar is very slow
by swl (Parson) on Jan 23, 2024 at 05:06 UTC

    It could be related to your system or to the cygwin build or its general environment.

    Running with Strawberry Perl 5.38 I see the in-memory version being faster. Results are similar for SP 5.36.1.

    perl 11157166.pl
    tempfile.txt is size 29 Mb
    0.133742 read lines from disk and do RE (2047 matches).
    0.079146 read lines from in-memory file and do RE (2047 matches).

    Edit: Tried with MSYS2 and results are slower for the in-memory loop.

    tempfile.txt is size 29 Mb
    0.057704 read lines from disk and do RE (2047 matches).
    0.805546 read lines from in-memory file and do RE (2047 matches).

    (End of edit)

    Modified code is below. Main change is to generate the file if needed (beware lack of overwrite safety). It also reports the number of matches.

      I just found another interesting observation. The regex I was using matches 125,277 times (16.3% of the lines) in the example file I was using. If I change the regex to /^ ?QueryXXX/ so that it matches nothing, I get:

      0.114286 read lines from disk and do RE; n=769114.
      0.104568 read lines from in-memory file and do RE; n=769114.
      This is probably more for the perl developers or cygwin distribution people, but still pretty quirky.

        Modifying my code so 16.3% of the lines will match the regex (previously it was 1%) gives me these timings for Strawberry Perl 5.38:

        tempfile.txt is size 29 Mb
        0.121952 read lines from disk and do RE (32571 matches).
        0.491674 read lines from in-memory file and do RE (32571 matches).

        So consistent with your results for Strawberry Perl.

        And for MSYS2 perl 5.38.2:

        tempfile.txt is size 29 Mb
        0.064073 read lines from disk and do RE (32571 matches).
        9.538524 read lines from in-memory file and do RE (32571 matches).

        So something would appear to be awry with the regex matching under MSYS2 and Cygwin.

      > It could be related to your system or to the cygwin build or its general environment.

      Yeah, it is definitely specific to the cygwin perl. I just installed Strawberry 5.38.0 and, running from the same cygwin terminal as the original test, it looks fairly normal:

      0.231036 read lines from disk and do RE; n=769114.
      1.459296 read lines from in-memory file and do RE; n=769114.
      It would be nice if someone else using cygwin could test it. I have the current cygwin version of perl which is 5.36.3.
Re: RE on lines read from in-memory scalar is very slow
by LanX (Saint) on Jan 22, 2024 at 21:42 UTC
    Not enough info!

    Two possible guesses

    • your real regex is worse than O(n) in the length n of the text; a single .* could force it to O(n²)
    • your "in-memory file" doesn't fit into memory and is forcing the system to swap, and FS operations are slow.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

Re: RE on lines read from in-memory scalar is very slow (OOM variant)
by swl (Parson) on Jan 24, 2024 at 01:23 UTC

    It is possible to modify the code to get an Out of Memory error on MSYS2 5.36 and Strawberry Perl 5.20 and 5.38. This does not occur with a perlbrewed 5.36 on Ubuntu via WSL, nor Strawberry Perl 5.18.

    Collating an array of lines from the in-memory file handle is sufficient. Commenting out the regex in the in-memory file handle loop makes the OOM go away, as does modifying the string before adding it to the array.

    (I also modified the code to use a separate variable for the in-memory file handle. It has no effect but is arguably cleaner.)

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use feature qw( say );    # say() is used below; needs the feature outside -E
    use Time::HiRes qw( time );
    use Devel::Peek;

    my $file = shift @ARGV;
    my ($fh, $time);
    my (@arr1, @arr2);
    my $use_dump = 0;

    if (!$file) {
        # should use Path::Tiny::tempfile
        $file = 'tempfile.txt';
        open my $ofh, '>', $file
            or die "Cannot open $file for writing, $!";
        srand(1234567);
        for my $i (0..200000) {
            my $string = 'some random text ' . rand();
            $string = $string x (1 + int (rand() * 10));
            if (rand() < 0.163) {
                $string = " Query${string}";
            }
            say {$ofh} $string;
        }
        $ofh->close or die "Cannot close $file, $!";
        printf "%s is size %i Mb\n", $file, (-s $file) / (1028**2);
    }

    open $fh, "<", $file;
    my $s = do { local $/ = undef; <$fh> };
    seek $fh, 0, 0;

    print "\n\n";

    $time = time;
    my $match_count1 = 0;
    my $i1 = 0;
    my $xx;
    while (<$fh>) {
        /^ ?Query/ && $match_count1++;
        push @arr1, $_;
        if ($use_dump and /^ Query/) {
            Dump $_;
            $i1++;
            last if $i1 > 5;
        }
    }
    printf "%f read lines from disk and do RE ($match_count1 matches).\n", time - $time;
    $fh->close;

    open my $mfh, "<", \$s;
    $time = time;
    my $match_count2 = 0;
    my $i2 = 0;
    while (<$mfh>) {
        # comment this out to avoid the OOM
        /^ ?Query/ && $match_count2++;
        #push @arr2, ($_ . "");  # avoids OOM
        push @arr2, $_;  # OOM!
        if ($use_dump and /^ Query/) {
            Dump $_;
            $i2++;
            last if $i2 > 5;
        }
    }
    printf "%f read lines from in-memory file and do RE ($match_count2 matches).\n", time - $time;
    $mfh->close;
      Very interesting. I was able to reproduce the OOM error with my cygwin perl, and the two modifications you mentioned to avoid the OOM also worked for me. Strangely, when I watch the perl process in Task Manager, or just watch total memory usage, it never varies: the process reaches about 230 MB and total memory stays at about 15.3 GB over the whole execution. It seems it might suddenly encounter a memory leak that happens so fast that Task Manager doesn't detect it before the process dies. On my system the time between the start of the loop that produces the OOM and the exception is about 22 seconds.
        Another observation: when I add a "print;" before the push @arr2, I get output that is 352602493 / 35048455 = 10.06 times larger than the input file. The first 10% seems to match the original, and then there are more lines. I'm looking at what these extra lines correspond to.