http://www.perlmonks.org?node_id=1185090

vr has asked for the wisdom of the Perl Monks concerning the following question:

I have a 50 MB file:

perl -e "print 'x' x (50*1024*1024)" > x

Suppose I slurp it and do some matching:

use strict; use warnings; my $s = do { local ( @ARGV, $/ ) = 'x'; <> }; $s =~ /x/;
$ /usr/bin/time -f %M perl fmap.pl

Maximum resident set size reported as 53596 kbytes. Fair enough. Then I learn about File::Map and do this:

use strict; use warnings; use File::Map qw/ map_file /; map_file my $s, 'x', '<'; $s =~ /x/;

105844 kbytes. Twice as much memory consumed. Actually, I'd expect, quoting the POD:

loading the pages lazily on access. This means you only 'pay' for the parts of the file you actually use.

-- the match consumes a single byte, hence only a single "page" should be loaded, no? Not the whole file. Otherwise, what's the point of the example in the synopsis? OK, maybe I'm wrong and Perl's regex engine wants the string physically in RAM. But if the match is unsuccessful, e.g. $s =~ /y/;, then it's 54676. Looks like a copy is made on each successful match:

$s =~ /x/; $s =~ /x/; $s =~ /x/; $s =~ /x/; $s =~ /x/;

Then: 310784.

But not in a loop: $s =~ /x/ for 1 .. 5; Then it's 105848 again.
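
Here's a small sketch that reproduces the whole experiment in one run. It assumes Linux, since it reads VmRSS from /proc/self/status; the vmrss helper is mine, not part of File::Map:

use strict; use warnings;
use File::Map qw/ map_file /;

# Resident set size in kB, read from /proc/self/status (Linux only).
sub vmrss {
    open my $fh, '<', '/proc/self/status' or return 'n/a';
    while (<$fh>) { return $1 if /^VmRSS:\s*(\d+)/ }
    return 'n/a';
}

map_file my $s, 'x', '<';
print 'mapped : ', vmrss(), " kB\n";
$s =~ /x/; print 'match 1: ', vmrss(), " kB\n";
$s =~ /x/; print 'match 2: ', vmrss(), " kB\n";  # each statement adds ~50 MB
$s =~ /x/; print 'match 3: ', vmrss(), " kB\n";
$s =~ /x/ for 1 .. 3;                            # ...but a loop does not
print 'loop   : ', vmrss(), " kB\n";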

That's all rather weird. The same happens on Windows, too. (There was another issue on Windows -- it suddenly refused to map a 'merely' 1 GB file, and it appears that CreateFileMapping expects a contiguous block of virtual memory of the required size -- which may or may not be available, even during the same day. Doesn't look very usable to me. But perhaps it's not a Perl issue.)

I'm asking because at first I was enthusiastic about this patch. Now I'm not so sure.

Re: Is it File::Map issue, or another 'helpful' Perl regex optimization?
by dave_the_m (Monsignor) on Mar 18, 2017 at 09:06 UTC
    In modern perls, copy-on-write (COW) is used to make a "copy" of the string in the case of a successful match. This copy shares the same string buffer between two scalar values (but unshares them if either scalar value tries to modify its buffer). This avoids the old performance penalty that having $& etc anywhere in your script would impose upon all subsequent matches, while not crashing on something like eval '$&'.

    However, the type of string created by File::Map isn't suitable for being COWed, so perl copies the whole string instead.
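
    For illustration, a quick way to see the difference is Devel::Peek (a sketch: I'm assuming a perl of 5.20 or later, where the IsCOW flag shows up in the dump, and the mapped 'x' file from the question):

    use strict; use warnings;
    use Devel::Peek qw/ Dump /;
    use File::Map qw/ map_file /;

    my $plain = 'x' x 10;
    $plain =~ /x/;
    Dump($plain);    # FLAGS should now include IsCOW: the engine saved
                     # the match target by sharing its buffer, not copying it

    map_file my $mapped, 'x', '<';
    $mapped =~ /x/;
    Dump($mapped);   # magical, mmap-backed scalar with no IsCOW flag: the
                     # engine had to take a real copy of the whole string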

    I'll add it to my list of "things to see if we can improve in COW".

    Dave.

Re: Is it File::Map issue, or another 'helpful' Perl regex optimization? (neither)
by Anonymous Monk on Mar 18, 2017 at 00:04 UTC

      Adding advise( $s, 'sequential' ); made no difference :( -- in terms of consumed memory, I mean.

        :)

        The module seems vague on claims and evidence, but I just did some testing, and I get these numbers just loading a 51 MB file I created.

        You'll need memusage-workingset-virtualmemory.pl to run it yourself.

        So, mapping seems to signal to the OS how big the memory usage is going to get (the WVM field), and then the working set slowly increases up to the size of the file as the regular expression advances through the whole file "line" by line.
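
        Here's roughly how to watch that happen (a sketch: it assumes Linux with 4 kB pages, reads /proc/self/statm, and uses the 50 MB 'x' file from the question):

        use strict; use warnings;
        use File::Map qw/ map_file /;

        # Resident set in kB via /proc/self/statm (Linux; assumes 4 kB pages).
        sub rss_kb {
            open my $fh, '<', '/proc/self/statm' or return -1;
            my (undef, $resident) = split ' ', scalar <$fh>;
            return $resident * 4;
        }

        map_file my $s, 'x', '<';
        printf "after map: %6d kB resident\n", rss_kb();

        # Touch one byte per page; only pages touched so far become resident,
        # so the working set grows step by step, not all at once.
        my $page    = 4096;
        my $touched = 0;
        for (my $off = 0; $off < length $s; $off += $page) {
            my $byte = substr $s, $off, 1;   # fault in one page
            $touched += $page;
            printf "touched %9d bytes: %6d kB resident\n", $touched, rss_kb()
                if $touched % (10 * 1024 * 1024) == 0;
        }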

        Is this faster than something else? More memory efficient? I dunno.

        I'm beginning to suspect this is how File::Map is supposed to work.