Re^2: How would you parse this? (rec-descent)

Replies are listed 'Best First'.

Re^3: How would you parse this? (oops)
by tye (Sage) on Oct 27, 2013 at 02:31 UTC

When you run $str =~ /(\w+)/, in order to make $1 available, the first thing Perl does (sometimes) is to hide away a copy of the value of $str. $1 just points to a substring of that copy. This is so that $1's value doesn't get corrupted just because you modify $str:

$str = "===test===";
if( $str =~ /(\w+)/ ) {
    print "[$str] ($1)\n";
    substr( $str, 5, 2, 'mp' );
    print "[$str] ($1)\n";
}
__END__
[===test===] (test)
[===temp===] (test)
[download]

This isn't a big deal in this case (nor in most cases). And there are cases where it is a design choice that makes things more efficient. But it can have a significant impact on performance when you have a 1GB document in a string and then pull out some small part of it with a regex, especially if you do it over and over again:

use strict;
use Benchmark qw< cmpthese >;
my $doc = ' "a string",' x 1_024_000;
cmpthese( -1, {
    copy => sub {
        return $1
            if  $doc =~ /(['"])/;
    },
    substr => sub {
        return substr( $doc, pos($doc)-1, 1 )
            if  $doc =~ /['"]/g;
    },
} );
__END__
           Rate    copy  substr
copy     38.6/s      --   -100%
substr 382480/s 991549%      --
[download]

Re^3: regexp - repeatedly delete words expressed in alternation from end of string (speed) notes that this penalty (in modern versions of Perl) never applies when you use /g on your regex. So this shouldn't matter at all for the cases we are talking about here. So thanks for prompting me to dig up the details again!

I had done some incremental changes to JSON::Tiny and some benchmarks. And I saw that removing some grouping parens did actually make JSON::Tiny a little bit faster. But I was wrong about the underlying reason.

This also means that capturing parens isn't the reason that Parse::RecDescent has been considered significantly slower than it should be. I recall having heard that Parse::RecDescent was significantly slower than it could be because it ends up copying the whole document being parsed over and over. But that memory might be wrong. Another way that parsers can waste a lot of time copying text over and over is the classic mistake of repeatedly removing matched tokens from the front of the document string:

   if( $doc =~ s/^\s*(['"])// ) {
        parse_string($1);
[download]

But I don't know whether Parse::RecDescent suffers from that particular problem much or not. And it might not suffer from any significant performance problems at all at this point.

- tye

[reply]
[d/l]
[select]


Come for the quick hacks, stay for the epiphanies.
	PerlMonks