Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Re^2: How would you parse this? (rec-descent)

by smls (Friar)
on Oct 26, 2013 at 21:26 UTC ( [id://1059845]=note: print w/replies, xml ) Need Help??


in reply to Re: How would you parse this? (rec-descent)
in thread How would you parse this?

Capturing parens force the entire document to be copied (slow)

Can you explain?
Where and why does this happen?

  • Comment on Re^2: How would you parse this? (rec-descent)

Replies are listed 'Best First'.
Re^3: How would you parse this? (oops)
by tye (Sage) on Oct 27, 2013 at 02:31 UTC

    When you run $str =~ /(\w+)/, in order to make $1 available, the first thing Perl does (sometimes) is to hide away a copy of the value of $str. $1 just points to a substring of that copy. This is so that $1's value doesn't get corrupted just because you modify $str:

    $str = "===test==="; if( $str =~ /(\w+)/ ) { print "[$str] ($1)\n"; substr( $str, 5, 2, 'mp' ); print "[$str] ($1)\n"; } __END__ [===test===] (test) [===temp===] (test)

    This isn't a big deal in this case (nor in most cases). And there are cases where it is a design choice that makes things more efficient. But it can have a significant impact on performance when you have a 1GB document in a string and then pull out some small part of it with a regex, especially if you do it over and over again:

    use strict; use Benchmark qw< cmpthese >; my $doc = ' "a string",' x 1_024_000; cmpthese( -1, { copy => sub { return $1 if $doc =~ /(['"])/; }, substr => sub { return substr( $doc, pos($doc)-1, 1 ) if $doc =~ /['"]/g; }, } ); __END__ Rate copy substr copy 38.6/s -- -100% substr 382480/s 991549% --

    Re^3: regexp - repeatedly delete words expressed in alternation from end of string (speed) notes that this penalty (in modern versions of Perl) never applies when you use /g on your regex. So this shouldn't matter at all for the cases we are talking about here. So thanks for prompting me to dig up the details again!

    I had done some incremental changes to JSON::Tiny and some benchmarks. And I saw that removing some grouping parens did actually make JSON::Tiny a little bit faster. But I was wrong about the underlying reason.

    This also means that capturing parens isn't the reason that Parse::RecDescent has been considered significantly slower than it should be. I recall having heard that Parse::RecDescent was significantly slower than it could be because it ends up copying the whole document being parsed over and over. But that memory might be wrong. Another way that parsers can waste a lot of time copying text over and over is the classic mistake of repeatedly removing matched tokens from the front of the document string:

    if( $doc =~ s/^\s*(['"])// ) { parse_string($1);

    But I don't know whether Parse::RecDescent suffers from that particular problem much or not. And it might not suffer from any significant performance problems at all at this point.

    - tye        

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1059845]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others about the Monastery: (3)
As of 2024-04-24 19:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found