Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"

substitute with recursive regexps?

by hexcoder (Chaplain)
on Sep 01, 2011 at 21:34 UTC ( #923726=perlquestion: print w/replies, xml ) Need Help??
hexcoder has asked for the wisdom of the Perl Monks concerning the following question:

When I read about JavaFans recursive regexps, I found it immediately fascinating. Then I realized all examples use the regexp to match something. No example showed how to use it in a substitution.

I wonder if it is possible to do transformations on a string based on a named recursive capture.

For example given

my $pat = qr { (?(DEFINE) (?<number> (?: [0-9]+ )) (?<sign> (?: [-+] )) (?<expr> (?: (?&term) (?: \s* [-+] \s* (?&expr))? )) (?<term> (?: (?&factor) (?: \s* [/*] \s* (?&term))? )) (?<factor> (?: (?&number) | (?&sign) \s* (?&factor) | \( \s* (?&expr) \s* \))) ) (?&expr) }x;
how do you change all (?&expr) to something like Expression($+{expr})?

A suitable string could be '1+2*3' and the expected result would be Expression(Expression(1)+Expression(Expression(2)*Expression(3))).

Also interesting: how to substitute a subexpression like (?&factor).

Thanks! hexcoder

Edit: fixed a typo in the replacement text

Edit2: Better wording in the second question.

Replies are listed 'Best First'.
Re: substitute with recursive regexps?
by Anonymous Monk on Sep 03, 2011 at 14:22 UTC
    Perl regexes currently don't provide a reasonable way to extract or modify a complete parse tree from a recursive regex match. In other words, because the independent subpatterns in your (?(DEFINE)...) block don't keep the substrings they capture, it's extremely difficult to extract that hierarchical information and convert it to something else using substitutions involving vanilla recursive regexes. Basically, all you can do is match the regex and extract the entire match as a single string...or at best, extract just the top-level components.

    If you want to do something more sophisticated than that using regexes, you need the extra features provided by the Regexp::Grammars module. Have a look at the distribution's demo/ subdirectory for some examples of parsing and then manipulating arithmetic expressions (in particular: the various demo_calc*.pl files).

    And, yes, it's unfortunate that Perl now has the equivalent of full grammar-based matching, without any facility for extracting useful information from such matches. I know that Damian Conway and Nicholas Clark were discussing that very limitation (and what might possibly be done about it in a future version of Perl) at YAPC::EU a few weeks ago, so maybe Perl will eventually get a native implementation of that missing feature as well.

    But in the meantime, have a look at Regexp::Grammars (or Marpa or Parse::Yapp). Perl 5 regexes aren't (yet) full grammars and that seems to be what you want here. Any of those modules can give you that, and Regexp::Grammars can give it to you in (decidedly non-vanilla) regexes.

      Thanks for the clarification and the pointers! I will have a look. Regarding building a parse tree with a recursive regexp, I found an inspiring example at ppp from demerphq (which seems a tiny bit hackish though). It uses a different technique with assertions that evaluate Perl code:
      use strict; use warnings; use Data::Dumper; $^R=undef; my $re = qr/ (?{[$^R]}) # initialize a new node in the par +setree # preserving the old as the 0th +element \( # match a paren (?> # atomic match ([^()]+) # capture (?{ push @{$^R},$1; # push the capture string into the + current node $^R # propagate the current node }) | (?0) # recurse (?{ push @{$^R->[0]},$^R; # push the current node into the o +ld node shift @{$^R} # and restore the old node to be t +he current # (while removing it from the new) }) )* \) /x; "(a(b)(c(d)e)(f)g)" =~ m/$re/x and print "matched\n"; print Dumper $^R;
      which gives (with perl 5.10.0):
      matched $VAR1 = [ undef, 'a', [ 'b' ], [ 'c', [ 'd' ], 'e' ], [ 'f' ], 'g' ];
      But my current goal is not parsing, it is optimization of stringified output from parsed productions. It would have been cool to do optimizing transforms with recursive pattern substitutions. Then optimizations could be applied on the most nested level first, and propagate to higher levels.
        Yes, Regexp::Grammars basically just automates the same technique as demerphq's example uses: injection of a distributed stack-based tree-constructor via embedded code blocks.

        But you really want that automation if the regex gets much more complex than the above example. Apart from the obvious maintainabilty issues of sprinkling dozens of (?{...}) blocks through your regex, there's a much more fundamental problem: if the regex can ever backtrack a repetition, that will almost certainly screw up the synchronization between the stack and the parser state, unless you take special precautions to unwind the stack correctly (which is what the module does for you).

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://923726]
Approved by Samy_rio
Front-paged by ww
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (2)
As of 2017-08-19 06:48 GMT
Find Nodes?
    Voting Booth?
    Who is your favorite scientist and why?

    Results (310 votes). Check out past polls.