PerlMonks  

Dynamic regex assertions, capturing groups, and parsers: joy and terror

by japhy (Canon)
on Oct 03, 2005 at 14:59 UTC (#496948, perlmeditation)

I'm pushing the limits of Perl's regexes, and I've come across an ugliness. I'm trying to write a simple parser that produces a tree structure that represents the data being parsed. (Specifically, parsing eBay search strings into a logic tree.) It appears that the "postponed regular expression" assertion, (??{ CODE }), does not play well with capturing groups. Observe:
    # prints 'j'
    "japhy" =~ m{ (.) (?{ print $1 }) }x;

    # prints nothing (undef, specifically)
    $rx = qr{ (.) (??{ print $1 }) }x;
    "japhy" =~ m{ (??{ $rx }) }x;
I know it's "experimental", but if this doesn't work now, it probably hasn't worked ever, which means nothing's been done about it, and I'm sure it's been reported as a bug before. The work-around I'm employing is shown in my code below. The code I'm showing is a proof-of-concept that $^R can be used in conjunction with (??{ ... }), although I'm sure I'm not the first person to attempt this.
    use Data::Dumper;
    $Data::Dumper::Indent = 1;
    use strict;

    sub ebay_search_logic {
      my $str = shift;
      my ($word, $neg, $alt);

      $word = qr{ (?{ save_pos() }) (\w+) (?{ push_word() }) }x;
      $neg  = qr{ - (??{ $word }) (?{ mod_neg() }) }x;
      $alt  = qr{
        \( (??{ $word }) (?{ alt1(); })
        (?: , (??{ $word }) (?{ alt2() }) )+ \)
      }x;

      return $str =~ m{
        (?{ [] })
        ^ \s*
        (?: (??{ $word }) | (??{ $neg }) | (??{ $alt }) )
        (?: \s+ (?: (??{ $word }) | (??{ $neg }) | (??{ $alt }) ) )*
        \s* $
        (?{ print Dumper($^R); $^R; })
      }x;

      return $str;
    }

    print ebay_search_logic("this that those"), "\n";    # LIKE 'this' AND LIKE 'that' AND LIKE 'those'
    print ebay_search_logic("this -that those"), "\n";   # LIKE 'this' AND (NOT LIKE 'that') AND LIKE 'those'
    print ebay_search_logic("this (that,those)"), "\n";  # LIKE 'this' AND (LIKE 'that' OR LIKE 'those')

    sub save_pos {
      my @r = @{ $^R };
      [ @r, $+[0] ];
    }

    sub push_word {
      my @r = @{ $^R };
      my $p = pop @r;
      my $w = substr($_, $p, $+[0] - $p);
      [ @r, { WORD => $w } ];
    }

    sub mod_neg {
      my @r = @{ $^R };
      my $w = pop @r;
      [ @r, { NOT => $w->{WORD} } ];
    }

    sub alt1 {
      my @r = @{ $^R };
      my $w = pop @r;
      [ @r, { ALT => [ $w->{WORD} ] } ];
    }

    sub alt2 {
      my @r = @{ $^R };
      my $w = pop @r;
      my $alt = pop @r;
      [ @r, { ALT => [ @{ $alt->{ALT} }, $w->{WORD} ] } ];
    }

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror
by blokhead (Monsignor) on Oct 03, 2005 at 17:07 UTC
    At first I thought it might have to do with the bizarre static/dynamic scoping duality of the $1, $2, ... variables, since the postponed (??{CODE}) blocks may be compiled somewhere where they can't reach the correct $1, $2, .... But they also can't seem to access @- and @+, which I don't think have the same scoping properties.

    And now I don't know what to think, because of the following example: I add an empty capturing group before the (??{CODE}) block in the outermost match, and it works (or at least seems to):

        # This is perl, v5.8.6 built for i386-linux-thread-multi
        use Data::Dumper;

        $rx = qr{ (.) (?{ print Dumper $1 }) }x;
        "japhy" =~ m{ (??{ $rx }) }x;
        ## $VAR1 = undef;

        $rx = qr{ (.) (?{ print Dumper $1 }) }x;
        "japhy" =~ m{ () (??{ $rx }) }x;
        ## $VAR1 = 'j';
    If you were to dump @- and @+ from inside the first example, you'd see that @- has two entries, but @+ has one. It's as if $1 was only partially "set up"..

    Now I don't know if this "workaround" helps you or not. It could probably allow you to write the parser closer to how you originally envisioned. But it seems like a more fragile workaround than the one you have, and I'm not sure how much I trust it. I think I remember seeing similar weirdness with capturing parens somewhere else, but I can't find the reference at the moment.

    Then there's the fact that doing this with a regex is silly, when you could write a RecDescent grammar in about 5 seconds.. but I know you know that ;)

    blokhead

Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror
by itub (Priest) on Oct 03, 2005 at 19:03 UTC
    I once wrote a parser as a huge regex, full of "experimental" features.

    I'll never do it again.

    I started getting all sorts of weird errors such as segfaults, and they varied wildly between perl versions. I could never find a simple example to reproduce the crash, so I couldn't even file a proper bug report. Also, the huge regex was unmaintainable. Now I'd rather use a parser generator. I'm happy with Parse::YAPP. It's not as trendy as Parse::RecDescent, but it's way faster in my experience.

      It might help to remember that you can't run anything using the regex engine while you're inside a (?{...}) or (??{...}) block. You'll usually get segfaults and such if you do that. The engine isn't re-entrant, and if you invoke a regex during a regex, you scribble on memory. It's supposed to have gotten better during 5.8, but I haven't tried it again.

        Thanks, that might be what was happening! I was calling subroutines from the ?{} blocks and it is very likely that some of them used regexes internally.

      A much better alternative than using lots of experimental features to write a single-regex parser is to split the matching across lots of /gc regexes. The resulting code is much easier to follow too, and you don’t need contortions to keep a grip on backtracking (my kingdom for Perl6’s commit!).

      Makeshifts last the longest.
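
The /gc approach suggested above could be sketched roughly as follows. This is a minimal illustration rather than anyone's actual code from this thread: the sub name `parse_search` is hypothetical, the node shapes (WORD, NOT, ALT) are borrowed from the original post, and the grammar is assumed to be whitespace-separated terms of the form word, -word, or (word,word,...).

```perl
use strict;
use warnings;

# Hypothetical helper: parse a search string into a list of nodes using
# several small \G-anchored /gc matches instead of one large regex.
# Returns an array ref on success, undef (in scalar context) on malformed input.
sub parse_search {
    my ($str) = @_;
    my @tree;

    pos($str) = 0;
    $str =~ /\G\s+/gc;                          # optional leading whitespace

    until (pos($str) == length($str)) {
        if (@tree) {
            # Terms after the first must be separated by whitespace.
            $str =~ /\G\s+/gc or return;
            last if pos($str) == length($str);  # trailing whitespace only
        }

        if ($str =~ /\G-(\w+)/gc) {             # negated word
            push @tree, { NOT => $1 };
        }
        elsif ($str =~ /\G\(/gc) {              # alternation group
            my @alts;
            $str =~ /\G(\w+)/gc or return;
            push @alts, $1;
            push @alts, $1 while $str =~ /\G,(\w+)/gc;
            # Require at least two alternatives and a closing paren.
            return unless @alts >= 2 && $str =~ /\G\)/gc;
            push @tree, { ALT => \@alts };
        }
        elsif ($str =~ /\G(\w+)/gc) {           # plain word
            push @tree, { WORD => $1 };
        }
        else {
            return;                             # unrecognized input
        }
    }

    return @tree ? \@tree : undef;
}

my $tree = parse_search("this -that (these,those)");
```

Because /c keeps pos() in place after a failed match, each branch can simply try the next token type at the same offset, and no code runs inside the regex engine.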

Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror
by demerphq (Chancellor) on Oct 03, 2005 at 22:32 UTC

    I find this quite confusing as well.

    Update: No, this makes perfect sense. Thanks to dio for straightening me out.

        $rx = qr{ (.) (??{ print $1 }) }x;
        print "!" if "japhy" =~ $rx;
        __END__
        japhy

    How does $1 end up being 'japhy' with this re? Interestingly, changing it to

        $rx = qr{ (.) (??{ print $1; '' }) }x;
        print "!" if "japhy" =~ $rx;
        __END__
        j!

    makes things work out properly.

    ---
    $world=~s/war/peace/g

      The result of (??{ print $1 }) is 1 because print() succeeded in writing to STDOUT. The regex that was then compiled by (??{ ...}) was "1" which then failed. So the (.) advanced over every character and printed them individually. The proper thing to do here would have been (?{ ... }) which will not affect regex matching.
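
The difference described above can be seen side by side in a small sketch. The sub names and the `@seen` package variable are hypothetical, and `push` stands in for `print` so the effect can be inspected; like `print`, it returns a true number, which (??{ ... }) then tries to match as a pattern.

```perl
use strict;
use warnings;

our @seen;   # package variable, so the code blocks do not close over a lexical

# (??{ ... }) uses the block's return value as a pattern. push returns the new
# element count (1, 2, ...), which never matches in "japhy", so the engine
# retries at every start position and the block runs once per character.
sub postponed_returning_count {
    local @seen = ();
    my $ok = "japhy" =~ m{ (.) (??{ push @seen, $1 }) }x;
    return ($ok ? 1 : 0, join('', @seen));
}

# Returning '' from (??{ ... }) gives an empty pattern, which always matches.
sub postponed_returning_empty {
    local @seen = ();
    my $ok = "japhy" =~ m{ (.) (??{ push @seen, $1; '' }) }x;
    return ($ok ? 1 : 0, join('', @seen));
}

# (?{ ... }) only runs the code; its return value is not matched as a pattern.
sub plain_code_block {
    local @seen = ();
    my $ok = "japhy" =~ m{ (.) (?{ push @seen, $1 }) }x;
    return ($ok ? 1 : 0, join('', @seen));
}

printf "%d %s\n", postponed_returning_count();   # 0 japhy
printf "%d %s\n", postponed_returning_empty();   # 1 j
printf "%d %s\n", plain_code_block();            # 1 j
```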

        Doh. Of course. I knew that print returning 1 failed the match, but I didn't put two and two together to realize that was why all of the chars were printed. And I've used this technique deliberately before too. /gah.

        Thanks for the clue-by-four. :-)

        ---
        $world=~s/war/peace/g

Re: Dynamic regex assertions, capturing groups, and parsers: joy and terror
by QM (Vicar) on Oct 03, 2005 at 21:49 UTC
    I expect you want to gripe about the behavior of (??{ CODE }), so this may be out of hand...

    Would Parse::RecDescent be useful for parsing into a tree structure for you?

    I ask because I saw an interesting talk on this last week at the Toronto Perl Mongers meeting, and thought I should give it a shot for my next parsing project.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of
