Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Regex code block executes twice per match using look-arounds

by johngg (Canon)
on Jul 12, 2007 at 10:27 UTC ( [id://626187]=perlquestion: print w/replies, xml ) Need Help??

johngg has asked for the wisdom of the Perl Monks concerning the following question:

I've come across something a bit strange when using substitution at a point determined by look-behind and -ahead assertions. I used a regex code block to print some diagnostic text whenever a match occurred but found to my surprise that the code block seemed to be called twice for each match. Here's a small script that demonstrates the behaviour. It is just inserting a plus sign into a string at any point between pseudo-tags with different sorts of bracket pairs.

use strict; use warnings; use re q{eval}; sub doSep { print q{-} x 40, qq{\n}; } my $count = 0; my $string = q{<x1>[x2]{x3}(x4)}; my $rxClose = qr@[]>})]@; my $rxOpen = qr@[[<{(]@; my $rxBetween = qr {(?x) (?<=($rxClose)) (?=($rxOpen)) (?{print qq{Match @{ [++ $count] }: left $1, right $2\n}}) }; print qq{ Before: $string\n}; doSep(); $string =~ s{$rxBetween}{+}g; doSep(); print qq{ After: $string\n};

Even though the code block seems to execute twice per match the substitution is only done once. Here's the output, six apparent matches, weird.

Before: <x1>[x2]{x3}(x4) ---------------------------------------- Match 1: left >, right [ Match 2: left >, right [ Match 3: left ], right { Match 4: left ], right { Match 5: left }, right ( Match 6: left }, right ( ---------------------------------------- After: <x1>+[x2]+{x3}+(x4)

If I change the code so that look-arounds are not used the strange behaviour does not happen and the code block is only executed once per match.

use strict; use warnings; use re q{eval}; sub doSep { print q{-} x 40, qq{\n}; } my $count = 0; my $string = q{<x1>[x2]{x3}(x4)}; my $rxClose = qr@[]>})]@; my $rxOpen = qr@[[<{(]@; my $rxBetween = qr {(?x) ($rxClose) ($rxOpen) (?{print qq{Match @{ [++ $count] }: left $1, right $2\n}}) }; print qq{ Before: $string\n}; doSep(); $string =~ s{$rxBetween}{$1+$2}g; doSep(); print qq{ After: $string\n};

The output this time, just three matches as expected.

Before: <x1>[x2]{x3}(x4) ---------------------------------------- Match 1: left >, right [ Match 2: left ], right { Match 3: left }, right ( ---------------------------------------- After: <x1>+[x2]+{x3}+(x4)

I have done other testing to try and pin down what is happening. The code block also executes twice if I am just matching using look-arounds rather than substituting. I have also tried using just a single look-behind rather than both lokk-behind and -ahead but still get the double execution.

Can any Monk throw some light on what is happening here?

Cheers,

JohnGG

Replies are listed 'Best First'.
Re: Regex code block executes twice per match using look-arounds
by jettero (Monsignor) on Jul 12, 2007 at 10:49 UTC
    That is actually the usual behavior. The regex is usually implemented as a DFA (Finite State Machine), and occasionally they have to backtrack to complete the match. Your code is executed at the time the DFA moves through your (?{}) block even if it doesn't make it to the end which is when the DFA "matches."

    In this case, it is your lookbehind that's causing the DFA to remember something, wander forward to see if it's relevant, and wander back when it isn't... If you put your (?{}) block at the very end it would only print when it actually finds a match.

    -Paul

      Thank you for the reply. I'm obviously not understanding something as I thought my (?{...}) block was at the very end of the regular expression. I didn't think it would be encountered and triggered unless both look-behind and look-ahead succeeded. I expected the matching to go something like this

      Initialise pointer to beginning of string test look-behind, nothing preceding so fails advance pointer one place test look-behind, '{' so fails advance pointer one place test look-behind, 'x' so fails advance pointer one place test look-behind, '1' so fails advance pointer one place test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute advance pointer one place test look-behind, '[' so fails ...

      If the DFA had to look past my code block to check something else and then backtrack, the behaviour would make sense. However, I can't see how that is happening in this case.

      I'm sure it must be me failing to get my head around something fundamental. Please could you point out where my understanding is lacking.

      Cheers,

      JohnGG

        test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute advance pointer one place

        That last step is wrong. It should be

        advance pointer the length of the match

        The length of the match is zero in the case where the lookahead and lookbehind is used. Since the pointer is not advanced, the regexp matches everything twice. Only a final check prevents the regexp from returning an identical match.

        test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute same match? no, continue advance pointer the length of the match (0) test look-behind, '}' so succeeds test look-ahead, '[' also succeeds code block encountered, execute same match? yes, backtrack

        More on this in Re: Regex code block executes twice per match using look-arounds.

        Yes, well, the zero-width of the look behind and look ahead is clearly causing it to do something other than the obvious. I suspect the NFA (thanks sgt, interesting) is looking for something after the zero width thingies... Perhaps it would behave less oddly if you put a '.' before your code? Who really knows what the *FA is really doing?

        Any way you look at it, your regex is unusual since all of the expressions are zero-width, so they kinda match? And the when of code embedded in the regex isn't all that well defined I bet since the perlre page calls it experimental...

        I don't think it's changed in the last 5 years. I wonder when it'll stop being experimental.

        -Paul

      Just a minor pick: perl (ir)regular expressions are usually put in the NFA category meaning essentially some pathological (and not-so pathological) patterns can run for a long time (like double loops: '(blah.*)+'). This is a trade-off for getting more useful features.

      The Owl book (Mastering Regular Expressions) talks about 3 categories: DFA, NFA, POSIX NFA. In an alternation pattern, perl uses the first alternation to match, without checking for a possible longer match, making it non POSIX.

      There was an interesting discussion in p5p a few months ago motivated by the following article regex matching can be simple and fast but... One conclusion was that using some hybrid DFA/NFA scheme like the one used by gawk which uses GNU regex library could be nice, when you want to garantee a decent running time and are not using features that force the NFA engine to kick in. You'll get the best of all worlds with 5.10 that allows pluggable regex libraries :)

      cheers --stephan Note: the compiled form of a regex is usually quite different in both schemes, furthermore the mathematical equivalence between NFA and DFA possibly useful for simple models of regex (few operators) is not used.
Re: Regex code block executes twice per match using look-arounds
by ikegami (Patriarch) on Jul 12, 2007 at 18:21 UTC

    The code block also executes twice if I am just matching using look-arounds rather than substituting. ... Can any Monk throw some light on what is happening here?

    There's one more check after your (?{ print ... }).

    Right before declaring a match, Perl does an extra check when g is used. Perl checks if the starting position of the match (pos) and the length of the match is the same as the last loop through the regexp. If it is, the match is nulified, pos is incremented, and another match is attempted.

    Consider, the following two snippets.

    while ('abc' =~ /(.)/g) { print($1); }
    while ('abc' =~ /(?=(.))/g) { print($1); }

    In the first snippet, the regexp matches a character every pass, so pos is always incremented by one (the length of the match) every pass.

    In the second snippet, the regexp matches zero characters every pass (since (?=...) always matches zero characters when it matches), so pos is incremented by zero (the length of the match) every pass.

    Therefore, without the afformentioned check, pos will never advance beyond zero in the second snippet, printing an infinite stream of as. With the check, Perl realizes it matched pos=0 len=0 twice and takes action to correct that.

    You can see that effect by running the following snippet.

    $_ = 'abc'; while (/ ( (?=.) ) (?{ printf("pos=%d len=%d\n", pos($_), length($1)); }) /xg) { print("Match\n"); } print("\n"); while (/ ( . ) (?{ printf("pos=%d len=%d\n", pos($_), length($1)); }) /xg) { print("Match\n"); }
    pos=0 len=0 Match pos=0 len=0 # Same pos & len as last pass -> No match. pos=1 len=0 Match pos=1 len=0 # Same pos & len as last pass -> No match. pos=2 len=0 Match pos=2 len=0 # Same pos & len as last pass -> No match. pos=1 len=1 Match pos=2 len=1 Match pos=3 len=1 Match
Re: Regex code block executes twice per match using look-arounds
by mrpeabody (Friar) on Jul 12, 2007 at 18:54 UTC
    You can get some insight into how this happens in your code with use re 'debug'. I've altered the target string to {x1}[x2] for simplicity.

    After several unsuccessful attempts, the engine finds a match starting at position 4:

    Setting an EVAL scope, savestack=25 4 <{x1}> <[x2]> | 1: IFMATCH[-1] 3 <{x1> <}[x2]> | 3: OPEN1 3 <{x1> <}[x2]> | 5: ANYOF[)>\]}] 4 <{x1}> <[x2]> | 16: CLOSE1 4 <{x1}> <[x2]> | 18: SUCCEED could match... 4 <{x1}> <[x2]> | 20: IFMATCH[-0] 4 <{x1}> <[x2]> | 22: OPEN2 4 <{x1}> <[x2]> | 24: ANYOF[(<[{] 5 <{x1}[> <x2]> | 35: CLOSE2 5 <{x1}[> <x2]> | 37: SUCCEED could match... 4 <{x1}> <[x2]> | 39: EVAL re_eval 0x10033008 Match 1: left }, right [ 4 <{x1}> <[x2]> | 41: END Match successful!

    Then it starts again at position 4, and matches again but then throws it away:

    Setting an EVAL scope, savestack=37 4 <{x1}> <[x2]> | 1: IFMATCH[-1] 3 <{x1> <}[x2]> | 3: OPEN1 3 <{x1> <}[x2]> | 5: ANYOF[)>\]}] 4 <{x1}> <[x2]> | 16: CLOSE1 4 <{x1}> <[x2]> | 18: SUCCEED could match... 4 <{x1}> <[x2]> | 20: IFMATCH[-0] 4 <{x1}> <[x2]> | 22: OPEN2 4 <{x1}> <[x2]> | 24: ANYOF[(<[{] 5 <{x1}[> <x2]> | 35: CLOSE2 5 <{x1}[> <x2]> | 37: SUCCEED could match... 4 <{x1}> <[x2]> | 39: EVAL re_eval 0x10033008 Match 2: left }, right [ 4 <{x1}> <[x2]> | 41: END Match possible, but length=0 is smaller than requested=1, failing!
    The whole output looks like this:
      Thank you for your reply. As you can see from my reply to ikegami we have been thinking in parallel. I have tried to get my head around the output from re debug before but found the learning curve very steep. Today in the documentation I spotted use re 'debugcolor'; which gave me enough of a helping hand (even in black and white) to start to penetrate the mystery. I have been meaning to get the Owl Book mentioned by sgt for some time but today has shown that I have a long way to go in my understanding; buying the book is now a priority.

      Cheers,

      JohnGG

        Yeah, it's cryptic. There's some info on interpreting it in perldebguts .
Re: Regex code block executes twice per match using look-arounds
by BrowserUk (Patriarch) on Jul 12, 2007 at 20:05 UTC

    I've also been trying to interprete the output from use re 'debug';, but the message "Match possible, but length=0 is smaller than requested=1, failing!" has me stumped.

    But I did find a workaround for the problem. Adding a non-greedy 0 or 1 quantifier to the code assertion prevents it from being executed twice:

    my $rxBetween = qr/ (?<= []>})] ) (?= [[<{(] ) (?{ print "'$`'$&'$''" })?? /x;

    This simplified (to reduce the trace output) version of your regex now produces

    C:\test>junk6 2>junk6.trace Before: (x1)[x2]{x3}(x4) ---------------------------------------- '(x1)''[x2]{x3}(x4)' '(x1)[x2]''{x3}(x4)' '(x1)[x2]{x3}''(x4)' ---------------------------------------- After: (x1)+[x2]+{x3}+(x4)

    I've had a go at coming up with an explanation ,from looking at the way it changes the debug trace...but I decided that I am well up on my mental torture quota for this month, so I'll let someone else have that pleasure :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Thank you for your reply. I would never have thought of trying that. I didn't even realise you could use a quantifier on a code block. Doing that along with re debug shows that the code block doesn't execute at the first and successful match but at the second retried match when the engine realises that, although successful, it has already been here before. I wonder why the non-greediness doesn't hold the second time around.

      Exploring quantifiers with code blocks could be very interesting.

      $ perl -le ' > $str = q{abc}; > $str =~ m{(b)(?{print qq{Match $1}}){3}};' Match b Match b Match b $

      Thanks again,

      JohnGG

        I would never have thought of trying that.

        You'll note that I carefully omitted any description of the tortured, and almost certainly erroneous, logic that led me to try such a thing in the first place.

        Since 'discovering' the idea of quantifiers on zero-length assertions, I've been trying to think of a good use for it. Other than simulating a stutter, I haven't thought of anything useful yet :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Regex code block executes twice per match using look-arounds
by ikegami (Patriarch) on Jul 12, 2007 at 17:36 UTC

    A alternate solution, just for fun.

    use strict; use warnings; my $paren_grp_re = qr/ \( [^ ) ]* \) /x; my $square_grp_re = qr/ \[ [^ \] ]* \] /x; my $curly_grp_re = qr/ { [^ } ]* } /x; my $angle_grp_re = qr/ < [^ > ]* > /x; my $string = '<x1>[x2]{x3}(x4)'; print "Before: $string\n"; $string = join '+', map / $paren_grp_re | $square_grp_re | $curly_grp_re | $angle_grp_re | [^\(\[\{\<] /xg, $string; print "After: $string\n";

    Doesn't support nesting.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://626187]
Approved by Corion
Front-paged by almut
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (8)
As of 2024-04-18 15:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found