johngg has asked for the wisdom of the Perl Monks concerning the following question:
I've come across something a bit strange when using substitution at a point determined by look-behind and -ahead assertions. I used a regex code block to print some diagnostic text whenever a match occurred but found to my surprise that the code block seemed to be called twice for each match. Here's a small script that demonstrates the behaviour. It is just inserting a plus sign into a string at any point between pseudo-tags with different sorts of bracket pairs.
use strict;
use warnings;
use re q{eval};
sub doSep { print q{-} x 40, qq{\n}; }
my $count = 0;
my $string = q{<x1>[x2]{x3}(x4)};
my $rxClose = qr@[]>})]@;
my $rxOpen = qr@[[<{(]@;
my $rxBetween = qr
{(?x)
(?<=($rxClose))
(?=($rxOpen))
(?{print qq{Match @{ [++ $count] }: left $1, right $2\n}})
};
print qq{ Before: $string\n};
doSep();
$string =~ s{$rxBetween}{+}g;
doSep();
print qq{ After: $string\n};
Even though the code block seems to execute twice per match the substitution is only done once. Here's the output, six apparent matches, weird.
Before: <x1>[x2]{x3}(x4)
----------------------------------------
Match 1: left >, right [
Match 2: left >, right [
Match 3: left ], right {
Match 4: left ], right {
Match 5: left }, right (
Match 6: left }, right (
----------------------------------------
After: <x1>+[x2]+{x3}+(x4)
If I change the code so that look-arounds are not used the strange behaviour does not happen and the code block is only executed once per match.
use strict;
use warnings;
use re q{eval};
sub doSep { print q{-} x 40, qq{\n}; }
my $count = 0;
my $string = q{<x1>[x2]{x3}(x4)};
my $rxClose = qr@[]>})]@;
my $rxOpen = qr@[[<{(]@;
my $rxBetween = qr
{(?x)
($rxClose)
($rxOpen)
(?{print qq{Match @{ [++ $count] }: left $1, right $2\n}})
};
print qq{ Before: $string\n};
doSep();
$string =~ s{$rxBetween}{$1+$2}g;
doSep();
print qq{ After: $string\n};
The output this time, just three matches as expected.
Before: <x1>[x2]{x3}(x4)
----------------------------------------
Match 1: left >, right [
Match 2: left ], right {
Match 3: left }, right (
----------------------------------------
After: <x1>+[x2]+{x3}+(x4)
I have done other testing to try and pin down what is happening. The code block also executes twice if I am just matching using look-arounds rather than substituting. I have also tried using just a single look-behind rather than both lokk-behind and -ahead but still get the double execution.
Can any Monk throw some light on what is happening here? Cheers, JohnGG
Re: Regex code block executes twice per match using look-arounds
by jettero (Monsignor) on Jul 12, 2007 at 10:49 UTC
|
That is actually the usual behavior. The regex is usually implemented as a DFA (Finite State Machine), and occasionally they have to backtrack to complete the match. Your code is executed at the time the DFA moves through your (?{}) block even if it doesn't make it to the end which is when the DFA "matches."
In this case, it is your lookbehind that's causing the DFA to remember something, wander forward to see if it's relevant, and wander back when it isn't... If you put your (?{}) block at the very end it would only print when it actually finds a match.
| [reply] |
|
Thank you for the reply. I'm obviously not understanding something as I thought my (?{...}) block was at the very end of the regular expression. I didn't think it would be encountered and triggered unless both look-behind and look-ahead succeeded. I expected the matching to go something like this
Initialise pointer to beginning of string
test look-behind, nothing preceding so fails
advance pointer one place
test look-behind, '{' so fails
advance pointer one place
test look-behind, 'x' so fails
advance pointer one place
test look-behind, '1' so fails
advance pointer one place
test look-behind, '}' so succeeds
test look-ahead, '[' also succeeds
code block encountered, execute
advance pointer one place
test look-behind, '[' so fails
...
If the DFA had to look past my code block to check something else and then backtrack, the behaviour would make sense. However, I can't see how that is happening in this case.
I'm sure it must be me failing to get my head around something fundamental. Please could you point out where my understanding is lacking. Cheers, JohnGG
| [reply] [d/l] [select] |
|
advance pointer the length of the match
The length of the match is zero in the case where the lookahead and lookbehind is used. Since the pointer is not advanced, the regexp matches everything twice. Only a final check prevents the regexp from returning an identical match.
test look-behind, '}' so succeeds
test look-ahead, '[' also succeeds
code block encountered, execute
same match? no, continue
advance pointer the length of the match (0)
test look-behind, '}' so succeeds
test look-ahead, '[' also succeeds
code block encountered, execute
same match? yes, backtrack
More on this in Re: Regex code block executes twice per match using look-arounds.
| [reply] [d/l] [select] |
|
|
Yes, well, the zero-width of the look behind and look ahead is clearly causing it to do something other than the obvious. I suspect the NFA (thanks sgt, interesting) is looking for something after the zero width thingies... Perhaps it would behave less oddly if you put a '.' before your code? Who really knows what the *FA is really doing?
Any way you look at it, your regex is unusual since all of the expressions are zero-width, so they kinda match? And the when of code embedded in the regex isn't all that well defined I bet since the perlre page calls it experimental...
I don't think it's changed in the last 5 years. I wonder when it'll stop being experimental.
| [reply] |
|
Just a minor pick: perl (ir)regular expressions are usually put in the NFA category
meaning essentially some pathological (and not-so pathological) patterns can run for a long time (like double loops: '(blah.*)+'). This is a trade-off for getting more useful features.
The Owl book (Mastering Regular Expressions) talks about 3 categories: DFA, NFA, POSIX NFA. In an alternation pattern, perl uses the first alternation to match, without checking for a possible longer match, making it non POSIX.
There was an interesting discussion in p5p a few months ago motivated by the following article regex matching can be simple and fast but...
One conclusion was that using some hybrid DFA/NFA scheme like the one used by gawk which uses GNU regex library could be nice, when you want to garantee a decent running time and are not using features that force the NFA engine to kick in.
You'll get the best of all worlds with 5.10 that allows pluggable regex libraries :)
cheers
--stephan Note: the compiled form of a regex is usually quite different in both schemes, furthermore the mathematical equivalence between NFA and DFA possibly useful for simple models of regex (few operators) is not used.
| [reply] |
Re: Regex code block executes twice per match using look-arounds
by ikegami (Patriarch) on Jul 12, 2007 at 18:21 UTC
|
The code block also executes twice if I am just matching using look-arounds rather than substituting. ... Can any Monk throw some light on what is happening here?
There's one more check after your (?{ print ... }).
Right before declaring a match, Perl does an extra check when g is used. Perl checks if the starting position of the match (pos) and the length of the match is the same as the last loop through the regexp. If it is, the match is nulified, pos is incremented, and another match is attempted.
Consider, the following two snippets.
while ('abc' =~ /(.)/g) {
print($1);
}
while ('abc' =~ /(?=(.))/g) {
print($1);
}
In the first snippet, the regexp matches a character every pass, so pos is always incremented by one (the length of the match) every pass.
In the second snippet, the regexp matches zero characters every pass (since (?=...) always matches zero characters when it matches), so pos is incremented by zero (the length of the match) every pass.
Therefore, without the afformentioned check, pos will never advance beyond zero in the second snippet, printing an infinite stream of as. With the check, Perl realizes it matched pos=0 len=0 twice and takes action to correct that.
You can see that effect by running the following snippet.
$_ = 'abc';
while (/
( (?=.) )
(?{ printf("pos=%d len=%d\n", pos($_), length($1)); })
/xg) {
print("Match\n");
}
print("\n");
while (/
( . )
(?{ printf("pos=%d len=%d\n", pos($_), length($1)); })
/xg) {
print("Match\n");
}
pos=0 len=0
Match
pos=0 len=0 # Same pos & len as last pass -> No match.
pos=1 len=0
Match
pos=1 len=0 # Same pos & len as last pass -> No match.
pos=2 len=0
Match
pos=2 len=0 # Same pos & len as last pass -> No match.
pos=1 len=1
Match
pos=2 len=1
Match
pos=3 len=1
Match
| [reply] [d/l] [select] |
Re: Regex code block executes twice per match using look-arounds
by mrpeabody (Friar) on Jul 12, 2007 at 18:54 UTC
|
You can get some insight into how this happens in your code with use re 'debug'. I've altered the target string to {x1}[x2] for simplicity.
After several unsuccessful attempts, the engine finds a match starting at position 4:
Setting an EVAL scope, savestack=25
4 <{x1}> <[x2]> | 1: IFMATCH[-1]
3 <{x1> <}[x2]> | 3: OPEN1
3 <{x1> <}[x2]> | 5: ANYOF[)>\]}]
4 <{x1}> <[x2]> | 16: CLOSE1
4 <{x1}> <[x2]> | 18: SUCCEED
could match...
4 <{x1}> <[x2]> | 20: IFMATCH[-0]
4 <{x1}> <[x2]> | 22: OPEN2
4 <{x1}> <[x2]> | 24: ANYOF[(<[{]
5 <{x1}[> <x2]> | 35: CLOSE2
5 <{x1}[> <x2]> | 37: SUCCEED
could match...
4 <{x1}> <[x2]> | 39: EVAL
re_eval 0x10033008
Match 1: left }, right [
4 <{x1}> <[x2]> | 41: END
Match successful!
Then it starts again at position 4, and matches again but then throws it away:
Setting an EVAL scope, savestack=37
4 <{x1}> <[x2]> | 1: IFMATCH[-1]
3 <{x1> <}[x2]> | 3: OPEN1
3 <{x1> <}[x2]> | 5: ANYOF[)>\]}]
4 <{x1}> <[x2]> | 16: CLOSE1
4 <{x1}> <[x2]> | 18: SUCCEED
could match...
4 <{x1}> <[x2]> | 20: IFMATCH[-0]
4 <{x1}> <[x2]> | 22: OPEN2
4 <{x1}> <[x2]> | 24: ANYOF[(<[{]
5 <{x1}[> <x2]> | 35: CLOSE2
5 <{x1}[> <x2]> | 37: SUCCEED
could match...
4 <{x1}> <[x2]> | 39: EVAL
re_eval 0x10033008
Match 2: left }, right [
4 <{x1}> <[x2]> | 41: END
Match possible, but length=0 is smaller than requested=1, failing!
The whole output looks like this:
| [reply] [d/l] [select] |
|
Thank you for your reply. As you can see from my reply to ikegami we have been thinking in parallel. I have tried to get my head around the output from re debug before but found the learning curve very steep. Today in the documentation I spotted use re 'debugcolor'; which gave me enough of a helping hand (even in black and white) to start to penetrate the mystery. I have been meaning to get the Owl Book mentioned by sgt for some time but today has shown that I have a long way to go in my understanding; buying the book is now a priority.
Cheers, JohnGG
| [reply] [d/l] |
|
Yeah, it's cryptic. There's some info on interpreting it in perldebguts .
| [reply] |
Re: Regex code block executes twice per match using look-arounds
by BrowserUk (Patriarch) on Jul 12, 2007 at 20:05 UTC
|
I've also been trying to interprete the output from use re 'debug';, but the message "Match possible, but length=0 is smaller than requested=1, failing!" has me stumped.
But I did find a workaround for the problem. Adding a non-greedy 0 or 1 quantifier to the code assertion prevents it from being executed twice:
my $rxBetween = qr/
(?<= []>})] )
(?= [[<{(] )
(?{ print "'$`'$&'$''" })??
/x;
This simplified (to reduce the trace output) version of your regex now produces
C:\test>junk6 2>junk6.trace
Before: (x1)[x2]{x3}(x4)
----------------------------------------
'(x1)''[x2]{x3}(x4)'
'(x1)[x2]''{x3}(x4)'
'(x1)[x2]{x3}''(x4)'
----------------------------------------
After: (x1)+[x2]+{x3}+(x4)
I've had a go at coming up with an explanation ,from looking at the way it changes the debug trace...but I decided that I am well up on my mental torture quota for this month, so I'll let someone else have that pleasure :)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
$ perl -le '
> $str = q{abc};
> $str =~ m{(b)(?{print qq{Match $1}}){3}};'
Match b
Match b
Match b
$
Thanks again, JohnGG | [reply] [d/l] |
|
I would never have thought of trying that.
You'll note that I carefully omitted any description of the tortured, and almost certainly erroneous, logic that led me to try such a thing in the first place.
Since 'discovering' the idea of quantifiers on zero-length assertions, I've been trying to think of a good use for it. Other than simulating a stutter, I haven't thought of anything useful yet :)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Re: Regex code block executes twice per match using look-arounds
by ikegami (Patriarch) on Jul 12, 2007 at 17:36 UTC
|
A alternate solution, just for fun.
use strict;
use warnings;
my $paren_grp_re = qr/ \( [^ ) ]* \) /x;
my $square_grp_re = qr/ \[ [^ \] ]* \] /x;
my $curly_grp_re = qr/ { [^ } ]* } /x;
my $angle_grp_re = qr/ < [^ > ]* > /x;
my $string = '<x1>[x2]{x3}(x4)';
print "Before: $string\n";
$string = join '+',
map / $paren_grp_re
| $square_grp_re
| $curly_grp_re
| $angle_grp_re
| [^\(\[\{\<]
/xg,
$string;
print "After: $string\n";
Doesn't support nesting.
| [reply] [d/l] |
|
|