PerlMonks  

Re: Strange behavior of @- and @+ in perl5.10 regexps

by demerphq (Chancellor)
on Sep 13, 2009 at 12:40 UTC ( #794985=note )


in reply to Strange behavior of @- and @+ in perl5.10 regexps

This actually is not a bug. It is just a slightly counter-intuitive result of how @+/@-, (?(DEFINE) ...) and named captures/named subroutines all work. It probably could have been implemented slightly differently without any harm, but as of now the behaviour probably cannot be changed.

First, I modified a version of your code from Re^4: Strange behavior of @- and @+ in perl5.10 regexps:

use v5.10;

my $input;
local $" = ", ";

my $parser = qr{
    (?{ say "before:\n\@- = (@-)\t\t ".scalar(@-)." items\n\@+ = (@+)\t ".scalar(@+)." items\n"; })
    ^ ((?&expr)) ((?&expr)) \z
    (?{ say "after:\n\@- = (@-)\t\t ".scalar(@-)." items\n\@+ = (@+)\t ".scalar(@+)." items\n"; })

    (?(DEFINE)
        (?<expr>
            (.) (.)
            (?{ say "expr:\n\@- = (@-)\t ".scalar(@-)." items\n\@+ = (@+)\t ".scalar(@+)." items\n"; })
        )
    )
}x;

$input = "abcd";
chomp($input);
if ($input =~ $parser) {
    say "matches: ($&)";
    say "At the very end:\n\@- = (@-)\t ".scalar(@-)." items\n\@+ = (@+)\t ".scalar(@+)." items\n";
}
__END__

The pattern compiles down to the following:

Compiling REx "%n (?{%n say %"before:\n\@- = (@-)\t\t %".scalar("...
synthetic stclass "ANYOF[\0-\11\13-\377][{unicode_all}]".
Final program:
   1: EVAL (3)
   3: BOL (4)
   4: OPEN1 (6)
   6: GOSUB3[+19] (9)
   9: CLOSE1 (11)
  11: OPEN2 (13)
  13: GOSUB3[+12] (16)
  16: CLOSE2 (18)
  18: EOS (19)
  19: EVAL (21)
  21: DEFINEP (23)
  23: IFTHEN (44)
  25: OPEN3 'expr' (27)
  27: OPEN4 (29)
  29: REG_ANY (30)
  30: CLOSE4 (32)
  32: OPEN5 (34)
  34: REG_ANY (35)
  35: CLOSE5 (37)
  37: EVAL (39)
  39: CLOSE3 'expr' (44)
  41: LONGJMP (43)
  43: TAIL (44)
  44: END (0)
floating ""$ at 2..2147483647 (checking floating) stclass ANYOF[\0-\11\13-\377][{unicode_all}] minlen 2 with eval

And running the script outputs:

before:
@- = (0)                 1 items
@+ = (0, , , , , )       6 items

expr:
@- = (0, , , , 0, 1)     6 items
@+ = (2, , , , 1, 2)     6 items

expr:
@- = (0, 0, , , 2, 3)    6 items
@+ = (4, 2, , , 3, 4)    6 items

after:
@- = (0, 0, 2)           3 items
@+ = (4, 2, 4, , , )     6 items

matches: (abcd)
At the very end:
@- = (0, 0, 2)           3 items
@+ = (4, 2, 4, , , )     6 items

So, first, if you look at Perl 5.10.x perlvar under @- and @+ you will see the following documentation. I have bolded the relevant bits.

@LAST_MATCH_END
@+

This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. $+[0] is the offset into the string of the end of the entire match. This is the same value as what the pos function returns when called on the variable that was matched against. The nth element of this array holds the offset of the nth submatch, so $+[1] is the offset past where $1 ends, $+[2] the offset past where $2 ends, and so on.

You can use $#+ to determine how many subgroups were in the last successful match.

See the examples given for the "@-" variable.

@LAST_MATCH_START
@-

$-[0] is the offset of the start of the last successful match. $-[$n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.

Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with substr $_, $-[n], $+[n] - $-[n] if $-[n] is defined, and $+ coincides with substr $_, $-[$#-], $+[$#-] - $-[$#-]. One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with @+.

This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The nth element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.

After a match against some variable $var:

$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
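
The substr equivalences quoted above are easy to check for yourself. Here is a minimal sketch; the sample string and the variable names ($var, @checks, $failed) are made up for illustration and are not from perlvar.

```perl
use strict;
use warnings;

my $var = "The quick brown fox";
my @checks;

if ($var =~ /(quick) (brown)/) {
    # Each entry is true when the offset arithmetic reproduces the
    # corresponding capture, prematch or postmatch.
    push @checks,
        $1 eq substr($var, $-[1], $+[1] - $-[1]),
        $2 eq substr($var, $-[2], $+[2] - $-[2]),
        substr($var, 0, $-[0]) eq "The ",
        substr($var, $+[0])    eq " fox";
}

# grep in scalar context counts the checks that came out false.
my $failed = grep { !$_ } @checks;
print "checks run: ", scalar(@checks), ", failed: $failed\n";
```

Note that the offsets are read straight after the match, before anything else can run another successful match and overwrite them.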

Now, you may wonder, "why six elements?" It is not obvious at first: it appears that only four capture buffers are used in the pattern, so there should be five slots (the zeroth element is used to track $&). However, there are actually five capture buffers in this pattern, because one is reserved for the (?<expr> ... ) group itself. That buffer never gets set: the group lives inside (?(DEFINE) (?<expr>...)) and is only ever executed as (?&expr), which never executes the /capture/ part of (?<expr> ... ). So the 4th slot ($-[3] and $+[3]) never gets populated.
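
If you want to see the slot count without the noise of the (?{ ... }) blocks, here is a minimal sketch that uses the same capture layout as the pattern above. The variable names ($slots, $first, $second, $expr_set) are mine, purely for illustration. The values noted in the comments are what the 5.10 output above implies.

```perl
use strict;
use warnings;

# Same groups as the original pattern: 1 and 2 are the wrappers,
# 3 is (?<expr> ...), 4 and 5 are the two (.) groups inside it.
my $re = qr{
    ^ ((?&expr)) ((?&expr)) \z
    (?(DEFINE)
        (?<expr> (.) (.) )
    )
}x;

my ($slots, $first, $second, $expr_set);
if ("abcd" =~ $re) {
    $slots    = scalar @+;           # one slot per group, plus slot 0 for $&
    $first    = $1;
    $second   = $2;
    $expr_set = defined $3 ? 1 : 0;  # group 3 is only ever reached via (?&expr)

    print "slots in \@+: $slots\n";
    print "\$1 = $first, \$2 = $second\n";
    print "\$3 (expr) is ", ($expr_set ? "set" : "unset"), "\n";
}
```

The wrappers ((?&expr)) are what actually give you $1 and $2; the named group only contributes a reserved, empty slot.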

This was actually a deliberate design decision: consider that it would be awkward if /(?<foo>foo)((?&foo))/ resulted in $1 and $2 pointing at the same string. However, maybe what happens to a capture buffer defined in a DEFINE block should have been reviewed once (?(DEFINE) ...) was introduced. The development of these features was somewhat organic, with a lot of it actually just being "tricks". For instance, (?(DEFINE) ... ) isn't really special; at heart it is just an optimized alias of (?(0) ... ) (with some error checking to disallow an ELSE block), and subroutines just piggyback on named capture, so... Well, as is sometimes said of Perl core-dev, it's all a bit of a game of Jenga. :-)
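
The /(?<foo>foo)((?&foo))/ case is small enough to try directly. A minimal sketch, with the input string "foofoo" and the @starts array chosen by me for illustration: I would expect $1 to stay on the first "foo" and $2 to hold the second one, since the (?&foo) call should not overwrite the capture made by the group itself.

```perl
use strict;
use warnings;

my @starts;
if ("foofoo" =~ /(?<foo>foo)((?&foo))/) {
    # Record where each group starts; if the subroutine call had clobbered
    # group 1, both offsets would be the same.
    @starts = ($-[1], $-[2]);
    printf "\$1 = '%s' starting at offset %d\n", $1, $-[1];
    printf "\$2 = '%s' starting at offset %d\n", $2, $-[2];
}
```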

While it might be arguable that there should not be a slot reserved for a named capture buffer defined in a (?(DEFINE) ... ) block, the fact that @- and @+ are not the same size is a deliberate choice, and the behaviour you are seeing is expected, although admittedly in this context the results are a bit odd-looking.

HTH

Note: I rejected the bug report you filed on this, thanks anyway. It did raise an interesting question that I will think on.

---
$world=~s/war/peace/g


Re^2: Strange behavior of @- and @+ in perl5.10 regexps
by demerphq (Chancellor) on Sep 13, 2009 at 13:00 UTC

    Oh, just to pre-empt the question, "so why does @- have 6 elements inside of (?&expr) but not outside it", which I suspect is likely to come up.

    The answer is that the values of $4 and $5 are effectively localized to the scope of the (?&expr) "subroutine", so once the subpattern matches and does its "return" back to the previous context, they are reverted to their previous undefined value. Again, this is for good reasons: try using a subroutine in a pattern that ISN'T defined in a (?(DEFINE) ... ) and play around with it. In that case you very much don't want the use of a named capture as a subroutine to pollute its use as a named capture.
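
A stripped-down way to watch that "localization", as a sketch only: it records scalar(@-) from inside each (?&expr) call and again after the match. The names @inside and $after are mine, and I use a package variable so the (?{ ... }) block can reach it regardless of how a given perl handles lexicals in code blocks. In the 5.10 output from the parent node the counts were 6 inside and 3 afterwards.

```perl
use strict;
use warnings;

# Filled in by the code block below, once per (?&expr) call.
our @inside;

# The (?{ ... }) block runs after groups 4 and 5 have closed,
# while the match is still inside the "subroutine".
my $re = qr{
    ^ ((?&expr)) ((?&expr)) \z
    (?(DEFINE)
        (?<expr> (.) (.)
            (?{ push @main::inside, scalar(@-) })
        )
    )
}x;

my $after;
if ("abcd" =~ $re) {
    # Back in the outer context, only the groups set out here remain.
    $after = scalar(@-);
}

print "inside (?&expr): @inside\n";
print "after the match: $after\n";
```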

    ---
    $world=~s/war/peace/g
