PerlMonks  

Strange behavior of @- and @+ in perl5.10 regexps

by casiano (Pilgrim)
on Sep 11, 2009 at 10:13 UTC (#794736=perlquestion)
casiano has asked for the wisdom of the Perl Monks concerning the following question:

When using the offset arrays @+ and @- inside perl5.10 regexps I observe some strange behavior. See the following code:
    pl@nereida:~/Lperltesting$ cat ./offsetsin5_10.pl
    #!/usr/local/lib/perl/5.10.1/bin//perl5.10.1
    use v5.10;

    my $input;

    local $" = ", ";

    my $parser = qr{
        ^ ((?&expr)) ((?&expr)) \z
          (?{
              say "main:\n\@- = (@-)\t\t ".scalar(@-)." items\n\@+ = (@+)\t ".scalar(@+)." items\n";
          })

        (?(DEFINE)
            (?<expr>
                (.) (.)
                (?{
                    say "expr:\n\@- = (@-)\t ".scalar(@-)." items\n\@+ = (@+)\t ".scalar(@+)." items\n";
                })
            )
        )
    }x;

    $input = <>;
    chomp($input);
    if ($input =~ $parser) {
        say "matches: ($&)";
    }
When it is executed with the input abab, it produces this output:
    pl@nereida:~/Lperltesting$ ./offsetsin5_10.pl
    abab
    expr:
    @- = (0, , , , 0, 1)     6 items
    @+ = (2, , , , 1, 2)     6 items

    expr:
    @- = (0, 0, , , 2, 3)    6 items
    @+ = (4, 2, , , 3, 4)    6 items

    main:
    @- = (0, 0, 2)           3 items
    @+ = (4, 2, 4, , , )     6 items

    matches: (abab)
Observe how in the outer scope (main) the @- and @+ arrays have different lengths. It looks as if @+ should have length 3, but it has length 6 instead.

Am I doing something wrong or is it a bug?

Re: Strange behavior of @- and @+ in perl5.10 regexps
by casiano (Pilgrim) on Sep 11, 2009 at 10:30 UTC
    Or maybe both should have length 6?

    It looks as if the engine starts by successively substituting the right-hand sides of the definitions, so the main regexp above

    ^((?&expr))((?&expr))\z
    is translated to:

    ((.)(.))((.)(.))
    which has 6 sets of parentheses.
      A matching regexp with 6 sets of capturing parentheses leads to @- having seven elements.
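The "seven elements" claim is easy to check with the hand-expanded pattern from the post above. This is only a sketch of that check; the variable names are mine, and it says nothing yet about the (?(DEFINE) ...) version.

```perl
use strict;
use warnings;

# Six explicit capture groups, all of which take part in the match.
my $ok = "abab" =~ /^((.)(.))((.)(.))\z/;

# Copy the magic arrays right away; a later match would overwrite them.
my @starts = @-;
my @ends   = @+;

printf "matched: %s\n", $ok ? "yes" : "no";
printf "\@- has %d elements, \@+ has %d elements\n",
    scalar @starts, scalar @ends;
```

Element 0 tracks the whole match, so six groups that all match give seven elements in each array.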
Re: Strange behavior of @- and @+ in perl5.10 regexps
by JavaFan (Canon) on Sep 11, 2009 at 10:54 UTC
    I don't think @- and @+ are guaranteed to contain anything meaningful during a match, so it would be hard to say there's a bug.
      Thanks JavaFan,

      ... I don't think @- and @+ are guaranteed to contain anything meaningful during a match
      But it would be useful if they do so. They would give you the opportunity to access the attributes of previous sections inside embedded code (see ikegami's answer in the node Backreference variables in code embedded inside Perl 5.10 regexps) and mimic the Parse::RecDescent programming style.
        But it would be useful if they do so.
        Yes, but that doesn't mean the current (non)behaviour is a bug.

        Wishes are nice, but at the moment there's only just one guy doing any serious work on the regexp engine. And he's swamped already.

        Patches will always be welcome.

      Hmm, probably not documented directly and might not be tested explicitly. But certainly indirectly. We have lots of tests that $1 and friends behave "as expected" inside of (?{ ... }) and (??{ ... }) blocks. So effectively that means that @- and @+ have to as well, as they are all just ties into the same C level data structures.

      Now, at a certain level these constructs are still documented as experimental or subject to change so technically you have a point, and I appreciate that you pointed this out.

      But I personally would/do see problems with the magic variables inside of these constructs as bugs; the experimental status just says I get to change my mind if I want. :-) However, in this case things are working pretty much exactly as planned, with the possible nit as to whether (?<expr> ... ) should have a slot allocated to it that never gets used. That is mostly irritating because it is wasteful, and a little counter-intuitive, but it is actually expected behaviour.

      ---
      $world=~s/war/peace/g

Re: Strange behavior of @- and @+ in perl5.10 regexps
by demerphq (Chancellor) on Sep 13, 2009 at 12:40 UTC

    This actually is not a bug. It is just a slightly counter-intuitive result of how @+/@-, (?(DEFINE) ..) and named-captures/named-subroutines all work, and probably could have been implemented slightly differently without any harm, but as of now, the behaviour probably cannot be changed.

    First, I modified a version of your code from Re^4: Strange behavior of @- and @+ in perl5.10 regexps:

    Which outputs:

    before:
    @- = (0)                 1 items
    @+ = (0, , , , , )       6 items

    expr:
    @- = (0, , , , 0, 1)     6 items
    @+ = (2, , , , 1, 2)     6 items

    expr:
    @- = (0, 0, , , 2, 3)    6 items
    @+ = (4, 2, , , 3, 4)    6 items

    after:
    @- = (0, 0, 2)           3 items
    @+ = (4, 2, 4, , , )     6 items

    matches: (abcd)
    At the very end:
    @- = (0, 0, 2)           3 items
    @+ = (4, 2, 4, , , )     6 items

    So, first, if you look at Perl 5.10.x perlvar under @- and @+ you will see the following documentation. I have bolded the relevant bits.

    @LAST_MATCH_END
    @+

    This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. $+[0] is the offset into the string of the end of the entire match. This is the same value as what the pos function returns when called on the variable that was matched against. The nth element of this array holds the offset of the nth submatch, so $+[1] is the offset past where $1 ends, $+[2] the offset past where $2 ends, and so on.

    You can use $#+ to determine how many subgroups were in the last successful match.

    See the examples given for the "@-" variable.

    @LAST_MATCH_START
    @-

    $-[0] is the offset of the start of the last successful match. $-[$n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.

    Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with substr $_, $-[n], $+[n] - $-[n] if $-[n] is defined, and $+ coincides with substr $_, $-[$#-], $+[$#-] - $-[$#-]. One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with @+.

    This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The nth element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.

    After a match against some variable $var:

    $` is the same as substr($var, 0, $-[0])
    $& is the same as substr($var, $-[0], $+[0] - $-[0])
    $' is the same as substr($var, $+[0])
    $1 is the same as substr($var, $-[1], $+[1] - $-[1])
    $2 is the same as substr($var, $-[2], $+[2] - $-[2])
    $3 is the same as substr($var, $-[3], $+[3] - $-[3])
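The substr equivalences quoted from perlvar can be checked directly. Here is a minimal sketch; the sample string and the @pairs table are made up for illustration, and only two capture groups are used.

```perl
use strict;
use warnings;

my $var = "say key=value now";
$var =~ /(\w+)=(\w+)/ or die "no match";

# Each row: label, the magic variable, the same thing rebuilt from @- and @+.
my @pairs = (
    [ '$`',  $`, substr($var, 0, $-[0]) ],
    [ '$&',  $&, substr($var, $-[0], $+[0] - $-[0]) ],
    [ q{$'}, $', substr($var, $+[0]) ],
    [ '$1',  $1, substr($var, $-[1], $+[1] - $-[1]) ],
    [ '$2',  $2, substr($var, $-[2], $+[2] - $-[2]) ],
);

for my $p (@pairs) {
    my ($name, $magic, $manual) = @$p;
    printf "%-2s %-12s %s\n", $name, "'$magic'",
        $magic eq $manual ? "same" : "DIFFERENT";
}
```

Note that the values are copied into @pairs before anything else runs, since any later successful match would change @- and @+.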

    Now, you may wonder, "why six elements?" It is not obvious at first, as it appears there are only four capture buffers being used in the pattern, so there should be five slots used (the zeroth element is used to track $&). However, there are actually five capture buffers in this pattern, as one is reserved for the (?<expr> ... ). It never gets set, because it sits inside the (?(DEFINE) (?<expr>...)) and is only ever executed as (?&expr), which never executes the /capture/ part of the (?<expr> ... ). So the slot for $3 (the fourth element of the arrays, counting the zeroth) never gets populated.
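To see the slot numbering for yourself, here is a rough sketch using the same pattern with the embedded code removed. The variable names are mine, and which slots print as set may depend on your perl version; the runs in this thread were on 5.10.1.

```perl
use strict;
use warnings;

my $re = qr{
    ^ ((?&expr)) ((?&expr)) \z     # groups 1 and 2
    (?(DEFINE)
        (?<expr>                   # group 3, reserved for the named capture
            (.) (.)                # groups 4 and 5
        )
    )
}x;

"abab" =~ $re or die "no match";

# $#+ counts the capture groups in the pattern, matched or not.
my $groups_in_pattern = $#+;

# Record which slots hold a defined start offset after the match.
my @is_set = map { defined $-[$_] ? 1 : 0 } 0 .. $#+;

printf "groups in pattern (\$#+): %d\n", $groups_in_pattern;
printf "slot %d: %s\n", $_, $is_set[$_] ? "set" : "undef" for 0 .. $#is_set;
```

Counting opening parentheses left to right gives five capture groups, hence six slots including slot 0.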

    This was actually a deliberate design decision: consider that it would be awkward if /(?<foo>foo)((?&foo))/ resulted in $1 and $2 pointing at the same string. However, maybe what happens to a capture buffer defined in a DEFINE block should have been reviewed once (?(DEFINE) ...) was introduced. The development of these features was somewhat organic, with a lot of it actually just being "tricks". For instance, (?(DEFINE) ... ) isn't really special; at heart it is just an optimized alias of (?(0) ... ) (with some error checking to disallow an ELSE block), and subroutines just piggyback on named capture, so... Well, as is sometimes said of Perl core-dev, it's all a bit of a game of Jenga. :-)
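The /(?<foo>foo)((?&foo))/ case can be tried out like this. It is just an illustration with a made-up input string, showing that $1 keeps its own match while $2 holds what the (?&foo) call matched.

```perl
use strict;
use warnings;

"foofoo" =~ /(?<foo>foo)((?&foo))/ or die "no match";

# Copy everything before another match can disturb it.
my ($one, $two) = ($1, $2);
my @starts = @-;
my @ends   = @+;

printf "\$1 = '%s' at [%d, %d)\n", $one, $starts[1], $ends[1];
printf "\$2 = '%s' at [%d, %d)\n", $two, $starts[2], $ends[2];
```

Both groups contain "foo", but they point at different parts of the string: calling the named group as a subroutine does not overwrite its use as a capture.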

    While it might be arguable that there should not be a slot reserved for a named capture buffer defined in a (?(DEFINE) ... ) block, the fact that @- and @+ are not the same size is a deliberate choice, and the behaviour you are seeing is expected, although admittedly in this context the results are a bit odd looking.

    HTH

    Note: I rejected the bug report you filed on this, thanks anyway. It did raise an interesting question that I will think on.

    ---
    $world=~s/war/peace/g

      Oh, just to pre-empt the question, "so why does @- have 6 elements inside of (?&expr) but not outside it", which I suspect is likely to come up.

      The answer is that effectively the values of $4 and $5 are localized to the scope of the (?&expr) "subroutine", so once the subpattern matches and does its "return" back to the previous context they are reverted to their previous undefined value. Again this is for good reasons: try using a subroutine in a pattern that ISN'T defined in a (?(DEFINE) ... ) and play around with it. In that case you very much don't want the use of a named capture as a subroutine to pollute its use as a named capture.
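A small sketch of how one might observe this without printing from inside the pattern: record scalar(@-) in a (?{ ... }) block each time expr finishes, then compare with the size after the match. The names @inside and $outside are mine, and the exact sizes you see may vary with the perl version.

```perl
use strict;
use warnings;

my @inside;    # size of @- seen at the end of each (?&expr) call
my $re = qr{
    ^ ((?&expr)) ((?&expr)) \z
    (?(DEFINE)
        (?<expr> (.) (.) (?{ push @inside, scalar(@-) }) )
    )
}x;

"abab" =~ $re or die "no match";
my $outside = scalar(@-);    # size of @- once the match has returned

print "inside expr:  @inside\n";
print "after match:  $outside\n";
```

On the 5.10.1 runs shown in this thread, the size inside expr was 6 and the size after the match was 3, which is what the localization of $4 and $5 predicts.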

      ---
      $world=~s/war/peace/g

Re: Strange behavior of @- and @+ in perl5.10 regexps
by casiano (Pilgrim) on Sep 14, 2009 at 07:58 UTC
    Many thanks for your kind answers,

    It has been most helpful,

    I am absolutely convinced it is not a bug:
    I have just read the email from Abigail where he points out that the same behavior appears in previous versions of Perl!

    pl@nereida:~$ perl -v
    This is perl, v5.8.8 built for x86_64-linux-gnu-thread-multi
    Copyright 1987-2006, Larry Wall

    The Perl v5.8.8 regexp engine also produces @+ and @- of different sizes:

    pl@nereida:~$ perl -wde 0
    Loading DB routines from perl5db.pl version 1.28
    Editor support available.

    Enter h or `h h' for help, or `man perldebug' for more help.

    main::(-e:1):   0
      DB<1> "a" =~ /(a)|(b)/; print ((scalar @-)."\n"); print ((scalar @+)."\n")
    2
    3

    Many thanks for your help

Re: Strange behavior of @- and @+ in perl5.10 regexps
by casiano (Pilgrim) on Sep 14, 2009 at 09:21 UTC
    The documentation of both 5.8 and 5.10 points out this fact:

    @LAST_MATCH_END
    @+
    ...
    You can use $#+ to determine how many subgroups were in the last successful match.
    ...
    @LAST_MATCH_START
    @-
    ...
    One can use "$#-" to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression.

    Running the following program with perl v5.8.8 or v5.10.1

    pl@nereida:~/Lperltesting$ cat abigail1.pl
    #!/usr/bin/perl
    local $" = ", ";

    "a" =~ /(a)|(b)/;
    print "\@- = (@-)\t length of \@- = ".((scalar @-)."\t last - index = $#-\n");
    print "\@+ = (@+)\t length of \@+ = ".((scalar @+)."\t last + index = $#+\n")

    produces the same output:

    pl@nereida:~/Lperltesting$ perl5.10.1 ./abigail1.pl
    @- = (0, 0)      length of @- = 2        last - index = 1
    @+ = (1, 1, )    length of @+ = 3        last + index = 2
    pl@nereida:~/Lperltesting$ perl5.8.8 ./abigail1.pl
    @- = (0, 0)      length of @- = 2        last - index = 1
    @+ = (1, 1, )    length of @+ = 3        last + index = 2
    and so the behavior of the perl 5.10 regexp engine is perfectly right. The last matched subgroup is subgroup 1 but there were 2 groups in the last successful match.
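In practice this suggests looping over 1 .. $#+ and testing each slot for definedness, rather than trusting $#- as a group count. A minimal sketch along those lines (the @report array is just for illustration):

```perl
use strict;
use warnings;

my $str = "a";
$str =~ /(a)|(b)/ or die "no match";

my $last_matched = $#-;    # index of the last group that matched
my $group_count  = $#+;    # number of groups in the pattern

# Walk every group in the pattern; unmatched ones have an undef offset.
my @report;
for my $n (1 .. $group_count) {
    push @report, defined $-[$n]
        ? sprintf('$%d = "%s"', $n, substr($str, $-[$n], $+[$n] - $-[$n]))
        : sprintf('$%d did not participate', $n);
}

print "last matched group: $last_matched, groups in pattern: $group_count\n";
print "$_\n" for @report;
```

No regexp is run inside the loop, so @- and @+ stay stable while the report is built.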
