raygun has asked for the wisdom of the Perl Monks concerning the following question:

There is either a bug in perl, or in my understanding of the /n regular-expression option. (This is perl v5.24.3, in case behavior varies among releases.)
#!/usr/bin/perl
$_ = 'abcdefg';        # define a string
/(..)(..)(..)/;        # break the string into 3 two-character chunks
print "$1 $2 $3\n";    # print the chunks
/(j)/n;                # search for a letter that's not in the string
print "$1 $2 $3\n";    # print the chunks
/(g)/n;                # search for a letter that's in the string
print "$1 $2 $3\n";    # print the chunks
Because the latter two regular expressions use the /n option, they should have no effect on the existing values of $1, $2, and $3. And in fact the first one does not. But the second one erases all three.

I grant you, the documentation for /n is a little sketchy, saying only "Non-capture mode. Don't let () fill in $1, $2, etc..." It doesn't actually specify what happens to existing values of these variables. But I would expect consistent behavior: it should either always preserve the values (and this seems the ideal — what is ever gained from overwriting them when the user has asked that they not be populated?), or always erase them. Erasing them in the case of a successful match but not a failed one is at least unexpected, if it doesn't rise to the level of outright bug.

Or, I'm fundamentally misunderstanding something about /n. What says the wisdom of the monastery?

Re: non-capture mode sometimes erases previous capture
by Corion (Pope) on May 30, 2018 at 09:16 UTC

    This has nothing to do with non-capture mode.

    The capture variables $1 and so on only hold the value of the last successful pattern match, and I guess the logic of non-capture mode only comes after the logic that resets $1 on a successful match. See perlvar for $1, where the behaviour of resetting $1 etc. is discussed.

      Thanks, that explains why it works that way, but I still have my doubts whether that's the sanest approach. It just seems more useful to preserve any existing $1, $2, etc., when the user expressly says they don't want these variables populated. Looking at it another way, I can't think of a situation where the current behavior provides a benefit (though this could be a failure of my imagination), whereas I can think of situations where preserving the values is useful. Anyone agree, or am I off my rocker here?

        If you want to keep the captured bits, assign them to variables in your code, rather than relying on them being retained in global variables:

        my( $bit1, $bit2, $bit3 ) = /(..)(..)(..)/;

        Any code you call -- library routines etc. -- after you run the regex, but before you want to use its results, might also use the regex engine and overwrite those temporary global variables. So, don't rely on it.
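A minimal sketch of the copy-first advice above, using the string from the original question. The variable name `$after` is only illustrative. It shows a later successful match in the same scope resetting `$1`, while the lexical copies are untouched:

```perl
use strict;
use warnings;

my $str = 'abcdefg';

# Copy the captures into lexicals immediately (list-context match).
my ($bit1, $bit2, $bit3) = $str =~ /(..)(..)(..)/;

# A later successful match in the same scope, with no capture groups,
# leaves $1, $2 and $3 undefined.
$str =~ /g/;
my $after = defined $1 ? $1 : 'undef';

print "$bit1 $bit2 $bit3\n";    # prints "ab cd ef"
print "\$1 is now $after\n";    # prints "$1 is now undef"
```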

        (Also, whether any of us agree with you or not, it's been that way a very long time and isn't likely to change; so best get used to the idea :) )


        Even if these special variables did work the way you wish, BrowserUk's argument would still apply.
        Bill
Re: non-capture mode sometimes erases previous capture
by haukex (Bishop) on May 30, 2018 at 12:56 UTC
    the documentation for /n is a little sketchy, saying only "Non-capture mode. Don't let () fill in $1, $2, etc..."

    Then you're looking at perlop or perlreref - the central regexp documentation, perlre, is more specific. While it does include the IMO misleading "This modifier ... will stop $1, $2, etc... from being filled in", it goes on to say

    This is equivalent to putting ?: at the beginning of every capturing group

    ... which I hope clarifies the situation. If you write /([aeiou])(.)/; /([aeiou])(?:.)/;, do you expect $2 to keep its value from the first match? (I hope not, because it doesn't :-) ) Your regexes are the equivalent of /(..)(..)(..)/; /(?:j)/; /(?:g)/;. I hope it's becoming clear that the clearing of the match variables is pretty logical when you look at it this way.
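To make the equivalence concrete, here is a small sketch comparing `/(g)/n` with `/(?:g)/` on the string from the original question. The helper `show` is made up for this illustration, and `/n` needs perl 5.22 or later:

```perl
use 5.022;      # the /n modifier was added in perl 5.22
use warnings;

my $str = 'abcdefg';

# Illustrative helper: render a list of values, spelling out undef.
sub show { return join ' ', map { defined $_ ? $_ : 'undef' } @_ }

$str =~ /(..)(..)(..)/;
my $first = show($1, $2, $3);           # ab cd ef

$str =~ /(j)/n;                         # fails, so the old captures remain
my $after_fail = show($1, $2, $3);      # ab cd ef

$str =~ /(?:g)/;                        # succeeds with no capture groups
my $after_noncap = show($1, $2, $3);    # undef undef undef

$str =~ /(..)(..)(..)/;                 # repopulate $1, $2, $3
$str =~ /(g)/n;                         # succeeds; /n makes (g) non-capturing
my $after_n = show($1, $2, $3);         # undef undef undef

print "$first\n$after_fail\n$after_noncap\n$after_n\n";
```

Both successful matches leave the capture variables undefined, and only the failed match leaves them alone.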

    The rule Corion named still applies: Only rely on the values of $1-$N and the other special regex variables immediately after the successful pattern match that set them. Although in some programs, they may retain their value for a long time if you don't run another regex in the same scope, I would still strongly recommend against using them for more than a couple of lines after the regex - it's too easy to overlook when editing the code later, and someone may insert a second regex after the first one.

    As BrowserUk already said, if you want to keep match variables, the only reliable way to do so is by making a copy.

    BTW, if you're doing complex stuff with regexes, you may want to look into named capture groups and the %+ variable (which you also have to make a copy of if you want to keep it).
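A short sketch of the named-capture suggestion, assuming the same sample string. The group names `first`, `second` and `third` and the hash `%saved` are invented for the example. Note that `%+` is reset by the next successful match just like `$1`, so it is copied right away:

```perl
use strict;
use warnings;

my $str = 'abcdefg';

# Named groups populate %+ as well as $1, $2, $3.
$str =~ /(?<first>..)(?<second>..)(?<third>..)/;

# Copy %+ into an ordinary hash before running any other regex.
my %saved = %+;

# A later successful match without named groups empties %+ ...
$str =~ /g/;
my @live = keys %+;

# ... but the copy is unaffected.
print "$saved{first} $saved{second} $saved{third}\n";   # prints "ab cd ef"
print "names still in %+: ", scalar(@live), "\n";       # prints "names still in %+: 0"
```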

      Thank you all for the responses. I'm convinced this is working as designed, not entirely convinced that the design is ideal, and fully convinced that the design won't change at this stage.

      In addition to the sentence in perlre that haukex quotes, that document also says, "Capture group contents are ... available to you outside the pattern until the end of the enclosing block or until the next successful match, whichever comes first." Lesson: read documentation more thoroughly before posting.
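The "end of the enclosing block" part of that sentence can be illustrated with a rough sketch: if the second match runs inside its own block, the outer captures come back once the block ends. The variable names `$inner` and `$outer` are made up here, and `/n` needs perl 5.22 or later:

```perl
use 5.022;      # the /n modifier was added in perl 5.22
use warnings;

my $str = 'abcdefg';
$str =~ /(..)(..)(..)/;

my $inner;
{
    # This successful match resets $1 only within the block.
    $str =~ /(g)/n;
    $inner = defined $1 ? $1 : 'undef';
}

# Back outside the block, the earlier captures are visible again.
my $outer = "$1 $2 $3";

print "inside the block: $inner\n";   # prints "inside the block: undef"
print "after the block:  $outer\n";   # prints "after the block:  ab cd ef"
```

This is not a substitute for copying the captures, but it does explain why a regex wrapped in its own block leaves the outer `$1` intact.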

      It would be nice for s/// to have an option that means "don't capture and don't alter the existing values of $1, $2, etc." And it wouldn't break much in terms of back-compatibility if /n were this option: with the current behavior being to undefine all these variables when /n is used, extant code can't be relying on them for anything after a /n substitution. (I suppose some code might rely on them being undefined, but that seems a rare edge case.) Still, at the end of the day, if it ain't broke, don't fix it. There are other ways to save $1, so this hypothetical option might be an occasional convenience but certainly isn't a necessity.

      I would still strongly recommend against using them for more than a couple of lines after the regex
      Completely agree. I encountered this issue when using the variable in the target of the substitution, in an eval that ran another simple substitution. So it wasn't really any lines after the regex that populated $1.
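A guess at the shape of that situation, not the original code: a `s///e` whose replacement runs a second substitution. The sample text `foo-bar` and every variable name are assumptions for illustration. Copying `$1` and `$2` at the top of the replacement keeps them safe from the nested regex:

```perl
use strict;
use warnings;

my $text = 'foo-bar';

(my $out = $text) =~ s{(\w+)-(\w+)}{
    # Copy the captures before any nested regex can reset them.
    my ($left, $right) = ($1, $2);

    # Nested substitution: once it succeeds, $1 and $2 are no longer
    # the values captured by the outer pattern.
    (my $changed = $left) =~ s/o/0/g;

    "$changed+$right";
}e;

print "$out\n";   # prints "f00+bar"
```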