Bug or feature? s/// and the g option

by Belgarion (Chaplain)
on Oct 14, 2007 at 15:29 UTC ( [id://644761]=perlquestion )

Belgarion has asked for the wisdom of the Perl Monks concerning the following question:

I happened across Marco d'Itri's blog post regarding the s/// operator with the "g" option. I did a few tests, and I'm completely stumped. It seems like this is a bug.
#!/usr/bin/perl
use strict;
use warnings;

my ($test, $count);

# WORKS
$test = <<'END';
ABCD
DEFG
END
$count = ($test =~ s/(BC?[DE])//gi);
print "Removed: <<$1>>\nCOUNT: $count\n";

# THIS ALSO WORKS
$test = <<'END';
ABCD
AEFG
END
$count = ($test =~ s/(BC?[DE])//gi);
print "Removed: <<$1>>\nCOUNT: $count\n";

# FAILS
$test = <<'END';
ABCD
ABFG
END
$count = ($test =~ s/(BC?[DE])//gi);
print "Removed: <<$1>>\nCOUNT: $count\n";
As soon as the two lines both begin with the same pair of characters, $1 is no longer defined. What am I missing here?

Replies are listed 'Best First'.
Re: Bug or feature? s/// and the g option
by dsheroh (Monsignor) on Oct 14, 2007 at 15:39 UTC
    I'd guess that it's undefined because, in the final example, the last partial match (B...) fails and apparently clears $1 before doing so. Reversing the order of those lines works without returning the warning:
    $test = <<'END';
    ABFG
    ABCD
    END
    Edit: I poked at it a little more and seem to have confirmed my theory. If the final B in the data string is not followed by C?[DE], $1 ends up undef, regardless of where that B is.
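The theory above can be probed with a small sketch. The helper name probe and the extra reversed-order input are illustrative assumptions, not part of the original posts. What it prints for $1 in the third case depends on the Perl version in use, so no output is shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: run the OP's substitution on a copy of the input
# and report the count plus whatever $1 holds immediately afterwards.
sub probe {
    my ($text) = @_;
    my $count = ($text =~ s/(BC?[DE])//gi);
    my $last  = $1;    # copy at once, before any other match can run
    return ($count, $last);
}

for my $input ("ABCD\nDEFG\n", "ABCD\nAEFG\n", "ABCD\nABFG\n", "ABFG\nABCD\n") {
    my ($count, $last) = probe($input);
    (my $shown = $input) =~ s/\n/\\n/g;    # make newlines visible in the report
    printf "%-14s count=%d \$1=%s\n",
        $shown, $count, defined $last ? "<<$last>>" : 'undef';
}
```

Copying $1 into a lexical straight after the substitution keeps later matches (such as the one used for display) from interfering with the result.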
      Interesting. So a failed match will clear the $1 variable. I'm still a little stumped by Marco's original example:
      $test = <<'END';
      XXWY
      XXWZ
      END
      $count = $test =~ s#(XXW?Y)##gi;
      print "REMOVED: <<$1>>\nCOUNT: $count\n";
      That fails. But remove the "i" in the substitution and $1 is defined. Why would case-insensitive make any difference?
        AIUI, without the i, before the regex engine proper is entered, there's an optimization that makes it search for an XX followed later by a Y. That optimization rejects a match, and when that happens, it bypasses the bug that's setting $1 to undef. Seems to be fixed for 5.10.0.
Re: Bug or feature? s/// and the g option
by graff (Chancellor) on Oct 14, 2007 at 16:17 UTC
    Personally, I wouldn't consider it a bug, but rather a constraint on the use of capturing parens and references to captures in the context of the "g" modifier: the "$1,$2,..." can only be used reliably in the replacement side of s///g, and cannot be counted on as defined outside the scope of that operator.
      Agreed. I can't readily think of any time that it would be particularly useful to say "do a bunch of replacements, then tell me what the last thing replaced was" - you'd normally want to either see all the replacements (by using $1 inside a loop on the regex) or none of them (by not using $1 at all).

      While the OP's code brings out an interesting quirk, I think I'd call it undefined behaviour rather than a bug or a feature.
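Reading $1 on the replacement side, as suggested above, might look like the following sketch. The input string and the @removed array are made up for illustration. It relies on the /e modifier, which evaluates the replacement as Perl code.

```perl
use strict;
use warnings;

my $test = "ABCD\nABFG\nxbe\n";    # illustrative input, not from the thread
my @removed;

# With /e the replacement is code, so $1 is read while the current match
# is still in effect. The trailing '' is the replacement value, which
# deletes the matched text.
my $count = ($test =~ s/(BC?[DE])/push @removed, $1; ''/gie);

print "Removed: @removed\n";
print "COUNT: $count\n";
```

Every removed piece is recorded as it is matched, so nothing depends on the state of $1 after the operator has finished.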

Re: Bug or feature? s/// and the g option
by ysth (Canon) on Oct 14, 2007 at 20:07 UTC
    To those who call this "undefined behavior":

    Just because it's not a very useful thing to do doesn't mean it's not a bug. $1, etc., shouldn't be affected by unsuccessful matches. This is clearly spelled out in the documentation for $& (which suffers the same bug as $1), even if the entry for $<digits> is for some reason missing the "successful" qualification.

      Especially don't call it a feature.

      Some people might try to categorize this as a hidden feature. My thought: although it is not itself documented, a hidden feature cannot violate what is documented, i.e. it must be logically consistent with the rest.

      The best way to verify whether something is a feature is obviously to ask the creators, but if that's not feasible, my logic above applies.

Re: Bug or feature? s/// and the g option
by Krambambuli (Curate) on Oct 14, 2007 at 16:57 UTC
    I'm curious what the RE gurus looking into the monastery will say. However, so far, it seems to me to be neither a bug nor a feature, but just an oddity that comes from a somewhat unfortunate [mis]use of /g.

    As I understand it, /g is not a substitute for /sm - it is sort of an iterator that lets you steadily step through the matches in a string, if you need a step-by-step, match-by-match approach. See it in conjunction with pos.

    Successive, iterated substitution - which /g seems to imply - is clearly weird: the intermediate string resulting after every substitution is different from both the initial string and the final result, and there is no intent to use it as anything other than 'something' unfinished.

    Think about something like
    my $test = 'AAAA';
    my $x1 = $1 if $test =~ s/A/AA/g;
    my $x2 = $1 if $test =~ s/A/AA/g;
    ...
    Would you expect any intermediate results?! I think not, and so I wouldn't expect anything from $1, $2, ... after an s///g, much as I don't really trust, for example, a for-loop control variable to be something I can rely on once the loop has finished. I remain curious about what others think/know about it.

      As I understand it, /g is not a substitute for /sm

      print is not a substitute for system. True, but obvious.

      In fact, not only are they not equivalent, s, m and g are orthogonal.
      s affects what . matches.
      m affects what ^ and $ match.
      g affects the number of substitutions that will be made.
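The three modifiers listed above can be exercised independently. This is a minimal sketch, with a made-up sample string and variable names.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# /s: whether '.' may match the newline between "one" and "two".
my $s_off = $text =~ /e.t/  ? 1 : 0;
my $s_on  = $text =~ /e.t/s ? 1 : 0;

# /m: whether '^' may match at the start of the second line.
my $m_off = $text =~ /^two/  ? 1 : 0;
my $m_on  = $text =~ /^two/m ? 1 : 0;

# /g: whether one or all occurrences are substituted.
(my $once = $text) =~ s/o/0/;
(my $all  = $text) =~ s/o/0/g;

print "s: $s_off -> $s_on, m: $m_off -> $m_on\n";
print "once: $once", "all:  $all";
```

Each modifier changes only its own aspect of the match, so any combination of the three can be used together.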

      /g is sort of an iterator that lets you steadily step the matches in a string

      No. You're thinking of m//g in void and scalar context. That's neither the case for m//g in list context nor for s///g.
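The distinction drawn above between the contexts can be sketched as follows. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;

my $text = 'ABCD ABED';

# m//g in scalar context: each evaluation finds the next match and
# advances pos(), so this is the iterator-like use.
my @ends;
while ($text =~ /(BC?[DE])/gi) {
    push @ends, "$1 ends at " . pos($text);
}

# m//g in list context: all captures are returned at once, no stepping.
my @all = $text =~ /(BC?[DE])/gi;

# s///g: every match is replaced within a single operation, and the
# return value is the number of substitutions made.
my $copy  = $text;
my $count = ($copy =~ s/BC?[DE]//gi);

print "$_\n" for @ends;
print "list: @all\n";
print "count: $count, result: $copy\n";
```

Only the while loop exposes one match at a time; the other two forms hand back a finished result.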

      my $x1 = $1 if $test =~ s/A/AA/g;

      Off-topic, but my $var if ... is wrong. my has a run-time effect, so it shouldn't be executed conditionally.
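Two common ways to avoid a conditional my are sketched below. The patterns and variable names are illustrative only, and a plain match with a capture group stands in for the substitution.

```perl
use strict;
use warnings;

my $test = 'AAAA';

# Declare first, then assign conditionally: the declaration always runs.
my $x1;
$x1 = $1 if $test =~ /(A+)/;

# Or fold the condition into the initializer.
my $x2 = $test =~ /(B+)/ ? $1 : undef;

print 'x1: ', (defined $x1 ? $x1 : 'undef'), "\n";
print 'x2: ', (defined $x2 ? $x2 : 'undef'), "\n";
```

In both forms the variable is created unconditionally and only its value depends on the match.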

        Thank you, ikegami - all objections gratefully accepted.

        Turning back to the original question: so _should_ $1 contain something you could normally rely upon after a successful s///g?
Re: Bug or feature? s/// and the g option
by rowdog (Curate) on Oct 15, 2007 at 08:46 UTC

    In my mind, this is a documentation bug. If you match instead of substitute, you'll get the expected behavior. Also, the substitution does, in fact, substitute exactly what you ask it to.

    I'm grateful that this question led me to spend a couple hours re-reading perlre et al but I never did find anything that explained this behavior. This could easily be an oversight on my part.

    To return to the original question, I don't think this is either a bug or a feature but, rather, a design decision: when replacing multiple times, should $1 contain the value of the last successful match or should it be undef to reflect the fact that the last match failed? I wouldn't presume to say that the wrong choice was made but it's obvious what that choice was.
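One way to act on the "match instead of substitute" observation is to collect the captures with a list-context match first and substitute afterwards. This is a sketch under that assumption, not a statement of how s///g is meant to be used. The variable names are made up.

```perl
use strict;
use warnings;

my $test = "ABCD\nABFG\n";    # the OP's failing input

# List-context m//g returns every successful capture, so the last element
# is the last thing that will be removed.
my @found = $test =~ /(BC?[DE])/gi;

# Do the removal separately, on a copy, without relying on $1 afterwards.
my $clean = $test;
my $count = ($clean =~ s/BC?[DE]//gi) || 0;

print "Removed: <<$found[-1]>>\n" if @found;
print "COUNT: $count\n";
```

The pattern is scanned twice, which costs a little time but makes the reported value independent of what happens to $1 after the final failed attempt.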
