raygun has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks, please shine your wisdom on this doubtlessly elementary problem.

The regular expression in this code sample is designed to work on any line that begins with a tilde character, and in such a line, replace the first underscore with a plus sign.

#!/usr/bin/perl
while (<DATA>) {
    s/^ ~ .*? \K _ /+/x;
    print "$_";
}
__DATA__
A line with an_underscore.
A line with_two_underscores.
~A line with an_underscore starting with a tilde.
~A line with_two_underscores starting with a tilde.
It works exactly as expected, producing the output
A line with an_underscore.
A line with_two_underscores.
~A line with an+underscore starting with a tilde.
~A line with+two_underscores starting with a tilde.
That is, on the first tilde line, it replaces the only underscore with a plus, and on the second, it replaces only the first underscore, leaving the second one an underscore in the output.

To modify this code so that it changes all underscores in tilde lines to plus signs, the classic method is to add the /g option to the s// operator. But this doesn't work; it produces the same output as above, leaving the second underscore in the last line an underscore rather than changing it also to a plus sign as /g requests. What is going wrong?

In case the \K inside the regular expression was somehow inhibiting the /g option, I rewrote the substitution line to use a capture group instead:
s/( ^ ~ .*? ) _ /$1+/xg;
But this exhibits the same failure.

Re: /g option not making s// find all matches
by Eily (Monsignor) on May 28, 2018 at 08:56 UTC

    FYI, one possible way to do what you want is:

    while (<DATA>) {
        next unless /^~/;   # Skip line unless it starts with ~
        tr/_/+/;
    } continue {
        print;
    }

    The reason your /g version doesn't work is that /g doesn't try to match several times from the start of the string, but from the position of the last match. So after turning "~a_b_c_d" into "~a+b_c_d" it will start looking after the +, in the substring "b_c_d". Because this substring is not at the beginning of the string, ^ fails (and ~ would obviously fail as well).

    The \G anchor, meaning "from the last match", lets you express "an underscore after a ~ or any underscore after that":

    pos($_) = -1;
    s/(^~ | \G) .*? \K _ /+/gx;

    (^~|\G) can either match a ~ at the beginning of a string, or at the position of the last match. But since I have forced that position at -1, ^~ is the only possible alternative on the first try. Beyond that point, the s/// will run in a loop, where ^~ fails (because not at the beginning of the string), but \G matches the position of the last iteration.

    NB: because $_ is the default variable, pos($_) = -1 can be simplified to pos = -1

    NB2: and the reason jwkrahn's solution works is that, instead of using /g, it calls the s/// operator in a loop, meaning it starts over from the beginning of the string on each iteration (after turning "~a_b_c_d" into "~a+b_c_d", it runs again on "~a+b_c_d").
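
    The difference between one s///g call and a repeated plain s/// can be seen in a minimal sketch like the following. The sub names with_g and with_loop are made up for illustration; the two substitutions are the ones from the question and from jwkrahn's reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper names, for illustration only.

# One s///g call: after the first replacement the search resumes
# past the '+', where ^ can no longer match.
sub with_g {
    my ($line) = @_;
    $line =~ s/^ ~ .*? \K _ /+/xg;
    return $line;
}

# Re-running the plain s/// restarts from the beginning of the
# string every time, so ^ ~ matches again on each pass.
sub with_loop {
    my ($line) = @_;
    1 while $line =~ s/^ ~ .*? \K _ /+/x;
    return $line;
}

print with_g('~a_b_c_d'),    "\n";   # ~a+b_c_d
print with_loop('~a_b_c_d'), "\n";   # ~a+b+c+d
```

    The loop stops once a pass finds no underscore left after the tilde, because s/// then returns false.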

      /g doesn't try to match several times from the start of the string, but from the position of the last match.
      Thank you! This is the kernel of wisdom I was overlooking. Thank you both for the various working solutions.
      But since I have forced that position at -1

      Which only works because the input wasn't chomped, right? I wonder if someone misreads it as "which makes \G match nowhere". In fact, assigning a negative value to pos counts leftward from the end (similar to the offset parameter of substr, a negative array subscript, etc.). Surprising or not, this pos behavior isn't documented (nor is the silent clamping when the value is out of bounds).

        I wonder if someone misreads it as "which makes \G match nowhere"
        Well I miswrote it as that, I didn't know about the negative value of pos. pos = length will behave better in that case.
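
        A small sketch of the point made in this exchange, assuming the substitution from Eily's reply unchanged. The sub name fix_line and its second parameter are hypothetical; the parameter is simply the value assigned to pos before the s///g runs.

```perl
use strict;
use warnings;

# Hypothetical helper: $start is assigned to pos() before the s///g,
# so it decides where \G may match on the very first attempt.
sub fix_line {
    my ($line, $start) = @_;
    pos($line) = $start;
    $line =~ s/(^~ | \G) .*? \K _ /+/gx;
    return $line;
}

# Unchomped input: pos = -1 lands just before the newline, where no
# underscore can follow, so only ^~ can start the first match.
print fix_line("~a_b_c\n", -1);

# Chomped input ending in an underscore: pos = -1 lands just before
# that last underscore, so \G lets it match on a non-tilde line.
print fix_line('a_b_', -1), "\n";

# pos = length puts \G at the very end, where nothing can follow.
print fix_line('a_b_',  length 'a_b_'),  "\n";
print fix_line('~a_b_', length '~a_b_'), "\n";
```

        If the reasoning in the posts above holds, the second call is the only one that changes a line it should have left alone.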

Re: /g option not making s// find all matches
by jwkrahn (Monsignor) on May 28, 2018 at 08:23 UTC

    As you have figured out, the problem is not with the \K anchor; it is with the ^ (match only at beginning of line) anchor.

    You can fix that with something like this:

    while (<DATA>) {
        1 while s/^ ~ .*? \K _ /+/x;
        print "$_";
    }

    Or perhaps something like this:

    while (<DATA>) {
        s/^ ~ .*? \K ( .+ ) / ( my $x = $1 ) =~ tr-_-+-; $x /xe;
        print "$_";
    }

Re: /g option not making s// find all matches (updated)
by AnomalousMonk (Bishop) on May 28, 2018 at 17:27 UTC

    I like Eily's  next unless ...; approach++ best, but here's a variation on his or her  s///g solution using  \G that doesn't depend on hocus-pocusing pos:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use 5.010;
     ;;
     my @lines = (
       'A line with an_underscore.',
       'A line with_two_underscores.',
       '~A line with an_underscore starting with a tilde.',
       '~A line with_two_underscores starting with a tilde.',
       );
     ;;
     for my $line (@lines, @ARGV) {
       print qq{'$line'};
       $line =~ s{ (?: \G (?! \A) | \A ~) .*? \K _ }{+}xmsg;
       print qq{'$line' \n};
       }
     "  "___"  "~___"
    'A line with an_underscore.'
    'A line with an_underscore.'

    'A line with_two_underscores.'
    'A line with_two_underscores.'

    '~A line with an_underscore starting with a tilde.'
    '~A line with an+underscore starting with a tilde.'

    '~A line with_two_underscores starting with a tilde.'
    '~A line with+two+underscores starting with a tilde.'

    '___'
    '___'

    '~___'
    '~+++'

    Update 1: FFR & FWIW, here's a version in the form of a Short, Self-Contained, Correct Example, structured as described in "How to ask better questions using Test::More and sample data". Of course, the original question would have been submitted with plenty of test cases (and don't forget degenerate and simple cases!) and with the
        $input =~ s{ ... }{+}xmsg;
    statement being raygun's current, unacceptable one, or maybe just a placeholder.
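
    The example itself is not reproduced in this copy of the thread. A minimal sketch of what a Test::More-based version might look like follows; it is not AnomalousMonk's actual code. The sub name fix_underscores and the layout of @cases are assumptions, while the substitution and the input/output pairs are taken from the command-line run above.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical wrapper around the substitution under test.
sub fix_underscores {
    my ($input) = @_;
    $input =~ s{ (?: \G (?! \A) | \A ~) .*? \K _ }{+}xmsg;
    return $input;
}

# [ input, expected output ] pairs from the run shown above,
# including the degenerate '___' and '~___' cases.
my @cases = (
    [ 'A line with an_underscore.',
      'A line with an_underscore.' ],
    [ 'A line with_two_underscores.',
      'A line with_two_underscores.' ],
    [ '~A line with an_underscore starting with a tilde.',
      '~A line with an+underscore starting with a tilde.' ],
    [ '~A line with_two_underscores starting with a tilde.',
      '~A line with+two+underscores starting with a tilde.' ],
    [ '___',  '___'  ],
    [ '~___', '~+++' ],
);

for my $case (@cases) {
    my ($input, $expected) = @$case;
    is( fix_underscores($input), $expected, "'$input'" );
}

done_testing;
```

    Swapping a different candidate substitution into fix_underscores reruns it against the same cases.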


    Give a man a fish:  <%-{-{-{-<

      Thank you for the detailed answer. For readability, I too like the next unless ...; solution; however, in context, I kind of needed a single regular expression to avoid retooling the surrounding code.

      I never would have thought of the \G (?! \A) construct — I'm still not quite certain I understand how it works, but it does the trick!

      The /ms options seem unnecessary here; did you include them just to make the solution as generic as possible?

        ... the \G (?! \A) construct ... how it works ...

        This simply asserts that \G (at previous match point or at absolute start of string if first match) is true, and that \A (at absolute start of string) is not true; i.e., that \G is not matching at the start of the string. This is a little confusing in that  (?! ...) is a negative look-ahead and you may wonder how one can look ahead to the absolute start of a string. However, \G and \A are both zero-width assertions, so it doesn't matter which way you look as long as you're negating. The following all work identically:
            \G (?! \A)    \G (?<! \A)    (?! \A) \G    (?<! \A) \G
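
        One way to check that claim is a sketch along these lines, with each of the four guards written out in an otherwise identical substitution. The names @variants and run_variants are hypothetical.

```perl
use strict;
use warnings;

# Each entry applies the same substitution with one of the four
# guard spellings listed above. Names are for illustration only.
my @variants = (
    sub { my ($s) = @_; $s =~ s{ (?: \G (?! \A)  | \A ~) .*? \K _ }{+}xmsg; $s },
    sub { my ($s) = @_; $s =~ s{ (?: \G (?<! \A) | \A ~) .*? \K _ }{+}xmsg; $s },
    sub { my ($s) = @_; $s =~ s{ (?: (?! \A) \G  | \A ~) .*? \K _ }{+}xmsg; $s },
    sub { my ($s) = @_; $s =~ s{ (?: (?<! \A) \G | \A ~) .*? \K _ }{+}xmsg; $s },
);

# Returns the result of every variant for one input string.
sub run_variants {
    my ($input) = @_;
    return map { $_->($input) } @variants;
}

for my $input ('___', '~___', '~a_b_c') {
    print "'$input' => ", join(' ', map { "'$_'" } run_variants($input)), "\n";
}
```

        If the four spellings really are interchangeable, every line of output shows the same result four times.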

        The /ms options seem unnecessary here; did you include them just to make the solution as generic as possible?

        In line with the regex recommendations in TheDamian's Perl Best Practices, I always use an  /xms tail on every  qr// m// s/// expression I write. Of course, /x allows whitespace and comments; can't be bad. In addition,  ^ $ . always behave in the same way and I don't have to think about it any more; regexes are complicated enough as it is. E.g., the behavior of  . (dot) is "by default, dot matches everything except a newline unless the /s modifier is asserted, in which case dot matches everything; now let's see whether or not there's a /s around anywhere". TheDamian recommends and I prefer to always use the /s modifier and just think about the behavior of dot as "dot matches all". Period. Similarly for the  ^ $ assertions and the /m modifier. (In general, I have quite a bit of respect for TheDamian's PBPs. I don't agree with them all, but I have embraced the regex PBPs completely and wholeheartedly.)

        So in answer to your direct question, the universal  /xms tail is not used to make regexes generic so much as to make them less-thought-needed-ic.
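
        For reference, a short sketch of what /s and /m change, using a two-line string. The helper name matches is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text matches $re, else 0.
sub matches {
    my ($text, $re) = @_;
    return $text =~ $re ? 1 : 0;
}

my $text = "first\nsecond";

# Without /s, dot refuses to cross the newline; with /s it matches it.
print matches($text, qr/first.second/),  "\n";   # 0
print matches($text, qr/first.second/s), "\n";   # 1

# Without /m, ^ and $ anchor only at the ends of the whole string;
# with /m they also anchor at the embedded line boundary.
print matches($text, qr/^second/),  "\n";        # 0
print matches($text, qr/^second/m), "\n";        # 1
print matches($text, qr/first$/),   "\n";        # 0
print matches($text, qr/first$/m),  "\n";        # 1
```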



Re: /g option not making s// find all matches
by tybalt89 (Prior) on May 29, 2018 at 03:44 UTC

    Unless it's homework, this might be simpler.

    #!/usr/bin/perl

    # http://perlmonks.org/?node_id=1215296

    use strict;
    use warnings;

    while(<DATA>) {
        /^~/ and tr/_/+/;
        print;
    }

    __DATA__
    A line with an_underscore.
    A line with_two_underscores.
    ~A line with an_underscore starting with a tilde.
    ~A line with_two_underscores starting with a tilde.

Re: /g option not making s// find all matches
by raygun (Scribe) on May 31, 2018 at 19:31 UTC

    In the spirit of TIMTOWTDI, I offer another solution to my own problem.

    After using the tips above to solve this problem, I moved on to the next one and, in the process, ran across a section of code employing a solution offered to another question I had asked here three years ago. And it turns out the solution I used in that case would have solved this current issue as well. Apparently, even the small fraction of perl that I know is too large to all fit into my brain at once.

    The substitution line in my original example could become
    s/^ ~ .* /@{[ ${^MATCH} =~ tr#_#+#r ]}/xp;
    This separates the test for tilde from the substitution action (conceptually akin to tybalt89's solution and Eily's first one), yet keeps the whole thing inside a single s/// operator. And however much I admire the cleverness of \G (?! \A), its meaning is obscure even after its clockwork has been explained, whereas the above line is fairly easy to parse even if you've never encountered the @{[...]} idiom before.

    For thoroughness, I plugged this s/// into AnomalousMonk's more comprehensive test suite above, and it passed all those as well.
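
    For anyone meeting the pieces for the first time, here is a rough sketch that takes the line apart. The sub name plus_if_tilde is made up for illustration; tr///r needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use 5.014;    # tr///r needs Perl 5.14 or later

# tr///r returns a changed copy and leaves the original alone.
my $orig = 'a_b_c';
my $copy = $orig =~ tr/_/+/r;
print "$orig -> $copy\n";                 # a_b_c -> a+b+c

# @{[ ... ]} evaluates an expression inside an interpolated string.
print "upper: @{[ uc 'abc' ]}\n";         # upper: ABC

# Hypothetical wrapper around the substitution from this post.
sub plus_if_tilde {
    my ($line) = @_;
    # /p makes ${^MATCH} hold the matched text, which is the whole
    # tilde line up to (but not including) any newline.
    $line =~ s/^ ~ .* /@{[ ${^MATCH} =~ tr#_#+#r ]}/xp;
    return $line;
}

print plus_if_tilde("~with_two_underscores\n");
print plus_if_tilde("with_two_underscores\n");
```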

      ... keeps the whole thing inside a single s/// operator.

      The only thing I would say about this is that you're firing up the eval-uator behind the scenes, so
          s/^ ~ .* /@{[  ${^MATCH} =~ tr#_#+#r  ]}/xp;
      is (I think; haven't tested it) exactly equivalent to
          s/^ ~ .* / ${^MATCH} =~ tr#_#+#r /xpe;
      That's an awful lot of moving parts for what seems a fairly simple match and transformation. Getting back to Eily's first solution and in particular to tybalt89's solution, I don't see a problem with something like (also untested):

      while (<FILEHANDLE>) {
          tr/_/+/ if m{ \A ~ }xms;
          do_something_with_fixed_up_line($_);
      }

      Simple, clear, one-step fixup, do whatever you want with the line thereafter.
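
      Since both snippets above are marked untested, a quick comparison sketch may help. The three sub names are hypothetical; the substitutions are copied from this reply. Note that the tr version would also change underscores on later lines of a multi-line string, which the .* versions would not, so the comparison uses single-line inputs only.

```perl
use strict;
use warnings;
use 5.014;    # tr///r needs Perl 5.14 or later

# Hypothetical names for the three approaches under discussion.
sub via_interpolation {
    my ($s) = @_;
    $s =~ s/^ ~ .* /@{[  ${^MATCH} =~ tr#_#+#r  ]}/xp;
    return $s;
}

sub via_e_flag {
    my ($s) = @_;
    $s =~ s/^ ~ .* / ${^MATCH} =~ tr#_#+#r /xpe;
    return $s;
}

sub via_plain_tr {
    my ($s) = @_;
    $s =~ tr/_/+/ if $s =~ m{ \A ~ }xms;
    return $s;
}

for my $input ('A line with an_underscore.', '~with_two_underscores', '___', '~___') {
    my @got = map { $_->($input) } \&via_interpolation, \&via_e_flag, \&via_plain_tr;
    my $same = ($got[0] eq $got[1] && $got[1] eq $got[2]) ? 'same' : 'DIFFERENT';
    print "'$input' => '$got[0]' ($same)\n";
}
```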

      Oh, and BTW: Please don't use # as a delimiter for regex expressions; I know we're talking TimToady here, but there's no need for perversity!



        Simple, clear, one-step fixup
        Yes, and absolutely the solution I would use if my real-life problem matched this simplified example of looping through an array or input stream. My code is doing the opposite: siccing an array of regex substitutions on a single string. Thus each modification has to be containable in an s///. (I can handle special cases outside this array, but minimizing special cases is a goal.)
        Please don't use # as a delimiter for regex expressions
        Yeah, I get that it indicates a comment, but for simple, inline expressions such as that one (a whole six characters after the tr), I like it for how it visually stands out better than most characters within a typical regular expression, making it easy to find the boundaries of each element. In code I write, all comments are set off with plenty of whitespace, so the eye won't be tricked into thinking a comment is lurking in the middle of what otherwise looks like a line of code. ("@" also stands out in many terminal fonts, but that would be true perversity.)
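
        The real surrounding code isn't shown in this thread, so the following is only a guess at its general shape: a list of substitutions applied in turn to one string. The names @rules and apply_rules are hypothetical, as is the second rule (stripping trailing whitespace), which is there only to show more than one entry.

```perl
use strict;
use warnings;

# Hypothetical rule list: each entry edits its argument in place
# with a single s/// operator.
my @rules = (
    # Underscores to plus signs, but only in strings starting with ~.
    sub { $_[0] =~ s{ (?: \G (?! \A) | \A ~) .*? \K _ }{+}xmsg },
    # Made-up second rule: strip trailing whitespace.
    sub { $_[0] =~ s{ \s+ \z }{}xms },
);

# Runs every rule, in order, against a copy of the input string.
sub apply_rules {
    my ($string) = @_;
    $_->($string) for @rules;
    return $string;
}

print "'", apply_rules('~a_b_c  '), "'\n";   # '~a+b+c'
print "'", apply_rules('a_b '),     "'\n";   # 'a_b'
```

        With this layout, a rule that fits in one s/// needs no special handling outside the list.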