http://www.perlmonks.org?node_id=1228382

rverscho has asked for the wisdom of the Perl Monks concerning the following question:

I cannot figure out the following seemingly trivial regex, so I hope some wisdom can be dispensed...

I have a 4-char string, consisting of 'abc' plus a closing newline.
When matching until the end of the string, I get a different result when using '.*$' vs '.*?$': in the first case the closing newline is included, in the second, it is not. Note that /s is being used.
I am mystified how the end of the string can be interpreted differently in these regexes, but apparently it is.
What am I missing here?
When the closing char is not a newline, the results are identical, as expected.
This is in Perl v5.14.4 (no options to use a different version).

Thanks much in advance for guidance!

$s = "abc\n";
if ($s =~ /(ab.*?)$/s) {
    $p = substr($s, @-[1], (@+[1] - @-[1]));
    print("match A=[$p] length=".length($p)."\n");
}
if ($s =~ /(ab.*)$/s) {
    $q = substr($s, @-[1], (@+[1] - @-[1]));
    print("match B=[$q] length=".length($q)."\n");
}

result:
match A=[abc] length=3
match B=[abc
] length=4

Re: Regex not matching closing newline?
by haukex (Archbishop) on Jan 11, 2019 at 09:42 UTC

    $ is a zero-width assertion that matches at the end of the string, or just before the newline at the end of the string (note its behavior can be changed by the /m modifier). ? makes the preceding .* non-greedy, so that it allows $ to match just before the newline, while a normal .* is greedy, causing it to gobble up everything it can, and the $ matches just after the newline. Note that since you're using the /s modifier, the dot . can also match a newline, which it normally does not.
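The two cases from the question can be reproduced with a minimal sketch like this one. It takes the capture through list assignment instead of substr, and the names $lazy and $greedy are only illustrative:

```perl
use strict;
use warnings;

my $s = "abc\n";

# Non-greedy: .*? takes as little as possible, so $ succeeds at the
# first place it can, which is just before the trailing newline.
my ($lazy) = $s =~ /(ab.*?)$/s;

# Greedy: with /s the dot also matches the newline, so .* consumes it
# and $ then matches at the very end of the string.
my ($greedy) = $s =~ /(ab.*)$/s;

printf "lazy   length=%d\n", length $lazy;      # 3
printf "greedy length=%d\n", length $greedy;    # 4
```

Both patterns use the same $ anchor. What differs is which of the two positions accepted by $ the quantifier reaches first.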

    What is your desired behavior? If you want the match to always include the newline, I'd be specific about it: /ab.*\n/, or, if you want the newline to be optional (e.g. the last line in a file may not have a newline), then I'd probably write /ab.*(?:\z|\n)/. If you want the match to always exclude the newline, I'd leave off the /s modifier, or be specific about what you want to match by saying e.g. [^\n]* or \N*.
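As a rough comparison of those alternatives, the sketch below runs each pattern against a string with and without a trailing newline. The variable names are made up for the illustration, and capture groups were added around the suggested patterns so the matched text can be inspected:

```perl
use strict;
use warnings;

my $with    = "abc\n";
my $without = "abc";

# Newline required: only the string that ends in "\n" matches.
my ($req_a) = $with    =~ /(ab.*\n)/;
my ($req_b) = $without =~ /(ab.*\n)/;    # no match, so this stays undef

# Newline optional: accept either the absolute end or a newline.
my ($opt_a) = $with    =~ /(ab.*(?:\z|\n))/;
my ($opt_b) = $without =~ /(ab.*(?:\z|\n))/;

# Newline excluded: never let the match cross a newline.
my ($excl_a) = $with =~ /(ab[^\n]*)/;
my ($excl_b) = $with =~ /(ab\N*)/;       # \N needs Perl 5.12 or later

print "required, no newline in input: ",
    (defined $req_b ? "match" : "no match"), "\n";
printf "required: %d, optional: %d and %d, excluded: %d and %d\n",
    map { length } $req_a, $opt_a, $opt_b, $excl_a, $excl_b;
```

None of these use /s, so the dot stops at the newline and the newline is only consumed where the pattern names it explicitly.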

      Thanks. The \n must be included, so the greedy variant must be used.
      I have been using Perl for so many years, but it is a scary thought that I seem to have missed the part that says "or just before the newline at the end of the string".
      --need to urgently check some existing code now--

        \z, as proposed by haukex might be the most appropriate solution: it only matches at the absolute end of the string, so both greedy and non greedy variants do what you want.
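To see why \z removes the ambiguity, here is a small sketch that lists the offsets at which each end-of-string anchor can match in "abc\n". The %where hash is just a helper for this example, and \Z is included for comparison even though it was not mentioned above:

```perl
use strict;
use warnings;

my $s = "abc\n";    # offsets: a=0 b=1 c=2 "\n"=3, end of string=4

# Map each anchor to the offsets where it can match.
my %where;
for my $anchor ('$', '\Z', '\z') {
    my @offsets;
    # Each anchor is zero-width, so /g simply walks the positions
    # where the assertion holds; $-[0] is the offset of each match.
    push @offsets, $-[0] while $s =~ /$anchor/g;
    $where{$anchor} = "@offsets";
    print "$anchor matches at offset(s): @offsets\n";
}
# $ and \Z report offsets 3 and 4; \z reports only offset 4.
```

Since \z has a single possible position, greedy and non-greedy quantifiers in front of it have to end up in the same place.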

        $p = substr($s, @-[1], (@+[1] - @-[1])); is a strange way to write $p = $1; you know you can access captures with $X, where X is the number of the capture, right?

        Also, rather than trying to make your output more explicit with enclosing [ ] and the string length, I strongly advise you to use Data::Dump (or the core Data::Dumper, but you have to set $Data::Dumper::Useqq = 1; for the more explicit version) for debugging. It will take care of invisible or look-alike characters for you, and it works well with refs and structures.

        use Data::Dump 'pp';
        $s = "abc\n";
        if ($s =~ /(ab.*?)\z/s) { pp($1); }
        if ($s =~ /(ab.*)\z/s) { pp($1); }
        __END__
        "abc\n"
        "abc\n"
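If Data::Dump is not installed, roughly the same check can be done with the core Data::Dumper. This is only a sketch: setting Terse is my own choice to drop the "$VAR1 = " prefix, and @dumped is a helper array for the example:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # double-quoted output with escapes like \n
$Data::Dumper::Terse = 1;    # print the bare value, without "$VAR1 = "

my $s = "abc\n";
my @dumped;
for my $re (qr/(ab.*?)\z/s, qr/(ab.*)\z/s) {
    if ($s =~ $re) {
        my $captured = $1;               # copy before calling other code
        push @dumped, Dumper($captured);
        print $dumped[-1];               # prints "abc\n" with the escape visible
    }
}
```

Both patterns should print the same line, with the newline shown as a visible \n escape.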