Strange regex to test for newlines: /.*\z/

by betterworld (Deacon)
on May 21, 2007 at 12:25 UTC (#616538=perlquestion)
betterworld has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

apparently, /.*\z/ tests whether a string ends in a newline:

$ perl -e 'print "there is no newline\n" if "foo\n" =~ /.*\z/'
$ perl -e 'print "there is no newline\n" if "foo" =~ /.*\z/'
there is no newline

I find this very strange. In my opinion, this regular expression should always match. Every string, even "foo\n", has an end (\z), and since .* matches zero or more horizontal characters, it should always match the empty string before the end of the string. However, it doesn't, as you can see in the first one-liner.

I found this RE in the code of Log::Handler by bloonix and I'm wondering why it works.

I've tested it in perl versions 5.8.8, 5.005_03 and 5.9.4.

Update: This perl bug is now fixed (see below). Thanks to demerphq.
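
For reference, a minimal sketch of the expectation described above: it reports where /.*\z/ matches. The helper name match_span is made up for illustration. On a perl that includes the fix mentioned in the update (5.10 or later, going by the replies below), the expected result for "foo\n" is an empty match at the very end of the string; on the affected versions listed above, that call should come back empty instead.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the (start, end) offsets of the /.*\z/ match,
# or an empty list if the pattern does not match at all.
sub match_span {
    my ($str) = @_;
    return $str =~ /.*\z/ ? ($-[0], $+[0]) : ();
}

for my $str ("foo\n", "foo") {
    my @span = match_span($str);
    (my $shown = $str) =~ s/\n/\\n/g;    # make the newline visible
    print qq{"$shown": }, (@span ? "match at $span[0]..$span[1]" : "no match"), "\n";
}
```

@- and @+ hold the start and end offsets of the last successful match, so an empty match shows up as two equal offsets.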

Re: Strange regex to test for newlines: /.*\z/
by Joost (Canon) on May 21, 2007 at 12:28 UTC
      It works because . does not match newlines by default.

      That's what I meant by "horizontal characters". But as "." has an asterisk after it, the regex should also match if "." does not match, shouldn't it?

        It would work if you used $ instead. \z specifically asks for the end of the string and . won't match the \ns without a /s on the end. So .* and \z aren't adjacent in your first example — without the /s.

        It probably would work with "\n" because .* could match 0 chars. But in your example, it already matched several chars before the \n which it refuses to slurp, so it (meaning the regex engine) can't get to the \z at the end.

        Of course, the more I think about it... you might argue that the .* should backtrack, skip over the \n and continue to match the 0 chars right before the \z. I believe \z has a special meaning though. It's not really matching characters. It's more of a border on things, like \b, so I suspect there really aren't 0 characters before it because it's not really there.

        UPDATE: I'm completely convinced this is a bug based on "id-616551" below.

        -Paul

      It works because . does not match newlines by default.
      i think betterworld knows this =)
      the point is, should .* match the empty string between foo\n and the end?

      update:

      "\n" =~ /\n.*\z/;     # matches
      "\n" =~ /.*\z/;       # doesn't match. huh?
      "\n" =~ /[^\n]*\z/;   # matches. ??
Re: Strange regex to test for newlines: /.*\z/
by bloonix (Scribe) on May 21, 2007 at 12:45 UTC
    That's correct, because .* matches everything but the newline and \z searches for the end of the string, but the newline is still there.
    http://perldoc.perl.org/perlre.html
    
    "To match the actual end of the string and not ignore an optional trailing newline, use \z ."
    @tinita, @betterworld:
    Thanks to you because I saw that /m is not necessary in my code!
Re: Strange regex to test for newlines: /.*\z/
by Anonymous Monk on May 21, 2007 at 12:45 UTC
    I think that /.*\z/ should match any string indeed, and that the regex engine has a bug here.
        In r31303 of bleadperl this bug is fixed:
        $ perl5.9.5 -E 'say "match" if "f\n" ~~ /.*\z/'
        match
        
      I don't think it's a bug.

      When the match is in /m mode .* will match anything BUT a newline. ( when in /s mode .* will match anything )
      I assume what it is trying to match is one line.

      So basically what this test does is :
      "Between all characters (on this line) that are not newlines, and the end of the string, are there any other characters?", if so, it won't match. If it doesn't match, the only character that can cause it is a newline.
      It does sound a bit like a roundabout way to get what you want though.
      How about if ( $foo !~ /\n\z/ )
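
The suggestion above, spelled out as a small sketch. The sub name ends_with_newline is hypothetical and not taken from Log::Handler; the only thing it relies on is that \z matches at the true end of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $str ends in a newline.
# /\n\z/ asks for a literal newline immediately before the end of the string,
# so it does not depend on how .* and \z interact.
sub ends_with_newline {
    my ($str) = @_;
    return $str =~ /\n\z/ ? 1 : 0;
}

for my $str ("foo\n", "foo", "", "foo\nbar") {
    (my $shown = $str) =~ s/\n/\\n/g;    # make newlines visible
    printf "%-12s ends_with_newline=%d\n", qq{"$shown"}, ends_with_newline($str);
}
```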

      BTW. setting $/ has no influence on /m or /s whatsoever?
      Not that I could find with experimentation.

      if( exists $aeons{strange} ){ die $death unless ( $death%2 ) }
        .* will match anything but a newline, or the empty string.

        So I'd expect "foo\n" =~ /.*\z/; to match, but capture the empty string in $&, not "foo\n".

        Of course there are more elaborate ways to match for a newline character ;-)

        One problem tho, the following all match the string "\n":
        /.*/
        /\z/
        /.{0}\z/
        It's possible that \z is meant to introduce some specialness when combined with .* (or possibly some other quantifiers), but I haven't seen it mentioned in any docs. This is either a bug, or a very poorly documented feature.
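
To make that comparison easy to repeat on different perl versions, here is a rough sketch that runs several patterns from this thread against "\n". The table layout and the sub name check_all are made up for illustration. Each pattern can match the empty string at the very end of the input, so on a perl without the bug one would expect all of them to report a match.

```perl
use strict;
use warnings;

# Patterns discussed in this thread, paired with a printable label.
my @patterns = (
    [ '/.*/',       qr/.*/       ],
    [ '/\z/',       qr/\z/       ],
    [ '/.{0}\z/',   qr/.{0}\z/   ],
    [ '/.{0,}\z/',  qr/.{0,}\z/  ],
    [ '/[^\n]*\z/', qr/[^\n]*\z/ ],
    [ '/.*\z/',     qr/.*\z/     ],    # the one reported as failing
);

# Returns a list of [label, 1|0] pairs for the given string.
sub check_all {
    my ($str) = @_;
    return map { [ $_->[0], ($str =~ $_->[1] ? 1 : 0) ] } @patterns;
}

printf "%-12s %s\n", $_->[0], ($_->[1] ? "match" : "no match") for check_all("\n");
```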
        According to your reasoning, the first of the following one-liners shouldn't print anything either:
        $ perl -lwe 'print "match" if "foo\n" =~ /[^\n]*\z/'
        match
        $ perl -lwe 'print "match" if "foo\n" =~ /.*\z/'
      No, it's not a bug. check carefully what's the difference between \z and \Z. and check the following samples:
      perl -e 'print "match\n" if "foo\n" =~ /.*\z/'
      perl -e 'print "match\n" if "foo\n" =~ /.*\Z/'
      perl -e 'print "match\n" if "foo\n\n\n" =~ /.*\Z/'
      Update: the third one matches just coz of .* in use. \Z can not keep multiple newlines.

      Regards,
      Xicheng
        Fair enough, but try:
        perl -e 'print "match\n" if "foo\n" =~ /.{0,}\z/'
        AFAIK, .* and .{0,} should be exactly equivalent, but when combined with \z they are not, if the string ends in a newline.

        There definitely appears to be a bug here, but it may be that the above snippet should not match, rather than the version with .* matching.
        Indeed. Quoting and a bit paraphrasing "Mastering Regular Expressions 2nd Edition":
        A match mode can change the meaning of "$" to match before any embedded newline (or Unicode line terminator as well). When supported, "\Z" usually matches what the "unmoded" "$" matches, which often means to match at the end of the string, or before a string-ending newline. To complement these, "\z" matches only at the end of the string, period, without regard to any newline.
        ..
        //s stands for Single Line Mode which makes the dot match any character.
        ..
        //m stands for Multi Line Mode which changes how ^ & $ are considered by the regex engine. ^ is then begin of 1 line out of the many lines in the string and not begin of string and $ is end of 1 line out of the many lines in the string and not end of string.
        ..
        Caret "^" matches at the beginning of the text being searched, and, if in an enhanced line-anchor match mode after any newline.
        ..
        \A always matches only at the start of the text being searched, regardless of single or multi line match mode.
        ..
        "\Z" matches what the "unmoded" "$" matches, which means to match at the end of the string, or before a string-ending newline. To complement these, "\z" matches only at the end of the string, period, without regard to any newline.
        With thanks to Jeffrey Friedl's Regex Holy Book! ;-)
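
The rules quoted above can be checked with a short script. This is only a sketch; the case table and the sub name run_cases are invented for the example, and the sample strings are arbitrary.

```perl
use strict;
use warnings;

# Each case: printable label, subject string, compiled pattern.
my @cases = (
    [ '"foo\n" =~ /foo$/',    "foo\n", qr/foo$/    ],   # $ allows a trailing newline
    [ '"foo\n" =~ /foo\Z/',   "foo\n", qr/foo\Z/   ],   # so does \Z
    [ '"foo\n" =~ /foo\z/',   "foo\n", qr/foo\z/   ],   # \z is the true end only
    [ '"foo\n" =~ /foo\n\z/', "foo\n", qr/foo\n\z/ ],
    [ '"a\nb" =~ /a.b/',      "a\nb",  qr/a.b/     ],   # dot skips newline
    [ '"a\nb" =~ /a.b/s',     "a\nb",  qr/a.b/s    ],   # /s lets dot match it
    [ '"a\nb" =~ /^b/',       "a\nb",  qr/^b/      ],   # ^ is start of string
    [ '"a\nb" =~ /^b/m',      "a\nb",  qr/^b/m     ],   # /m makes ^ per line
);

# Returns one 1|0 result per case, in table order.
sub run_cases {
    return map { $_->[1] =~ $_->[2] ? 1 : 0 } @cases;
}

my @results = run_cases();
printf "%-24s %s\n", $cases[$_][0], ($results[$_] ? "match" : "no match")
    for 0 .. $#cases;
```

Note that none of these cases combines .* with \z, so they should behave the same on the perl versions discussed in this thread.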
Re: Strange regex to test for newlines: /.*\z/
by shmem (Canon) on May 21, 2007 at 13:37 UTC
    If you have a newline in the string, it's multiline, so you need the 's' modifier:
    perl -le '$_ = "foo\n";print "string with trailing newline" if !/.*\z/ and /.*\z/s'
    string with trailing newline
    perl -le '$_ = "foo\nbar";print "string with trailing newline" if !/.*\z/ and /.*\z/s'

    Otherwise the matching stops at the newline, but that isn't the end of the string. It is a single line if you match the end with '$', but after the \n, you are on the next line, and the end of the string happens to be there. How can I put it? It seems logical to me, but I've got to struggle yet with wording.. I'll update this post until I've got it, sorry for that.

    update - seems like Ojosh!ro found the right words. Ojosh!ro++, thanks :-)

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      Why should I need an /s for multiline strings?
      $ perl -e 'print "match\n" if "foo\nbar" =~ m/bar/;'
      match
      So when an ordinary string after a \n is matched, why should an empty string, here represented by .*, fail to match?

      After all, the regex is not anchored to the start of the string.

        Because in a //m, the end of string matching "f\n" is set before the '\n' if the '\n' is trailing. The '\n' is skipped in the match, but the position after "f" isn't the end of the string:
        perl -D512 -e '$_ = "f\n";/.*\z/'
        Compiling REx `.*\z'
        size 4 Got 36 bytes for offset annotations.
        first at 2
        rarest char at 0
           1: STAR(3)
           2:   REG_ANY(0)
           3: EOS(4)
           4: END(0)
        floating ""$ at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 0
        Offsets: [4]
                2[1] 1[1] 3[2] 5[0]
        Omitting $` $& $' support.

        EXECUTING...

        Guessing start of match, REx ".*\z" against "f "...
        Found floating substr ""$ at offset 1...
        Position at offset 0 does not contradict /^/m...
        Guessed: match at offset 0
        Matching REx ".*\z" against "f "
          Setting an EVAL scope, savestack=3
           0 <> <f >    |  1:  STAR
                                   REG_ANY can match 1 times out of 2147483647...
          Setting an EVAL scope, savestack=3
           1 <f> < >    |  3:    EOS
                                      failed...
                                    failed...
        Guessing start of match, REx ".*\z" against " "...
        Found floating substr ""$ at offset 0...
        Position at offset 0 does not contradict /^/m...
        Guessed: match at offset 0
          Setting an EVAL scope, savestack=3
           1 <f> < >    |  1:  STAR
                                   REG_ANY can match 0 times out of 2147483647...
          Setting an EVAL scope, savestack=3
           1 <f> < >    |  3:    EOS
                                      failed...
                                    failed...
        Match failed
        Freeing REx: `".*\\z"'

        The matching isn't extended after the "\n". Whereas here

        perl -D512 -e '$_ = "f\n";/.*\z/s'
        Compiling REx `.*\z'
        size 4 Got 36 bytes for offset annotations.
        first at 2
        rarest char at 0
           1: STAR(3)
           2:   SANY(0)
           3: EOS(4)
           4: END(0)
        floating ""$ at 0..2147483647 (checking floating) anchored(SBOL) implicit minlen 0
        Offsets: [4]
                2[1] 1[1] 3[2] 5[0]
        Omitting $` $& $' support.

        EXECUTING...

        Guessing start of match, REx ".*\z" against "f "...
        Found floating substr ""$ at offset 1...
        Guessed: match at offset 0
        Matching REx ".*\z" against "f "
          Setting an EVAL scope, savestack=6
           0 <> <f >    |  1:  STAR
                                   SANY can match 2 times out of 2147483647...
          Setting an EVAL scope, savestack=6
           2 <f > <>    |  3:    EOS
           2 <f > <>    |  4:    END
        Match successful!
        Freeing REx: `".*\\z"'

        you can see that the '\z' (<> in the debug output) is found after the "\n":

          Setting an EVAL scope, savestack=6
           2 <f > <>    |  3:    EOS
           2 <f > <>    |  4:    END

        --shmem

Re: Strange regex to test for newlines: /.*\z/
by avar (Beadle) on May 21, 2007 at 23:13 UTC
    To elaborate a bit: what should probably happen is that the regex engine always backtracks on .* into the equivalent of NOTHING, but instead it commits to the match and fails under /m once it reaches \n.
    perl5.9.5 -Mre=debug -e 'my @re = (qr/.*\z/, qr/.?\z/, qr/(|.+)\z/); "\n" ~~ $_ for @re'
    
    Compiling REx ".*\z"
    Final program:
       1: STAR (3)
       2:   REG_ANY (0)
       3: EOS (4)
       4: END (0)
    floating ""$ at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 0 
    Compiling REx ".?\z"
    Final program:
       1: CURLY {0,1} (4)
       3:   REG_ANY (0)
       4: EOS (5)
       5: END (0)
    floating ""$ at 0..1 (checking floating) minlen 0 
    Compiling REx "(|.+)\z"
    Final program:
       1: OPEN1 (3)
       3:   BRANCH (5)
       4:     NOTHING (8)
       5:   BRANCH (FAIL)
       6:     PLUS (8)
       7:       REG_ANY (0)
       8: CLOSE1 (10)
      10: EOS (11)
      11: END (0)
    floating ""$ at 0..2147483647 (checking floating) minlen 0 
    Guessing start of match in sv for REx ".*\z" against "%n"
    Found floating substr ""$ at offset 0...
    Position at offset 0 does not contradict /^/m...
    Guessed: match at offset 0
    Matching REx ".*\z" against "%n"
       0 <> <%n>                 |  1:STAR(3)
                                      REG_ANY can match 0 times out of 2147483647...
       0 <> <%n>                 |  3:  EOS(4)
                                        failed...
                                      failed...
    Match failed
    Guessing start of match in sv for REx ".?\z" against "%n"
    Found floating substr ""$ at offset 0...
    Guessed: match at offset 0
    Matching REx ".?\z" against "%n"
       0 <> <%n>                 |  1:CURLY {0,1}(4)
                                      REG_ANY can match 0 times out of 1...
       0 <> <%n>                 |  4:  EOS(5)
                                        failed...
                                      failed...
       1 <%n> <>                 |  1:CURLY {0,1}(4)
                                      REG_ANY can match 0 times out of 1...
       1 <%n> <>                 |  4:  EOS(5)
       1 <%n> <>                 |  5:  END(0)
    Match successful!
    Guessing start of match in sv for REx "(|.+)\z" against "%n"
    Found floating substr ""$ at offset 0...
    Guessed: match at offset 0
    Matching REx "(|.+)\z" against "%n"
       0 <> <%n>                 |  1:OPEN1(3)
       0 <> <%n>                 |  3:BRANCH(5)
       0 <> <%n>                 |  4:  NOTHING(8)
       0 <> <%n>                 |  8:  CLOSE1(10)
       0 <> <%n>                 | 10:  EOS(11)
                                        failed...
       0 <> <%n>                 |  5:BRANCH(8)
       0 <> <%n>                 |  6:PLUS(8)
                                      REG_ANY can match 0 times out of 2147483647...
                                      failed...
       1 <%n> <>                 |  1:OPEN1(3)
       1 <%n> <>                 |  3:BRANCH(5)
       1 <%n> <>                 |  4:  NOTHING(8)
       1 <%n> <>                 |  8:  CLOSE1(10)
       1 <%n> <>                 | 10:  EOS(11)
       1 <%n> <>                 | 11:  END(0)
    Match successful!
    Freeing REx: ".*\z"
    Freeing REx: ".?\z"
    Freeing REx: "(|.+)\z"
    
      I don't quite agree with the statement about what '.*' should match.

      As I understand it, '.' should always ignore newlines unless the /s modifier is used. That means that '.+' and '.*' are just repeated matches of '.' and should still ignore newlines.

      Now I understand $ and \z as follows: $ matches at the end of the string or before a newline at the end (quoting perldoc), while \z matches only at the very end of the string and not before that newline.
      print "foo matched\n"         if "foo\n"     =~  /^foo$/;
      print "bar matched\n"         if "bar\n"     =~  /^bar$ \n/x;   # $ before end or newline
      print "baz doesn't match\n"   if "baz\n"     !~  /^baz\z/;
      print "foobar matched\n"      if "foobar\n"  =~  /^foobar\n\z/; # \z after newline
      
      print "match foo\n"         if "foo\n" =~ /.*$/;     # .* ignore newline and $  is before newline
      print "doesn't match bar\n" if "bar\n" !~ /.*\z/;    # .* ignore newline and \z is after  newline
      print "match baz\n"         if "baz\n" =~ /.?\z/;    # but what the hell happens here?
      
      for ( qr/(.?)\n\z/, qr/(.?)\z/ ) {
         "hello world\n" =~ $_;
         print "-$1-\n";
      }
      
      -d-
      --
      
      It seems that '.?' ignores the newline as expected and searches on after the newline with '.?\z', because it searches _until_ '\z'. It also seems that '.*' matches up to the newline and not between '\n' and '\z'. '.*' is greedy, '.?' is not. Maybe I misunderstand it.
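
One way to see what happens in the two loop patterns above is to print where each match starts. A rough sketch follows; the helper name where_matched is invented for the example. As far as I can tell, '.?' is greedy as well; the second pattern succeeds because the engine retries at later start positions until '.?' can match zero characters directly in front of '\z'.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [start offset of the match, text in $1],
# or undef if the pattern does not match.
sub where_matched {
    my ($str, $re) = @_;
    return $str =~ $re ? [ $-[0], $1 ] : undef;
}

for my $re (qr/(.?)\n\z/, qr/(.?)\z/) {
    my $hit = where_matched("hello world\n", $re);
    print $re, ": ",
        ($hit ? "start=$hit->[0] captured='$hit->[1]'" : "no match"), "\n";
}
```

With "hello world\n" the first pattern should start at offset 10 and capture "d", and the second should start at offset 12, after the newline, and capture the empty string. That agrees with the -d- and -- output shown above.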
        $ and \Z work pretty much the same in normal mode, both match the end of search string or before a string-ending newline. the difference between them lies in the multiline mode when you issue an 'm' modifier.

        \z means the real end of string even after the string-ending newline.

        If you use an 's' modifier, then things become more different but that's mainly coz of the '.' which changes its behaviors, not the three end-of-string anchors..

        check the following snippets:
        perl -e 'print "match\n" if "foo\n" =~ /.+$/'           # ok #
        perl -e 'print "match\n" if "foo\n" =~ /.+\z/'
        perl -e 'print "match\n" if "foo\n" =~ /.+\Z/'          # ok #
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\Z/'
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\z/'
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+$/'
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+$/m'      # ok #
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\z/m'
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\Z/m'
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\Z/s'     # ok #
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\z/s'     # ok #
        perl -e 'print "match\n" if "foo\n\n\n" =~ /.+$/s'      # ok #
        BTW. When comparing \z, \Z and $, it's probably better to avoid using the .* or .? quantifiers the way your examples do.
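
The same comparison can be run as one script instead of twelve one-liners. This is just a sketch; the table layout and the sub name anchor_table are made up for the example. It uses .+ like the snippets above, so it stays clear of the .* question.

```perl
use strict;
use warnings;

# One row per anchor: label, then the pattern with no modifier, /m and /s.
my @rows = (
    [ '.+$',  qr/.+$/,  qr/.+$/m,  qr/.+$/s  ],
    [ '.+\Z', qr/.+\Z/, qr/.+\Z/m, qr/.+\Z/s ],
    [ '.+\z', qr/.+\z/, qr/.+\z/m, qr/.+\z/s ],
);

# Returns one [label, plain, m, s] row per anchor; 'ok' means the pattern matched.
sub anchor_table {
    my ($str) = @_;
    return map {
        my ($label, @res) = @$_;
        [ $label, map { $str =~ $_ ? 'ok' : '--' } @res ];
    } @rows;
}

for my $str ("foo\n", "foo\n\n\n") {
    (my $shown = $str) =~ s/\n/\\n/g;    # make newlines visible
    print qq{"$shown"\n};
    printf "  %-5s plain=%s /m=%s /s=%s\n", @$_ for anchor_table($str);
}
```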

        BTW. my previous statement about \Z had some error and I have updated that post.

        Regards,
        Xicheng
Re: Strange regex to test for newlines: /.*\z/
by demerphq (Chancellor) on May 28, 2007 at 17:48 UTC

    I posted a patch to fix this bug today. So it should be resolved in the next release of blead and in Perl 5.10. Thanks for reporting it. And thanks to whoever filed the perlbug on it too (if that wasn't you).

    update: The patch was applied as change 31303; it can be found in the perl.git commits.

    ---
    $world=~s/war/peace/g
