Re: Strange regex to test for newlines: /.*\z/
by shmem (Chancellor) on May 21, 2007 at 13:37 UTC
|
If you have a newline in the string, it's multiline, so you need the 's' modifier:
perl -le '$_ = "foo\n";print "string with trailing newline" if !/.*\z/ and /.*\z/s'
string with trailing newline
perl -le '$_ = "foo\nbar";print "string with trailing newline" if !/.*\z/ and /.*\z/s'
Otherwise the matching stops at the newline, but that isn't the end of the string. It is a single line if you match the end with '$', but after the \n you are on the next line, and the end of the string happens to be there. How can I put it? It seems logical to me, but I'm still struggling with the wording. I'll update this post until I've got it, sorry for that.
update - seems like Ojosh!ro found the right words. Ojosh!ro++, thanks :-)
--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
|
Why should I need an /s for multiline strings?
$ perl -e 'print "match\n" if "foo\nbar" =~ m/bar/;'
match
So when an ordinary string after a \n is matched, why should an empty string, here represented by .*, fail to match?
After all, the regex is not anchored to the start of the string.
|
Because in a //m, the end of string matching "f\n" is set before the '\n' if the '\n' is trailing. The '\n' is skipped in the match, but the position after "f" isn't the end of the string:
perl -D512 -e '$_ = "f\n";/.*\z/'
Compiling REx `.*\z'
size 4 Got 36 bytes for offset annotations.
first at 2
rarest char
at 0
1: STAR(3)
2: REG_ANY(0)
3: EOS(4)
4: END(0)
floating ""$ at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 0
Offsets: [4]
2[1] 1[1] 3[2] 5[0]
Omitting $` $& $' support.
EXECUTING...
Guessing start of match, REx ".*\z" against "f
"...
Found floating substr ""$ at offset 1...
Position at offset 0 does not contradict /^/m...
Guessed: match at offset 0
Matching REx ".*\z" against "f
"
Setting an EVAL scope, savestack=3
0 <> <f
> | 1: STAR
REG_ANY can match 1 times out of 2147483647...
Setting an EVAL scope, savestack=3
1 <f> <
> | 3: EOS
failed...
failed...
Guessing start of match, REx ".*\z" against "
"...
Found floating substr ""$ at offset 0...
Position at offset 0 does not contradict /^/m...
Guessed: match at offset 0
Setting an EVAL scope, savestack=3
1 <f> <
> | 1: STAR
REG_ANY can match 0 times out of 2147483647...
Setting an EVAL scope, savestack=3
1 <f> <
> | 3: EOS
failed...
failed...
Match failed
Freeing REx: `".*\\z"'
The matching isn't extended after the "\n". Whereas here
perl -D512 -e '$_ = "f\n";/.*\z/s'
Compiling REx `.*\z'
size 4 Got 36 bytes for offset annotations.
first at 2
rarest char
at 0
1: STAR(3)
2: SANY(0)
3: EOS(4)
4: END(0)
floating ""$ at 0..2147483647 (checking floating) anchored(SBOL) implicit minlen 0
Offsets: [4]
2[1] 1[1] 3[2] 5[0]
Omitting $` $& $' support.
EXECUTING...
Guessing start of match, REx ".*\z" against "f
"...
Found floating substr ""$ at offset 1...
Guessed: match at offset 0
Matching REx ".*\z" against "f
"
Setting an EVAL scope, savestack=6
0 <> <f
> | 1: STAR
SANY can match 2 times out of 2147483647...
Setting an EVAL scope, savestack=6
2 <f
> <> | 3: EOS
2 <f
> <> | 4: END
Match successful!
Freeing REx: `".*\\z"'
you can see that the '\z' (<> in the debug output) is found after the "\n":
Setting an EVAL scope, savestack=6
2 <f
> <> | 3: EOS
2 <f
> <> | 4: END
--shmem
Re: Strange regex to test for newlines: /.*\z/
by avar (Beadle) on May 21, 2007 at 23:13 UTC
|
To elaborate a bit: what should probably happen is that the regex engine always backtracks on .* into the equivalent of NOTHING, but instead it commits to the match and fails under /m once it reaches \n.
perl5.9.5 -Mre=debug -e 'my @re = (qr/.*\z/, qr/.?\z/, qr/(|.+)\z/); "\n" ~~ $_ for @re'
Compiling REx ".*\z"
Final program:
1: STAR (3)
2: REG_ANY (0)
3: EOS (4)
4: END (0)
floating ""$ at 0..2147483647 (checking floating) anchored(MBOL) implicit minlen 0
Compiling REx ".?\z"
Final program:
1: CURLY {0,1} (4)
3: REG_ANY (0)
4: EOS (5)
5: END (0)
floating ""$ at 0..1 (checking floating) minlen 0
Compiling REx "(|.+)\z"
Final program:
1: OPEN1 (3)
3: BRANCH (5)
4: NOTHING (8)
5: BRANCH (FAIL)
6: PLUS (8)
7: REG_ANY (0)
8: CLOSE1 (10)
10: EOS (11)
11: END (0)
floating ""$ at 0..2147483647 (checking floating) minlen 0
Guessing start of match in sv for REx ".*\z" against "%n"
Found floating substr ""$ at offset 0...
Position at offset 0 does not contradict /^/m...
Guessed: match at offset 0
Matching REx ".*\z" against "%n"
0 <> <%n> | 1:STAR(3)
REG_ANY can match 0 times out of 2147483647...
0 <> <%n> | 3: EOS(4)
failed...
failed...
Match failed
Guessing start of match in sv for REx ".?\z" against "%n"
Found floating substr ""$ at offset 0...
Guessed: match at offset 0
Matching REx ".?\z" against "%n"
0 <> <%n> | 1:CURLY {0,1}(4)
REG_ANY can match 0 times out of 1...
0 <> <%n> | 4: EOS(5)
failed...
failed...
1 <%n> <> | 1:CURLY {0,1}(4)
REG_ANY can match 0 times out of 1...
1 <%n> <> | 4: EOS(5)
1 <%n> <> | 5: END(0)
Match successful!
Guessing start of match in sv for REx "(|.+)\z" against "%n"
Found floating substr ""$ at offset 0...
Guessed: match at offset 0
Matching REx "(|.+)\z" against "%n"
0 <> <%n> | 1:OPEN1(3)
0 <> <%n> | 3:BRANCH(5)
0 <> <%n> | 4: NOTHING(8)
0 <> <%n> | 8: CLOSE1(10)
0 <> <%n> | 10: EOS(11)
failed...
0 <> <%n> | 5:BRANCH(8)
0 <> <%n> | 6:PLUS(8)
REG_ANY can match 0 times out of 2147483647...
failed...
1 <%n> <> | 1:OPEN1(3)
1 <%n> <> | 3:BRANCH(5)
1 <%n> <> | 4: NOTHING(8)
1 <%n> <> | 8: CLOSE1(10)
1 <%n> <> | 10: EOS(11)
1 <%n> <> | 11: END(0)
Match successful!
Freeing REx: ".*\z"
Freeing REx: ".?\z"
Freeing REx: "(|.+)\z"
|
I don't quite agree with the statement about what '.*' should match.
As I understand it, '.' should never match a newline unless the /s modifier is used. That means '.+' and '.*' are just repeated matches of '.' and should still not match newlines.
Now, I understand $ and \z as follows: $ matches at the end of the string and also just before a trailing newline (to quote perldoc), whereas \z matches only at the very end, not before the newline.
print "foo matched\n" if "foo\n" =~ /^foo$/;
print "bar matched\n" if "bar\n" =~ /^bar$ \n/x; # $ before end or newline
print "baz doesn't match\n" if "baz\n" !~ /^baz\z/;
print "foobar matched\n" if "foobar\n" =~ /^foobar\n\z/; # \z after newline
print "match foo\n" if "foo\n" =~ /.*$/; # .* ignore newline and $ is before newline
print "doesn't match bar\n" if "bar\n" !~ /.*\z/; # .* ignore newline and \z is after newline
print "match baz\n" if "baz\n" =~ /.?\z/; # but what the hell happens here?
for ( qr/(.?)\n\z/, qr/(.?)\z/ ) {
"hello world\n" =~ $_;
print "-$1-\n";
}
-d-
--
It seems that '.?' ignores the newline as expected and searches on after the newline with '.?\z', because it searches _until_ '\z'. It also seems that '.*' matches up to the newline and not between '\n' and '\z'. '.*' is greedy, '.?' not. Maybe I misunderstand it.
|
$ and \Z work pretty much the same in normal mode: both match at the end of the search string or before a string-ending newline. The difference between them shows up in multiline mode, when you add an 'm' modifier.
\z means the real end of the string, even after the string-ending newline.
If you use an 's' modifier, things differ more, but that's mainly because '.' changes its behavior, not the three end-of-string anchors.
check the following snippets:
perl -e 'print "match\n" if "foo\n" =~ /.+$/' # ok #
perl -e 'print "match\n" if "foo\n" =~ /.+\z/'
perl -e 'print "match\n" if "foo\n" =~ /.+\Z/' # ok #
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\Z/'
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\z/'
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+$/'
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+$/m' # ok #
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\z/m'
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\Z/m'
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\Z/s' # ok #
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+\z/s' # ok #
perl -e 'print "match\n" if "foo\n\n\n" =~ /.+$/s' # ok #
BTW, when comparing \z, \Z and $, it's probably better to avoid the .* or .? quantifiers the way you used them in your examples.
BTW, my previous statement about \Z had an error and I have updated that post.
Regards,
Xicheng
Re: Strange regex to test for newlines: /.*\z/
by bloonix (Monk) on May 21, 2007 at 12:45 UTC
|
That's correct, because .* matches everything but the newline and
\z looks for the end of the string, but the newline is still
there.
http://perldoc.perl.org/perlre.html
"To match the actual end of the string and not ignore an optional trailing newline, use \z ."
@tinita, @betterworld:
Thanks to you I saw that /m is not necessary in my
code!
Re: Strange regex to test for newlines: /.*\z/
by Joost (Canon) on May 21, 2007 at 12:28 UTC
|
It works because . does not match newlines by default. See the /s modifier in perlre.
|
It works because . does not match newlines by default.
i think betterworld knows this =)
the point is, should .* match the empty string between
foo\n and the end?
update:
"\n" =~ /\n.*\z/; # matches
"\n" =~ /.*\z/; # doesn't match. huh?
"\n" =~ /[^\n]*\z/; # matches. ??
|
It would work if you used $ instead. \z specifically asks for the end of the string and . won't match the \ns without a /s on the end. So .* and \z aren't adjacent in your first example, at least without the /s.
It probably would work with "\n" because .* could match 0 chars. But in your example, it already matched several chars before the \n, which it refuses to slurp, so it (meaning the regex engine) can't get to the \z at the end.
Of course, the more I think about it... you might argue that the .* should backtrack, skip over the \n and continue to match the 0 chars right before the \z. I believe \z has a special meaning though. It's not really matching characters. It's more of a border on things, like \b, so I suspect there really aren't 0 characters before it because it's not really there.
UPDATE: I'm completely convinced this is a bug based on "id-616551" below.
Re: Strange regex to test for newlines: /.*\z/
by Anonymous Monk on May 21, 2007 at 12:45 UTC
|
I think that /.*\z/ should match any string indeed, and that the regex engine has a bug here.
|
I don't think it's a bug.
Without /s, .* will match anything BUT a newline (with /s, .* will match anything).
I assume what it is trying to match is one line.
So basically what this test does is :
"Between all characters (on this line) that are not newlines, and the end of the string, are there any other characters?", if so, it won't match. If it doesn't match, the only character that can cause it is a newline.
It does sound a bit like a roundabout way to get what you want though.
How about if ( $foo !~ /\n\z/ )
BTW. setting $/ has no influence on /m or /s whatsoever? Not that I could find with experimentation.
if( exists $aeons{strange} ){ die $death unless ( $death%2 ) }
|
One problem tho, the following all match the string "\n":
/.*/
/\z/
/.{0}\z/
It's possible that \z is meant to introduce some specialness when combined with .* (or possibly some other quantifiers), but I haven't seen it mentioned in any docs. This is either a bug, or a very poorly documented feature.
|
.* will match anything but a newline, or the empty string.
So I'd expect
"foo\n" =~ /.*\z/;
to match, but capture the empty string in $&, not "foo\n".
Of course there are more elaborate ways to match for a newline character ;-)
|
According to your reasoning, the first of the following one-liners shouldn't print anything either:
$ perl -lwe 'print "match" if "foo\n" =~ /[^\n]*\z/'
match
$ perl -lwe 'print "match" if "foo\n" =~ /.*\z/'
|
In r31303 of bleadperl this bug is fixed:
$ perl5.9.5 -E 'say "match" if "f\n" ~~ /.*\z/'
match
|
No, it's not a bug. Check carefully what the difference between \z and \Z is, and check the following samples:
perl -e 'print "match\n" if "foo\n" =~ /.*\z/'
perl -e 'print "match\n" if "foo\n" =~ /.*\Z/'
perl -e 'print "match\n" if "foo\n\n\n" =~ /.*\Z/'
Update: the third one matches only because .* can match the empty string. \Z ignores just one trailing newline, not several.
Regards,
Xicheng
|
perl -e 'print "match\n" if "foo\n" =~ /.{0,}\z/'
AFAIK, .* and .{0,} should be exactly equivalent, but when combined with \z they are not, if the string ends in a newline.
There definitely appears to be a bug here, but it may be that the above snippet should not match, rather than the version with .* matching.
|
|
Indeed.
Quoting and a bit paraphrasing "Mastering Regular Expressions 2nd Edition":
A match mode can change the meaning of "$" to match before any embedded newline (or Unicode line terminator as well). When supported, "\Z" usually matches what the "unmoded" "$" matches, which often means to match at the end of the string, or before a string-ending newline. To complement these, "\z" matches only at the end of the string, period, without regard to any newline.
..
//s stands for Single Line Mode which makes the dot match any character.
..
//m stands for Multi Line Mode which changes how ^ & $ are considered by the regex engine. ^ is then begin of 1 line out of the many lines in the string and not begin of string and $ is end of 1 line out of the many lines in the string and not end of string.
..
Caret "^" matches at the beginning of the text being searched, and, if in an enhanced line-anchor match mode after any newline.
..
\A always matches only at the start of the text being searched, regardless of single or multi line match mode.
..
"\Z" matches what the "unmoded" "$" matches, which means to match at the end of the string, or before a string-ending newline. To complement these, "\z" matches only at the end of the string, period, without regard to any newline.
With thanks to Jeffrey Friedl's Regex Holy Book! ;-)
Re: Strange regex to test for newlines: /.*\z/
by demerphq (Chancellor) on May 28, 2007 at 17:48 UTC
|
I posted a patch to fix this bug today, so it should be resolved in the next release of blead and in Perl 5.10. Thanks for reporting it. And thanks to whoever filed the perlbug on it too (if that wasn't you).
update: Patch was applied as 31303 (find 31303 in the perl.git commits).
---
$world=~s/war/peace/g