PerlMonks  

When exactly do Perl regex's require a full match on a string?

by ELISHEVA (Prior)
on Feb 08, 2009 at 12:46 UTC (#742253=perlquestion)
ELISHEVA has asked for the wisdom of the Perl Monks concerning the following question:

I'm confused about when a Perl regex needs a full match on a string.

Recently in a post somebody suggested that '$' matched both the end of a string and a newline. On the other hand the Perl docs (http://perldoc.perl.org/perlre.html#Modifiers) suggest that '$' matches the boundary created by the new line/end of file/string/stream rather than the actual thing that created the boundary, i.e. it does not consume the thing that created the boundary. This would mean that '^a$' should be a partial match on "a\n" and '^a$\n' should be a full match.

To test this hypothesis I wrote up a small script comparing the results of matching "a\n" and "a\n\n" with three different regexes: /^a$/, /^a$\n/, and /^a$\z/:

# Note: To keep Perl from resolving "$\n" as the variable "$\"
# followed by the letter "n", this code sample constructs regexen
# using non-interpolating quotes.
use strict;
use warnings;

my @aRegexTests = (
    ["a\n",   '^a$',   '$ matches boundary, maybe more?'],
    ["a\n",   '^a$\n', '$ matches only boundary, \n matches newline'],
    ["a\n",   '^a$\z', '$ matches only boundary, \z fails because of newline?'],
    ["a\n\n", '^a$',   '$ matches only boundary, \n matches first newline'],
    ["a\n\n", '^a$\n', '$ matches only boundary, \n matches first newline?'],
);

foreach (@aRegexTests) {
    my ($sString, $sRegex, $sComment) = @$_;

    # Show newlines in the test string as a visible \n
    my $sPrint = $sString;
    $sPrint =~ s/\n/\\n/g;
    print "string=<$sPrint>\n";

    my $sMatch = ($sString =~ /$sRegex/) ? "match" : "no match";
    print " no modifier: "
        . "regex=/$sRegex/\n $sMatch => $sComment\n";

    $sMatch = ($sString =~ /$sRegex/s) ? "match" : "no match";
    print " s modifier (single line mode): "
        . "regex=/$sRegex/s\n $sMatch => $sComment\n";

    $sMatch = ($sString =~ /$sRegex/m) ? "match" : "no match";
    print " m modifier (multi line mode): "
        . "regex=/$sRegex/m\n $sMatch => $sComment\n";
}

which outputs

string=<a\n>
 no modifier: regex=/^a$/
 match => $ matches boundary, maybe more?
 s modifier (single line mode): regex=/^a$/s
 match => $ matches boundary, maybe more?
 m modifier (multi line mode): regex=/^a$/m
 match => $ matches boundary, maybe more?
string=<a\n>
 no modifier: regex=/^a$\n/
 match => $ matches only boundary, \n matches newline
 s modifier (single line mode): regex=/^a$\n/s
 match => $ matches only boundary, \n matches newline
 m modifier (multi line mode): regex=/^a$\n/m
 match => $ matches only boundary, \n matches newline
string=<a\n>
 no modifier: regex=/^a$\z/
 no match => $ matches only boundary, \z fails because of newline?
 s modifier (single line mode): regex=/^a$\z/s
 no match => $ matches only boundary, \z fails because of newline?
 m modifier (multi line mode): regex=/^a$\z/m
 no match => $ matches only boundary, \z fails because of newline?
string=<a\n\n>
 no modifier: regex=/^a$/
 no match => $ matches only boundary, \n matches first newline
 s modifier (single line mode): regex=/^a$/s
 no match => $ matches only boundary, \n matches first newline
 m modifier (multi line mode): regex=/^a$/m
 match => $ matches only boundary, \n matches first newline
string=<a\n\n>
 no modifier: regex=/^a$\n/
 no match => $ matches only boundary, \n matches first newline
 s modifier (single line mode): regex=/^a$\n/s
 no match => $ matches only boundary, \n matches first newline
 m modifier (multi line mode): regex=/^a$\n/m
 match => $ matches only boundary, \n matches first newline

It would appear that my original question (is /^a$/ a partial match?) was answered in the affirmative, but it was quickly replaced by another: why do the regexes /^a$/ and /^a$\n/ match "a\n\n" in only the multi-line mode? They match "a\n" (only one \n) in all three modes. The regex doesn't end in "\z" so why does it care that the second "\n" is unmatched? Surely I am misunderstanding something?

Thanks in advance, beth

Update 1: Fixed various typos

Update 2: I'm wondering if maybe the absence of the m modifier means only 0 or 1 new lines allowed? - [addendum 2009.02.08 - The answer to this is a resounding no - see post below by jethro for citation from perl docs and here for test examples.]

Update 3: Added comment to code explaining how above script keeps Perl from thinking "$\n" is the variable "$\" followed by the letter "n". My apologies for any confusion the absence of this comment caused.

Re: When exactly do Perl regex's require a full match on a string?
by jethro (Monsignor) on Feb 08, 2009 at 14:34 UTC

    Without any modifiers $ matches the end of the string. But for practical reasons it sort of ignores a newline there. The practical reason is that you often read in lines from files and want to match without first having to chomp the line. A convenience for very small scripts and one liners. Here is a relevant citation from the perlre man page:

    By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by "^" or "$". You may, however, wish to treat a string as a multi-line buffer, such that the "^" will match after any newline within the string (except if the newline is the last character in the string), and "$" will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting $*, but this practice has been removed in perl 5.9.)
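    A minimal sketch of what that passage describes, using a made-up two-line string (the variable names are only for illustration). It checks "$" after "ab", which sits before an embedded newline, and after "cd", which sits before the final newline:

```perl
use strict;
use warnings;

my $text = "ab\ncd\n";

# Without /m, "$" can match only at the very end of the string or
# just before a newline that is the last character of the string.
my $default = $text =~ /^ab$/  ? 1 : 0;   # 0: the newline after "ab" is embedded
my $multi   = $text =~ /^ab$/m ? 1 : 0;   # 1: /m lets "$" match before any newline
my $last    = $text =~ /cd$/   ? 1 : 0;   # 1: "$" matches before the final newline

print "default: $default, multi: $multi, last: $last\n";
```

    This should print "default: 0, multi: 1, last: 1".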

Re: When exactly do Perl regex's require a full match on a string?
by oshalla (Deacon) on Feb 08, 2009 at 14:56 UTC
    why do the regexes /^a$/ and /^a$\n/ match "a\n\n" in only the multi-line mode (they match "a\n" in all three modes)?

    I think the key here is that, absent /m: $ matches end-of-string or just before a newline at end-of-string. So, /^a$/ matches "a" and "a\n", but not "a\n\n" or "a\nq" -- the latter two because something follows the first "\n", so that newline is not at end-of-string and $ cannot match before it.

    Now: /^a$\n/ is a bit odd. It matches "a\n", which I think we can read as: (a) the $ successfully matching in front of the \n, and then the \n matching the \n. It does not match "a" -- because while the $ matches, the \n does not. It also does not match "a\n\n" -- because the $ does not match. [m/.+$\n/ can be read as requiring a not-empty string terminated by exactly one \n.]

    Looking at /^a$/m, now $ will match end-of-string or just before a newline anywhere in the string. So now it matches "a\n\n" because $ matches before the first \n (under /m it doesn't matter that it's not at end of string).

    And /^a$\n/m, matches "a\n\n" because $ matches before the first \n, then the \n matches the first \n in the string.
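    The cases above can be tabulated with a short sketch. The four test strings are the ones discussed in this reply; the variable names and the %result hash are only for illustration. The patterns use single-quote delimiters so that nothing in them is interpolated:

```perl
use strict;
use warnings;

my @strings = ("a", "a\n", "a\n\n", "a\nq");

# qr'...' does not interpolate, so "$" is always the anchor here.
my $plain = qr'^a$';
my $multi = qr'^a$'m;

my %result;
for my $s (@strings) {
    # Make newlines visible when printing and when used as hash keys.
    (my $shown = $s) =~ s/\n/\\n/g;
    $result{$shown} = [ ($s =~ $plain ? 1 : 0), ($s =~ $multi ? 1 : 0) ];
    printf "%-8s no /m: %d   /m: %d\n", "<$shown>", @{ $result{$shown} };
}
```

    Without /m only "a" and "a\n" should match; with /m all four should.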

    In passing, I note that Perl accepts m'^a$q' which can never match... unless /m is somehow implied (eg Regexp::Autoflags ?) Perhaps it's just too hard to spot the degenerate case ?

    Update: with thanks to AnomalousMonk for pointing out my soggy thinking, below -- of course, when $ matches a \n it matches before it. So only m/$\n/ can hope to match ! (I knew that, dammit.)

      [...] Perl accepts  m'^a$q' which can never match... unless  /m is somehow implied [...]
      With or without the /m modifier, it can never match against any string whatsoever because as the regex is defined,  $ is required to match before something other than an end-of-string or newline:  'q' follows it in the regex.

      If the regex is defined with a newline to follow the  $ metacharacter, if the  /m modifier is used and if the interpolation-suppressing  ' (single-quote) character is used as the regex delimiter, then a match is possible against a string with an embedded newline:

      >perl -wMstrict -le
      "my $s = qq{a\nq};
       print $s =~ m'^a$q'    ? ' ' : 'NO ', 'match';
       print $s =~ m'^a$q'm   ? ' ' : 'NO ', 'match';
       print $s =~ m'^a$\nq'  ? ' ' : 'NO ', 'match';
       print $s =~ m/^a$\nq/m ? ' ' : 'NO ', 'match';
       print $s =~ m'^a$\nq'm ? ' ' : 'NO ', 'match';
      "
      NO match
      NO match
      NO match
      NO match
       match
Re: When exactly do Perl regex's require a full match on a string?
by jwkrahn (Monsignor) on Feb 08, 2009 at 15:54 UTC

    Zero-width assertions like  ^, $, \A, \Z, \z, \b, \B and \G match at a position in a string; they do not match a character.   So  $ and \Z will normally match at the position before the newline at the end of the string unless a) there is no newline at the end of the string, or b) the pattern before the assertion would also match a newline.

    $ perl -e'
    use Data::Dumper;
    $Data::Dumper::Useqq = 1;
    for ( "ab\ncd", "ab\ncd\n" ) {
        /\w*$/   && print Dumper $&;
        /\w*\Z/  && print Dumper $&;
        /.*$/    && print Dumper $&;
        /.*\Z/   && print Dumper $&;
        /.*$/m   && print Dumper $&;
        /.*\Z/m  && print Dumper $&;
        /.*$/s   && print Dumper $&;
        /.*\Z/s  && print Dumper $&;
        print "\n";
        }
    '
    $VAR1 = "cd";
    $VAR1 = "cd";
    $VAR1 = "cd";
    $VAR1 = "cd";
    $VAR1 = "ab";
    $VAR1 = "cd";
    $VAR1 = "ab\ncd";
    $VAR1 = "ab\ncd";

    $VAR1 = "cd";
    $VAR1 = "cd";
    $VAR1 = "cd";
    $VAR1 = "cd";
    $VAR1 = "ab";
    $VAR1 = "cd";
    $VAR1 = "ab\ncd\n";
    $VAR1 = "ab\ncd\n";

    Also, you are using the /s modifier which only affects whether the . metacharacter will match a newline or not, and you are not using the . metacharacter in your patterns.
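    A small sketch of that point, with strings and variable names chosen only for illustration: /s leaves the result of a "$" test unchanged, while it does change what "." matches.

```perl
use strict;
use warnings;

my $str = "a\n\n";

# /s does not touch "$": both tests fail because "a" is followed by
# two newlines, and "$" (without /m) tolerates only one final newline.
my $dollar_plain = $str =~ /^a$/  ? 1 : 0;
my $dollar_s     = $str =~ /^a$/s ? 1 : 0;

# /s does change ".": it matches a newline only under /s.
my $dot_plain = "a\nb" =~ /a.b/  ? 1 : 0;
my $dot_s     = "a\nb" =~ /a.b/s ? 1 : 0;

print "dollar: $dollar_plain $dollar_s, dot: $dot_plain $dot_s\n";
```

    This should print "dollar: 0 0, dot: 0 1".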

      So $ and \Z will normally match at the position before the newline at the end of the string unless a) there is no newline at the end of the string, or b) the pattern before the assertion would also match a newline.

      Zero-widthness was also my starting point, but it is exactly what raised the question I asked. My sense is that Oshalla has it right when he says "absent the m modifier, $ matches end-of-string or just before a newline at end-of-string".

      As the examples below show, absent the m modifier, '$' does not match [before] an internal new line, but it is perfectly happy matching [before] a final newline after an internal newline:

      string=<\n\n>
       no modifier: regex=/$\n\n/
       no match => $ needs m modifier to match internal nl
       m modifier (multi line mode): regex=/$\n\n/m
       match => $ needs m modifier to match internal nl
      string=<\n\n>
       no modifier: regex=/\n$\n/
       match => $ matches final nl after internal nl
       m modifier (multi line mode): regex=/\n$\n/m
       match => $ matches final nl after internal nl
      string=<\n\n>
       no modifier: regex=/\n\n/
       match =>
       m modifier (multi line mode): regex=/\n\n/m
       match =>

      Best, beth

      Update: added [before] to make it clearer that the zero-widthness of '$' wasn't at issue, but rather which newline was being matched by the zero-width '$' - thanks jwkrahn for pointing out that it wasn't clear that was meant.

        A regex like /\n$\n/ doesn't really make any sense since matching after the end of the string is like asking for the 11th value of a 10 value array. So whether it matches "\n\n" is quite academic. I could live with a perl that does not have a consistent answer for this. The important cases IMO are:

        > perl -e ' $_="a\n"; print "match\n" if (m/^a\n$/); '
        match
        > perl -e ' $_="a\n"; print "match\n" if (m/^a$/); '
        match

        Which means you can match the \n if you want, but you don't need to.

        Here's the problem: With or without the /m modifier, the regex  /\n$\n/ does not match against the  "\n\n" string!
        >perl -wMstrict -le
        "my $s = qq{\n\n};
         print $s =~ /(\n$\n)/ ? qq{:$1:} : 'no match';
        "
        no match
        >perl -wMstrict -le
        "my $s = qq{\n\n};
         print $s =~ /(\n$\n)/m ? qq{:$1:} : 'no match';
        "
        no match
        The reason is that the  $\ sequence in the regex is taken as the  $\ 'output record separator' Perl special variable (undefined by default, but set to a newline by the -l switch used here) and interpolated as such in the regex, which thus becomes equivalent to  / \n \n n /x (note the /x modifier).

        If the regex is disambiguated as  / \n $ \n /x (again, note the /x modifier), the regex matches both with and without the /m modifier.

        >perl -wMstrict -le
        "my $s = qq{\n\n};
         print $s =~ /( \n $ \n )/x ? qq{:$1:} : 'no match';
        "
        :

        :
        >perl -wMstrict -le
        "my $s = qq{\n\n};
         print $s =~ /( \n $ \n )/xm ? qq{:$1:} : 'no match';
        "
        :

        :
        In many of the examples in other replies in this thread, the ambiguity of  $\ in a regex that arises from interpolation is not taken into account and causes (or can cause) confoosion.
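        The pitfall and two ways around it can be put side by side in a script. This is only a sketch: the variable names are made up, $\ is set by hand to mimic the -l switch of the one-liners above, and the 'ambiguous' warning that newer perls may emit for "$\" in a regex is silenced on purpose.

```perl
use strict;
use warnings;

my $s = "a\n";
my ($interpolated, $single_quoted, $spaced);

{
    # Mimic the -l switch, which sets $\ to "\n".
    local $\ = "\n";
    # Newer perls may warn about "$\" inside a regex; silenced for the demo.
    no warnings 'ambiguous';

    # "$\" is interpolated, so the pattern is really: ^a, a newline, "n".
    $interpolated = $s =~ /^a$\n/ ? 1 : 0;
}

# Single-quote delimiters suppress interpolation: ^a, "$" anchor, then \n.
$single_quoted = $s =~ m'^a$\n' ? 1 : 0;

# Under /x the space separates "$" from "\n", so "$" stays an anchor.
$spaced = $s =~ /^a $ \n/x ? 1 : 0;

print "interpolated: $interpolated, single-quoted: $single_quoted, spaced: $spaced\n";
```

        This should print "interpolated: 0, single-quoted: 1, spaced: 1". Building the pattern in a single-quoted string and interpolating that variable into the match, as the OP's script does, is a third way to keep "$\" from being read as a variable.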

        Update: Consider the following misleading output from the OP:

        string=<a\n>
         no modifier: regex=/^a$\n/
         match => $ matches only boundary, \n matches newline
         [ ... ]
         m modifier (multi line mode): regex=/^a$\n/m
         match => $ matches only boundary, \n matches newline
        In fact, neither regex matches:
        >perl -wMstrict -le
        "my $s = qq{a\n};
         print $s =~ /^a$\n/  ? ' ' : 'NO ', 'match';
         print $s =~ /^a$\n/m ? ' ' : 'NO ', 'match';
        "
        NO match
        NO match
        The reason for the confusion is that the regex is first defined as  '^a$\n' (i.e., within non-interpolating single-quotes) in the test code, then interpolated within the actual  // regex operator, in which case the  $\ sequence is not ultimately interpolated as the output record separator string.

        Again, after appropriate disambiguation, everything's fine:

        >perl -wMstrict -le
        "my $s = qq{a\n};
         print $s =~ /^ a $ \n/x  ? ' ' : 'NO ', 'match';
         print $s =~ /^ a $ \n/xm ? ' ' : 'NO ', 'match';
        "
         match
         match
        As the examples below show, absent the m modifier, '$' does not match an internal new line, but it is perfectly happy matching a final newline after an internal newline:

        Therein may lie your problem.   A newline is a character and  $ is a zero-width assertion which will never match a character.    :-)
