PerlMonks  

left side of pattern matching

by BeneSphinx (Sexton)
on Mar 13, 2012 at 16:28 UTC ( [id://959401]=perlquestion )

BeneSphinx has asked for the wisdom of the Perl Monks concerning the following question:

I'm confused about the left side of pattern matching operators as they pertain to escaped characters. For the following program:
    use strict;
    use warnings;

    my $hello = "\n |\t\r";
    if ($hello =~ m/^[ \|\t\r\n]+$/) {
        print "yep this matches";
    }
    else {
        print "no this doesn't match";
    }
    exit;
This behaves as I "wanted", i.e. the pattern matches the $hello string, but I'm still a little confused as to why. From my reading of the Perl docs I know there are two passes: a limited double-quote parsing followed by a regular expression interpretation. I would expect that in the first pass those characters would be converted to their "real" forms, like actual newlines and tabs, and that the backslash would simply be removed from before the pipe. However, it seems this isn't happening. So, does that mean that the only form of interpretation in the first pass is to interpolate variables, or is there anything else? Thanks for your clarification.

Replies are listed 'Best First'.
Re: left side of pattern matching
by jwkrahn (Abbot) on Mar 13, 2012 at 16:48 UTC

    Yes, perl interpolates double quoted strings, including regular expression patterns used in the match and substitution operators. The gory details are in perlop: "Gory details of parsing quoted constructs".

    If you don't want string interpolation you can use single quoted operators:

    if ( $hello =~ m'^[ \|\t\r\n]+$' ) {

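    But then "\t", "\r" and "\n" won't be interpolated as string escapes; the regex engine still recognizes them, so the pattern matches the same characters.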
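
    A minimal sketch of what the single-quote delimiters change, assuming a made-up variable $name that is not part of the original question: with ordinary delimiters the variable is interpolated into the pattern, with m'...' it is not, and the escapes in the original character class keep working either way.

```perl
use strict;
use warnings;

my $hello = "\n |\t\r";
my $name  = 'abc';    # hypothetical variable, for illustration only

# Ordinary delimiters: $name is interpolated, so the pattern is /abc/.
my $interpolated = ('abc' =~ m/$name/) ? 1 : 0;

# Single-quote delimiters: no interpolation. The regex engine sees the
# characters "$name", i.e. an end-of-string anchor followed by "name",
# which cannot match 'abc'.
my $literal = ('abc' =~ m'$name') ? 1 : 0;

# The escapes still work with m'...': the regex engine interprets them.
my $escapes = ($hello =~ m'^[ \|\t\r\n]+$') ? 1 : 0;

print "interpolated=$interpolated literal=$literal escapes=$escapes\n";
```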

Re: left side of pattern matching
by dave_the_m (Monsignor) on Mar 14, 2012 at 00:13 UTC
    Literal regexes actually get 3 passes. The first pass, equivalent to single-quoted literals, just processes delimiters, plus backslash-delimiter and backslash-backslash to find the end of the string and extract it.

    The second pass (equivalent to double-quoted literals) splits the string up into chunks of constant string mixed with variable access, \U etc; for example, "ab$c\Ud" gets converted into 'ab' . $c . uc('d'). For a normal double-quoted literal string, it would at this point also process all backslashy stuff, e.g. \n, \x{100}; however, for a regexp literal, this part is skipped.

    Finally for regexps only, in the third stage the assembled string is passed to the regexp engine to be compiled. Here, the two characters \ and n are converted into a regexp op to match a newline, etc.
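
    One way to observe the stages described above is to look at how a compiled regex stringifies. This is only a sketch: it relies on qr// objects stringifying back to their pattern text, and the variable $c and its value are invented for the example.

```perl
use strict;
use warnings;

# A compiled regex stringifies back to its pattern text, so we can see
# what the regex engine was handed after the interpolation pass.
my $c = 'x';    # hypothetical variable, for illustration only

my $re_newline = qr/a\nb/;
my $re_upper   = qr/ab$c\Ud/;

# '\n' in single quotes is two characters: a backslash and an "n".
# Finding it in the stringified regex shows the pair was left alone.
my $kept_backslash_n = (index("$re_newline", '\n') >= 0) ? 1 : 0;

# $c and \U were already resolved before the regex engine saw the pattern.
my $resolved_upper = (index("$re_upper", 'abxD') >= 0) ? 1 : 0;

# The engine then turns the two characters \ and n into "match a newline".
my $matches_newline = ("a\nb" =~ $re_newline) ? 1 : 0;

print "$re_newline\n$re_upper\n";
print "kept=$kept_backslash_n upper=$resolved_upper newline=$matches_newline\n";
```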

    Note the difference this can make depending on whether it's a regex literal or a string literal:

        $foo =~ /\b/;    # matches a word boundary
        $foo =~ "\b";    # matches a backspace
    Similarly,
        $s = '\n';       # note that's 2 chars, not a newline
        "\n" =~ /$s/;    # matches

    Dave.

Re: left side of pattern matching
by LanX (Saint) on Mar 13, 2012 at 17:34 UTC
    I think that your confusion comes from escaping the pipe, but since it is within a character-class [...] you don't need to escape it so it's the same both ways.

        DB<100> $hello = "\n |\t\r";
         => "\n |\t\r"
        DB<101> $hello =~ m/^[ \|\t\r\n]+$/    # matches
         => 1
        DB<102> $hello =~ m/^[ |\t\r\n]+$/     # matches too
         => 1
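
    For contrast, here is a small sketch of where the backslash before the pipe does matter. The test strings are made up for illustration; the point is only that | is literal inside [...] and means alternation outside it.

```perl
use strict;
use warnings;

# Inside a character class, | is a literal pipe, escaped or not.
my $class_escaped   = ('|' =~ /^[\|]$/) ? 1 : 0;
my $class_unescaped = ('|' =~ /^[|]$/)  ? 1 : 0;

# Outside a class, an unescaped | means alternation...
my $alternation = ('b' =~ /^(?:a|b)$/) ? 1 : 0;

# ...so matching a literal pipe there requires the backslash.
my $literal_pipe = ('a|b' =~ /^a\|b$/)     ? 1 : 0;
my $not_literal  = ('a|b' =~ /^(?:a|b)$/)  ? 1 : 0;

print "class: $class_escaped $class_unescaped\n";
print "outside: $alternation $literal_pipe $not_literal\n";
```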

    Cheers Rolf

Re: left side of pattern matching
by furry_marmot (Pilgrim) on Mar 13, 2012 at 20:13 UTC

    They are "converted" (interpolated, actually), but you are matching apples to apples, so it doesn't matter. First, a backslash sequence such as \n is simply a representation of an "invisible" character. It's not as simple as "removing the backslash".

    Context is also very important. In a double-quote interpolated string, a pipe '|' is just a pipe. But in a regex, outside a character class, it takes on the meaning of OR in a match. Here's the string you created, translated a bit:
        $hello = "<newline><space><pipe><tab><carriage return>"
    I think this is exactly what you expected. Then you set up a match, wherein you ask:
    Does $hello contain one or more of [<space><pipe><tab><carriage return><newline>], which start at the beginning of the string, and end at the end of the string (this is called anchoring, by the way)?

    And yes it does, as noted above. I'm not sure why you would think it wouldn't match. Regexes are interpolated similarly to quoted strings, but there are additional meanings within a regex that also have to be considered.

    Instead of worrying about passes of interpretation or whatever, just think about what each special character means in the context in which you are using it.

    --marmot

Node Type: perlquestion [id://959401]
Approved by Corion
Front-paged by Corion