PerlMonks  

Common Regex Gotchas

Perl novices often stumble over a few gotchas when first learning regular expressions. Learning the whys and the workarounds could save you hours of frustration.

Greediness

Perl's regex engine likes to match the longest string possible, by default. This is described as greediness. Most people don't think that way, at least when looking at text. Given the following string and regex, what will be in $1?

    my $data = "<tag>this is a line of code</tag> <explanation>this is where I wax poetic about my code</explanation> <tag>this is another example of code</tag>";
    if ($data =~ /<tag>(.*)<\/tag>/s) {
        print "I found =>$1<=\n";
    }

If you said "this is a line of code", you're thinking the same thing most people do. Unfortunately, that's not the way Perl thinks:

I found =>this is a line of code</tag> <explanation>this is where I wax poetic about my code</explanation> <tag>this is another example of code<=

The secret lies in the mysterious asterisk (match zero or more of the preceding). When the engine hits .*, it jumps ahead to the end of the string and tries to match the next character in the pattern -- the < character. Since the last character in the string is >, the match fails, and the engine backtracks a character. This continues through g, a, t, and /, until it finally reaches the final < in $data, where the rest of the pattern matches.

Knowing that, you now understand the danger of greediness (and, hopefully, also why parsing HTML with a regex can be tricky). The solution is very simple:

    if ($data =~ /<tag>(.*?)<\/tag>/s) {
        print "I found =>$1<=\n";
    }

Putting a ? after a normally greedy quantifier (such as * or +) tells the engine to take as little as possible: instead of grabbing the longest string, it stops at the first point where the rest of the pattern can match.
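
To see the two behaviours side by side, here is a small sketch that runs both patterns against the same $data. The variable names ($greedy, $lazy, @all) are mine, and the /g variant that collects every tagged section goes slightly beyond the example above; treat it as an illustration rather than the one true way.

```perl
use strict;
use warnings;

my $data = "<tag>this is a line of code</tag> "
         . "<explanation>this is where I wax poetic about my code</explanation> "
         . "<tag>this is another example of code</tag>";

# Greedy: .* runs to the end of the string, then backtracks to the LAST </tag>.
my ($greedy) = $data =~ /<tag>(.*)<\/tag>/s;

# Non-greedy: .*? stops at the FIRST </tag> that lets the pattern succeed.
my ($lazy) = $data =~ /<tag>(.*?)<\/tag>/s;

# With /g in list context, the non-greedy form returns every tagged section.
my @all = $data =~ /<tag>(.*?)<\/tag>/sg;

print "greedy => $greedy\n";
print "lazy   => $lazy\n";
print "all    => $_\n" for @all;
```

The greedy capture swallows the whole explanation; the non-greedy one stops at "this is a line of code".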

Specifying Too Much

This gotcha is more stylistic, but it can come back to haunt you later. Remember that regular expressions can be somewhat vague -- you don't have to specify the entire line, if you're only looking for a certain portion. Suppose that you want to find the word Serial, followed by a colon and then a nine-digit number. The data lines might look like this:

    my $line = "Name: Some Soldier, Rank:  Leftenant, Serial: 426879824, Boots:  black";

A regex novice might bite off more than he could chew with the following:

    $line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$/;

If all you're interested in is the Serial number, only ask for that. It'll make your regex simpler, and it will handle deviations from what you think the line ought to look like. (That happens more often than you want to think.)

    $line =~ /[Ss]erial: (\d{9})/;

Caveat: There are good reasons to break my Rule of Simplicity. Performance is one, and error handling is another. Be sure that the code works first, though, then try to make it tricky.
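
A quick sketch of how the two patterns behave in practice. It assumes the sample line exactly as printed above, including the two spaces after "Rank:" and "Boots:"; the single-spaced $tidy line is a hypothetical variant I added to show what (\d)* actually captures when the long pattern does match.

```perl
use strict;
use warnings;

my $line = "Name: Some Soldier, Rank:  Leftenant, Serial: 426879824, Boots:  black";

# Over-specified pattern from the text: every field must look exactly as expected.
my $long = qr/^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$/;
my $strict_ok = $line =~ $long ? 1 : 0;

# Targeted pattern: ask only for the serial number.
my ($serial) = $line =~ /[Ss]erial: (\d{9})/;

# Hypothetical single-spaced variant of the same line.
my $tidy = "Name: Some Soldier, Rank: Leftenant, Serial: 426879824, Boots: black";

# A quantified group such as (\d)* keeps only its LAST repetition.
my ($captured) = $tidy =~ $long;

print "long pattern matched the sample line: ", ($strict_ok ? "yes" : "no"), "\n";
print "serial from the short pattern: $serial\n";
print "what (\\d)* captured on the tidy line: $captured\n";
```

The extra space after "Rank:" is enough to make the long pattern fail outright, and even when it matches, (\d)* holds a single digit instead of the whole number. The short pattern does not care about either problem.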

Special Characters

Don't forget that certain characters (like ., *, +, and ?) have special meanings within regular expressions, and that / normally marks the end of the pattern. If you don't have a Unixy background (where escaping characters with a backslash is a little more common), you might write something like this, and stare at it in confusion for a while:

    $line =~ /<title>(.*?)</title>/;

Hmm. The unescaped / in </title> ends the pattern early. Check the perlre page for the skinny on exactly which characters have special meaning. Also be aware that choosing alternate delimiters can help out, as well as being more visually appealing:

    $line =~ m!<title>(.*?)</title>!;

One other caveat is that, within a character class, most of these characters lose their special meaning. Here . and * are literal, so the pattern captures the first two consecutive characters that are neither a dot nor an asterisk:

    my $line = "a.b.cd*f.";
    $line =~ /([^.*]{2})/;
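
The options can be compared in one runnable sketch. The $html and $literal strings are made-up sample inputs, and the \Q...\E escape (Perl's quotemeta) is an extra that the text above does not mention; it is handy when the text you want to match literally arrives in a variable.

```perl
use strict;
use warnings;

# Hypothetical sample input.
my $html = "<html><title>Common Regex Gotchas</title></html>";

# Option 1: keep / as the delimiter and escape the slash inside the pattern.
my ($escaped) = $html =~ /<title>(.*?)<\/title>/;

# Option 2: pick a delimiter that does not appear in the pattern.
my ($alt) = $html =~ m!<title>(.*?)</title>!;

# Option 3: \Q...\E escapes metacharacters in interpolated text.
my $literal = "1+1=2?";    # hypothetical text containing + and ?
my $found   = "is 1+1=2? yes" =~ /\Q$literal\E/ ? 1 : 0;

# Inside a character class, . and * are literal.
my $line = "a.b.cd*f.";
my ($pair) = $line =~ /([^.*]{2})/;

print "escaped slash:       $escaped\n";
print "alternate delimiter: $alt\n";
print "quotemeta matched:   $found\n";
print "character class:     $pair\n";
```

The first two captures are identical, and the character class picks out "cd", the first run of two characters containing neither a dot nor an asterisk.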

Simple Substitutions

Want to make sure user input is completely uppercased? Here's one approach:

    my $input = "foo bar baz";
    $input =~ s/(\w+)/uc($1)/ge;

While that works, it's serious overkill. Even a less picky approach is sub-optimal:

    $input =~ s/([A-Za-z]+)/uc($1)/ge;

Don't forget about the friendly tr/// operator -- it's made for simple substitutions like this. (Of course, if you're working with a locale different than simple English text, you're out of luck.)

    $input =~ tr/a-z/A-Z/;
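
Here is a minimal side-by-side sketch of the two approaches. The copy-then-modify idiom and the variable names ($via_regex, $via_tr, $count) are my own additions, used only so that $input stays untouched for comparison.

```perl
use strict;
use warnings;

my $input = "foo bar baz";

# Regex approach from the text: capture each word and run uc() on it via /e.
(my $via_regex = $input) =~ s/(\w+)/uc($1)/ge;

# Transliteration: map a-z onto A-Z character by character, no regex engine.
(my $via_tr = $input) =~ tr/a-z/A-Z/;

# tr/// returns how many characters it matched; with an empty replacement
# list it only counts and leaves the string unchanged.
my $count = ($input =~ tr/a-z//);

print "$via_regex\n";
print "$via_tr\n";
print "$count lowercase letters in the original\n";
```

Both produce the same uppercased string for plain ASCII input; tr/// simply gets there without compiling or running a pattern.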

Regular expressions give you a lot of power at the cost of some speed. Don't get out the chainsaw when a penknife will do.

Update: a few small corrections.


In reply to Common Regex Gotchas by chromatic
