PerlMonks  

Regex Misuse

by srawls (Friar)
on May 14, 2001 at 04:34 UTC (#80126=perlmeditation)

I know most of you already know this, but I have seen a lot of misuse of regexes, so I thought I'd write this for beginners to read.

Many people get 'regex happy,' and use them when one could use a much faster function. Here's an example:

m/.{$width}/
This is some advice a notable Perl monk gave someone asking how to implement fixed-width columns for data files. The goal was to extract $width characters from a variable. More experienced programmers are shaking their heads right now; they know it would be much more efficient to use this:
substr($someVar,$offSet,$width)
You see, regexes are a very powerful tool, but they are not fast (well, relatively speaking). It is much faster to say, "take this many characters from this variable, starting at this position in the string," than it is to say, "take the input and see if '.' matches the next character, and then repeat that this number of times." Also, the regex has to be compiled and then run, and all of that is extra work that substr avoids (read: not fast).
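
To make the comparison concrete, here is a minimal sketch of both approaches pulling the same field out of a line. The sample data and the variable names ($line, $offset) are made up for illustration; the regex is anchored and given an explicit skip so that it extracts from the same position substr does.

```perl
use strict;
use warnings;

# Hypothetical sample record: a 10-character name, a 4-character id, a 3-character city.
my $line   = sprintf("%-10s%-4s%-3s", 'ALICE', '0042', 'NYC');
my $offset = 10;
my $width  = 4;

# Regex version: skip $offset characters, then capture the next $width characters.
# The /s flag lets "." match a newline too, so the count is never thrown off.
my ($via_regex) = $line =~ m/^.{$offset}(.{$width})/s;

# substr version: go straight to the offset and take $width characters.
my $via_substr = substr($line, $offset, $width);

print "regex:  $via_regex\n";
print "substr: $via_substr\n";
```

Both variables end up holding the same four characters. If you want to measure the difference on your own data rather than take it on faith, the core Benchmark module's cmpthese function can time the two side by side.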

Another common thing is to try to match the whole string, when you only need to match part of it. Here are some examples:

$text =~ s/(.*)($string_1)(.*)($string_2)(.*)/$1$4$3$2$5/;  # taken from a recent post
if ($text =~ m/^.*(\.txt)$/)  # used to see if $text ends in .txt
The first example (actually taken from this site) can be improved by taking out the first and last (.*): we don't need to match the beginning and end, because that's not what we are switching around (the regex is used to swap string_1 and string_2). The second regex (just an example I made up now) checks whether a file name ends in .txt. It is extremely wasteful; we only need to match the .txt part, not the whole string. The improved regexes are below:
$text =~ s/($string_1)(.*)($string_2)/$3$2$1/;
if ($text =~ m/\.txt$/)
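
Here is a small runnable sketch of the two trimmed-down regexes. The sample strings are invented, and the \Q...\E quoting is my own addition: it assumes $string_1 and $string_2 are meant as literal text rather than as patterns.

```perl
use strict;
use warnings;

# Hypothetical values for illustration.
my $string_1 = 'foo';
my $string_2 = 'bar';
my $text     = 'start foo middle bar end';

# Swap the two strings. Only the part being rearranged is matched;
# \Q...\E quotes any regex metacharacters the strings might contain.
(my $swapped = $text) =~ s/(\Q$string_1\E)(.*)(\Q$string_2\E)/$3$2$1/s;
print "$swapped\n";

# Check for a .txt extension by matching only the tail of the string.
for my $name ('notes.txt', 'notes.txt.bak') {
    my $verdict = $name =~ m/\.txt$/ ? 'ends in .txt' : 'does not end in .txt';
    print "$name $verdict\n";
}
```

One thing to keep in mind: $ also matches just before a trailing newline, so if that matters for your input, \z is the stricter anchor.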

There are probably other common mistakes, but I'm just writing about things I have recently seen, off the top of my head. If someone else wants to post a common mistake in a reply to this, to try to help the beginners, I would very much appreciate that.

The 15 year old, freshman programmer,
Stephen Rawls

Re: Regex Misuse
by dws (Chancellor) on May 14, 2001 at 08:25 UTC
    More experienced programmers are shaking their heads right now, they know it would be much more efficient to ...

    The truly experienced programmer doesn't conflate efficiency with effectiveness, but that's worth a separate discussion.

    Using a regex to parse fixed-width fields does indeed sound like overkill, but using substr isn't necessarily the most effective alternative. Splitting a record with multiple fixed-width fields requires multiple calls to substr, along with cascading math to make sure that each substr starts at the right place. Contrast that with a regexp-based solution, which requires a single invocation of a regexp and no additional bookkeeping. And then there's the oft-overlooked unpack.
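
The bookkeeping difference can be sketched as follows. The record layout and the names (@widths, $record) are hypothetical; the point is only that the substr loop has to carry a running offset, while the regex is built once from the widths and applied in a single match.

```perl
use strict;
use warnings;

# Hypothetical record layout: name (10), id (4), city (3).
my @widths = (10, 4, 3);
my $record = sprintf("%-10s%-4s%-3s", 'ALICE', '0042', 'NYC');

# substr approach: one call per field, plus a running offset to maintain.
my @by_substr;
my $pos = 0;
for my $w (@widths) {
    push @by_substr, substr($record, $pos, $w);
    $pos += $w;
}

# Regex approach: build one pattern from the widths and match once.
my $pattern  = join '', map { "(.{$_})" } @widths;
my @by_regex = $record =~ m/^$pattern/s;

print join('|', @by_substr), "\n";
print join('|', @by_regex),  "\n";
```

Both lists hold the same three fields, padding included.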

    The psgrep example in The Perl Cookbook demonstrates a method for separating fixed-width fields by constructing a format string for unpack. It also demonstrates an elegant way of documenting how the format string is assembled.
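
This is not the Cookbook's code, just a minimal sketch of the same general idea, assuming a made-up field table (@fields) of name and width pairs. The unpack template is assembled from that table, so the layout is documented in one place.

```perl
use strict;
use warnings;

# Hypothetical field table: [name, width] pairs, in record order.
my @fields = ( [ name => 10 ], [ id => 4 ], [ city => 3 ] );

# Build the unpack template from the table: "A10 A4 A3".
# "A" extracts a string of the given width and strips trailing spaces.
my $template = join ' ', map { "A$_->[1]" } @fields;

my $record = sprintf("%-10s%-4s%-3s", 'ALICE', '0042', 'NYC');

# A hash slice assigns every field in one unpack call.
my %row;
@row{ map { $_->[0] } @fields } = unpack($template, $record);

print "$_ = '$row{$_}'\n" for map { $_->[0] } @fields;
```

Unlike the substr and regex versions, the "A" format drops the trailing padding for you; use "a" instead if you need the fields untrimmed.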

Re: Regex Misuse
by jepri (Parson) on May 14, 2001 at 11:41 UTC
    The goal was to extract $width amount of characters from a variable. More experienced programmers are shaking their heads right now, they know it would be much more efficient to use this: substr($someVar,$offSet,$width)

    And unpack would be even better yet.

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.

      I thought it was funny how he first labelled one way of doing things as a 'mistake', and then proceeded to make the 'mistake' himself:

      if ($text =~ m/\.txt$/)

      Why not use substr ($text, -4, 4); to extract the .txt, and then eq to compare it?

      The reason is, of course, that we write code for readability, not efficiency, for the most part. See "obfuscation" :)

        Well, what I said was that
        m/.{$width}/
        could be better written with substr, and still be readable. I have seen from replies to my post that unpack is better yet. I just made the $text =~ /\.txt$/ regex example up as I was writing the post, so it may not be the best, but the point I wanted to get at is that if you don't need to match the whole string, then don't. For example:
        $text =~ m/.*$endPat$/
        can be better written (and still readable) as:
        $text =~ m/$endPat$/
        And as you have pointed out, the above code can be written this way (much less readable; I wouldn't recommend using this):
        substr($text,0-length($endPat)) eq $endPat
        On a last note, I never said it was a mistake not to use substr(); I just wanted to point out that in some cases it is better than using a regular expression.
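
If the substr form is wanted anyway, wrapping it in a small helper gets most of the readability back. This is only a sketch; ends_with is a made-up name, and it compares $suffix as literal text, whereas m/$endPat$/ treats $endPat as a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text ends with the literal string $suffix.
sub ends_with {
    my ($text, $suffix) = @_;
    my $len = length $suffix;
    return 1 if $len == 0;               # every string ends with the empty string
    return 0 if $len > length $text;     # too short to contain the suffix
    return substr($text, -$len) eq $suffix ? 1 : 0;
}

for my $name ('notes.txt', 'notes.txt.bak', 'txt') {
    my $verdict = ends_with($name, '.txt') ? 'yes' : 'no';
    print "$name: $verdict\n";
}
```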

        The 15 year old, freshman programmer,
        Stephen Rawls
