PerlMonks  

Regex Critic?

by QM (Vicar)
on Nov 18, 2013 at 15:28 UTC (#1063125, perlquestion)
QM has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a regex critic or list of traps or otherwise "(Almost) Never Do This in Regex" lists.

In our $shop we have a static analyzer for several languages, and one thing I noticed is it doesn't catch odd stuff like m/^.*blah.*$/, and other bad practices (it doesn't even try). Perl Critic doesn't seem to do this either.

I don't really need something automated, though that would be nice. A collection of "Bad Programmer" examples would be appreciated.

Any good leads for this? Perhaps a bolt-on module for YAPE::Regex?

-QM
--
Quantum Mechanics: The dreams stuff is made of

Re: Regex Critic?
by toolic (Chancellor) on Nov 18, 2013 at 15:47 UTC
    A couple of comments:
    • YAPE::Regex has not kept up with recent Perl versions. Here is a comment from YAPE::Regex::Explain LIMITATIONS: "There is no support for regular expression syntax added after Perl version 5.6, particularly any constructs added in 5.10". It can be used, just not for catching newer regex features.
    • Perl::Critic can be extended. Refer to: EXTENDING THE CRITIC
      Thanks. Good point about YAPE::Regex, I hurried past that on the first reading.

      I'll look into Perl::Critic extensions, as we already specify using it for production code.

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Regex Critic?
by LanX (Canon) on Nov 18, 2013 at 21:30 UTC
    Not that easy.

    There are different categories of "bad practice" like unmaintainable code or dangerous code.

    It's possible to build little regexes which run for years just because backtracking goes exponential in Perl but not in egrep.

    So this depends on the engine used and the optimizations involved. (So much for a "static analyzer for several languages".)
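    The usual shape behind that kind of blow-up is a quantifier nested inside another quantifier. Below is a minimal sketch, assuming a made-up pattern `^(?:a+)+$` that does not appear in this thread. The inputs are deliberately tiny so it finishes instantly on any engine; how a long near-miss input behaves depends on the engine and its optimizations, as said above. The point is only that the nested form and the flat form accept the same strings, so the nesting buys nothing.

```perl
use strict;
use warnings;

# Hypothetical example patterns (not from the thread).
my $nested = qr/^(?:a+)+$/;   # quantifier inside a quantifier
my $flat   = qr/^a+$/;        # same set of strings, one quantifier

for my $input ('a', 'aaaa', 'aaab', '', 'baaa') {
    my $n = $input =~ $nested ? 1 : 0;
    my $f = $input =~ $flat   ? 1 : 0;
    printf "%-6s nested=%d flat=%d\n", "'$input'", $n, $f;
}
```

    A reviewer who sees the nested form can usually ask for the flat one without changing what the regex accepts.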

    At the same time, the code you showed, m/^.*blah.*$/, might be odd but is not necessarily slower, because AFAIK Perl will first try to find the string "blah" somewhere before continuing.

    And it's NOT IDENTICAL to m/blah/, because multiline strings won't match without the /s modifier.

    DB<100> $str="\n blah \n"
     blah
    DB<101> $str =~ /blah/
     => 1
    DB<102> $str =~ /^.*blah.*$/
    DB<103> $str =~ /^.*blah.*$/s
     => 1
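    The same comparison as a standalone script, for anyone without a debugger session at hand. This is a sketch: the `hit` helper is a hypothetical name, and the /m row is an addition that is not in the session above.

```perl
use strict;
use warnings;

my $str = "\n blah \n";   # same string as in the debugger session

# Hypothetical helper: returns 1 or 0 so the results print cleanly.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0 }

my %result = (
    'blah'          => hit($str, qr/blah/),
    '^.*blah.*$'    => hit($str, qr/^.*blah.*$/),
    '^.*blah.*$ /s' => hit($str, qr/^.*blah.*$/s),
    '^.*blah.*$ /m' => hit($str, qr/^.*blah.*$/m),   # added case, not from the thread
);

printf "%-16s %d\n", $_, $result{$_} for sort keys %result;
```

    Without a modifier, ^ anchors only at the start of the string and . refuses to cross the leading newline, so the anchored form fails where plain /blah/ succeeds.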

    So what's the recommended best practice here in your opinion?

    Regarding analyzers, you might be interested in this recent discussion, which had pointers to different projects and older discussions:

    Parsing and translating Perl Regexes

    HTH! =)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Thanks for the pointers. I've read somewhere else, for instance, a recommendation to always use /ms unless there's a good reason not to. Also educated_foo's suggestion below for /a and /aa.

      Ultimately I'd like to collect a reasonable list for Perl REs, and where necessary, extend/amend that for non-Perl.
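      One way such a list could start is as plain data that a script applies to pattern text. This is a toy sketch, assuming the pattern source is available as a string; the rule names, checks, advice, and the `critique` function are all hypothetical. It does not parse Perl, ignores escapes and character classes, and is not a Perl::Critic policy.

```perl
use strict;
use warnings;

# Hypothetical starter rules: [ name, check against the pattern text, advice ].
my @rules = (
    [ 'leading ^.*',       qr/\A\^\.\*/,           'drop it unless the whole line must be consumed' ],
    [ 'trailing .*$',      qr/\.\*\$\z/,           'drop it unless the whole line must be consumed' ],
    [ 'nested quantifier', qr/\([^()]*[+*]\)[+*]/, 'may backtrack heavily on some engines' ],
);

# Takes the pattern source as a plain string; returns the rules it trips.
sub critique {
    my ($source) = @_;
    return grep { $source =~ $_->[1] } @rules;
}

for my $source ('^.*blah.*$', 'blah', '^(a+)+$') {
    my @hits = critique($source);
    print "$source\n";
    print "    $_->[0]: $_->[2]\n" for @hits;
    print "    ok\n" unless @hits;
}
```

      Keeping the rules in a table means engine-specific entries for non-Perl regexes can be added or removed without touching the checking code.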

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Re: Regex Critic?
by educated_foo (Vicar) on Nov 18, 2013 at 22:06 UTC
    PBP probably has some tips, but other than that, I can't think of anything off the top of my head, neither a program nor a list. To start one...
    • ".*" is greedy, which will probably do something you don't expect in the future, so think about how you're using it.
    • be careful with "^" and "$" when not doing line-oriented string processing. But don't sweat it -- most of the time you are, so they do the right thing.
    • be explicit when you might have to touch Unicode, e.g. "\d" matches all kinds of junk that the rest of Perl doesn't think is a digit. IMHO, all regexes should have the new "/aa" modifier on them, unless you know the reason they shouldn't.
    • ...
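    The first two points can be seen in a few lines. A minimal sketch with made-up input strings, none of which come from this thread:

```perl
use strict;
use warnings;

# Greedy vs. non-greedy on a made-up markup string.
my $html = '<b>one</b> and <b>two</b>';
my ($greedy) = $html =~ m{<b>(.*)</b>};
my ($lazy)   = $html =~ m{<b>(.*?)</b>};
print "greedy: $greedy\n";   # one</b> and <b>two
print "lazy:   $lazy\n";     # one

# '$' also matches just before a trailing newline; \z only matches at the very end.
my $line   = "foo\n";
my $dollar = $line =~ /^foo$/   ? 1 : 0;
my $strict = $line =~ /\Afoo\z/ ? 1 : 0;
print "dollar=$dollar strict=$strict\n";   # dollar=1 strict=0
```

    For line-oriented input the forgiving behaviour of $ is usually what you want; \A and \z are the ones to reach for when validating a whole string.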
      Thanks for the comments. .* always sets off the alarm in my head, as it is very rarely necessary. Similarly, constructs like m/^.*blah.*$/ are overkill in many cases, and make it harder to see what's doing the real work.

      I've had a look at the /aa modifiers, and I think that might make our list. And the use re '/aa' pragma might be the best way to do that for our code.
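      A minimal sketch of what the pragma changes, assuming Perl 5.14 or later. The sample character is an arbitrary pick for illustration; any non-ASCII Unicode digit would do.

```perl
use strict;
use warnings;
use 5.014;   # the /a and /aa modifiers and use re '/flags' need Perl 5.14+

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE, a Unicode digit

# Default rules: \d matches any Unicode decimal digit.
my $default = $arabic_three =~ /^\d$/ ? 1 : 0;

my ($scoped, $ascii);
{
    use re '/aa';   # lexically scoped: applies to regexes compiled in this block
    $scoped = $arabic_three =~ /^\d$/ ? 1 : 0;
    $ascii  = '3'           =~ /^\d$/ ? 1 : 0;
}

print "default=$default scoped=$scoped ascii=$ascii\n";   # default=1 scoped=0 ascii=1
```

      Because the pragma is lexical, it can sit at the top of each file without affecting regexes compiled elsewhere, for example inside modules the file loads.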

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

Node Type: perlquestion [id://1063125]
Front-paged by Corion