Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Analyzing regular expression performance

by Sprad (Hermit)
on May 12, 2006 at 14:36 UTC ( [id://549007]=perlquestion: print w/replies, xml ) Need Help??

Sprad has asked for the wisdom of the Perl Monks concerning the following question:

I've been working with some pretty hairy regexes lately, and sometimes I see performance problems. Certain input with certain expressions can take a very long time to process.

Are there tools or general methods for breaking down an expression and finding out where slowdowns are happening? If you have a poorly-performing expression, how do you tune it?

---
A fair fight is a sign of poor planning.

  • Comment on Analyzing regular expression performance

Replies are listed 'Best First'.
Re: Analyzing regular expression performance
by kvale (Monsignor) on May 12, 2006 at 15:06 UTC
    One of the best cures for slow regexes is to improve your understanding of how regexes work. I suggest perlretut as a starter and "Mastering Regular Expressions" published by O'Reilly.

    A common problem causing exponential blowup of execution time is nested quantifiers:

    (a*)+ (a+)+ (a*)* (a+)* etc.
    There are an exponential number of ways to split up a string of a's into inner and outer quantifier bits, and if the string to match has stuff past that string of a's that can't match, it will try every possibility in the effort to match. Hence the blowup. The solution is to rewrite your regexes to not have nested quantifiers.

    Update: There does exist a directive for debugging regexes (see use re 'debug' in perlretut), but it won't make sense until you have a good understanding of how regexes work.

    -Mark

      That would definitely explain at least some of it. The $sp and $words regexes posted above have that structure.

      ---
      It's all fine and dandy until someone has to look at the code.
        Changing the $sp regex to (?:\s*) cut runtime to almost nothing, so it looks like that's where my problem is.

        I do need to allow comments anywhere in the string, though -- including having comments right after comments. Is there a cleaner way of doing that?

        ---
        A fair fight is a sign of poor planning.

Re: Analyzing regular expression performance
by grinder (Bishop) on May 12, 2006 at 15:31 UTC
    Are there tools or general methods for breaking down an expression

    The 'debug' mode of re will let you see how the expression is compiled, and how it behaves when matching. It is usually obvious to see where backtracking is taking place.

    #! /usr/bin/perl use strict; use re 'debug'; print "easy\n" if 'simple' =~ /simple/; print "hard\n" if 'simple' =~ /^.[i-k]\lM(?:pla|plo|pl)(?:oink)?e$/;

    Another trick is to "leave crumbs" is the pattern, to see how often the engine runs over a particular section of the pattern, for instance:

    perl -le '(q(x) x 100) =~ /^x(((((x{2,5})x{2,5})x{2,4})x?)x?(?{$hit++} +))$/; print $hit'

    • another intruder with the mooring in the heart of the Perl

Re: Analyzing regular expression performance
by ptum (Priest) on May 12, 2006 at 15:07 UTC

    Most of the time, when I've had trouble with regex performance, it has been because I was trying to use the regex to do more than it needed to do, particularly with nested data. Usually a better approach is to break the input up into discrete, manageable tokens and then apply my 'hairy regex' in a more controlled and targeted manner.

    Of course, that doesn't really answer your question. But it may make your question moot ... :)


    No good deed goes unpunished. -- (attributed to) Oscar Wilde
Re: Analyzing regular expression performance
by mantadin (Beadle) on May 12, 2006 at 15:30 UTC
    to analyse a regex, you can use YAPE::Regex::Explain.

    update: one example: perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)/)->explain'

    For the performance, one thing that comes to my mind is the /o qualifier to compile once only. And if You still suffer from perfomance issues, You might want to use a debugger and/or profiler. But in that case, maybe You should first try to 'divide and conquer' the regexp into a few smaller ones.

      For the performance, one thing that comes to my mind is the /o qualifier to compile once only

      Yikes! Don't ever recommend that. It's rarely needed and causes more trouble than it is worth. Perl already does optimizations for patterns that contain no variables and for patterns that do contain variables that don't vary over the lifetime of the code in which the pattern appears, so /o won't buy you anything there. If you really need the "compile once" idea, just use qr/.../

        what You say makes sense to me. At least the 'wont buy you anything' part. What I dont understand: how can the /o cause trouble?
Re: Analyzing regular expression performance
by kwaping (Priest) on May 12, 2006 at 15:02 UTC
    I don't have an answer to your question directly, but posting some examples of these "hairy regexes" (great mental picture) might help with other replies here.

    ---
    It's all fine and dandy until someone has to look at the code.
      Here's the one that inspired me to ask:

      $sql = " create table #foo (variable1 char(20), variable2 char(30) null) /* comments make it slow */ "; my $comment = qr/(?:\/\*.*?\*\/)/; # Simple +match of C-style comment my $sp = qr/(?:\s*$comment?\s*)*/; # This ju +st looks for /* comments */, with whitespace on either side. # Sinc +e comments can be anywhere, use this anywhere whitespace is allowed. # The parens don't capture +, so it won't affect your $1, $2 counts. my $header = qr/create${sp}table${sp}\#?\w+${sp}\(${sp}/i; # "create + table foo (", where table name may have a # mark my $name = qr/[\#\w]+${sp}/; # Variabl +e name, can contain # mark my $type = qr/\w+${sp}/; # Variabl +e type (e.g. 'char', 'number') my $op_char = qr/(?:\([\d\s,]+\)${sp})?/; # Bounds +for array types, can be multidimensional (e.g. '(10, 20)', '(30)') my $words = qr/(?:[\w]+(?:\([^\)]*\))?[${sp})]*)*/; # Text af +ter variable type -- may be 'null', 'not null', a 'check' block, etc. my $subst = qr/${sp}\{[^\}]+\}${sp}/; # Substit +ution placeholder -- some text in {curlies} my $decl = qr/($name $type $op_char $words)/x; # A compl +ete variable declaration my $comma = qr/,${sp}/; # Literal + comma, with possible trailing space my $rest = qr/\)${sp}/; # Closing + the block with a close-paren if ($sql =~ / ($header) ((?: $decl $comma | $subst $comma | $subst)* (?: $decl | $subst)) ($rest) /x) { print "Match!\n"; print "1: >>>$1<<<\n"; print "2: >>>$2<<<\n"; }
      It takes about 45 seconds on my box. The longer the comment, the longer the runtime.

      ---
      A fair fight is a sign of poor planning.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://549007]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (3)
As of 2025-05-23 02:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.