Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

How do I optimize a regular expression?

by kyle (Abbot)
on Dec 07, 2009 at 15:54 UTC ( #811525=perlquestion: print w/ replies, xml ) Need Help??
kyle has asked for the wisdom of the Perl Monks concerning the following question:

My general question is how to improve the performance of a regular expression. I have a fairly simple log analysis program, and I've profiled it, and it spends more time matching the regular expressions that I've made for it than anything else. This may be because my input is rather large, and my "analysis" is to feed it to Text::CSV.

In any case, it occurred to me that I have no idea how to improve matching performance. I'm assuming there is not something like Devel::NYTProf for regular expressions. I don't know much about how the engine works under the hood, so I can't make any guesses about what might be taking the most time. I know that it's possible to write an expression that takes a year and a day to match, but I don't know the particulars.

The actual expressions I'm using are in the readmore block below. My input files are Apache logs in a somewhat customized format.

# Things like ( (?:internal_ip){0} . ) were originally Perl 5.10 # named captures like (?<internal_ip> . ), but I don't # expect to have 5.10 on the target. # [04/Oct/2009:06:25:20 -0500] my $time_rx = qr{ \[ ( (?:day){0} \d\d ) / ( (?:mon){0} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec ) / ( (?:year){0} \d{4} ) : ( (?:hour){0} \d\d ) : ( (?:min){0} \d\d ) : ( (?:sec){0} \d\d ) \s ( (?:tz){0} [+-] \d{4} ) \] }xms; my @time_rx_fields = qw( day mon year hour min sec tz ); my $request_rx = qr{ \" ( (?:method){0} [A-Z]+ ) \s ( (?:url){0} .*? ) \s ( (?:protocol){0} HTTP/1.\d ) \" }xms; my @request_rx_fields = qw( method url protocol ); my $ip_rx = qr{ [12]?\d?\d (?: \. [12]?\d?\d ){3} }xms; my $pid_log_rx = qr{ ^ ( (?:user){0} \S+ ) \s $time_rx \s $request_rx \s ( (?:status){0} \d{3} ) \s ( (?:bytes){0} - | \d+ ) \s \[ ( (?:pid){0} \d+ ) \] \s* $ }xms; my @pid_log_rx_fields = ( 'user', @time_rx_fields, @request_rx_fields, 'status', 'bytes', 'pid' ); my $frontend_log_rx = qr{ ^ ( (?:external_ip){0} $ip_rx ) \s* , \s* ( (?:internal_ip){0} $ip_rx ) \s ( (?:group){0} \S+ ) \s ( (?:user){0} \S+ ) \s $time_rx \s $request_rx \s ( (?:status){0} \d{3} ) \s ( (?:bytes){0} - | \d+ ) \s ( (?:ms){0} \d+ ) \s* $ }xms; my @frontend_log_rx_fields = ( 'external_ip', 'internal_ip', 'group', 'user', @time_rx_fields, @request_rx_fields, 'status', 'bytes', 'ms' ); # Those get used here: sub line_in { my ( $fh, $rx, $fields_ref ) = @_; return if ref $fields_ref ne ref []; my $line = <$fh>; return if ! defined $line; my @captures = $line =~ $rx; return if ! @captures; return { line => $line, map { $fields_ref->[$_] => $captures[$_] } 0 .. $#{$fields_ref} }; }

I'd welcome any suggestions on this particular problem, but I'm really interested in the more general question of how I could answer this question myself. Are there heuristics you've learned? Is there a tutorial on this topic? Any guidance would be appreciated!

Comment on How do I optimize a regular expression?
Download Code
Re: How do I optimize a regular expression?
by rjberry (Novice) on Dec 07, 2009 at 16:07 UTC

    You want to read Mastering Regular Expressions by Jeffrey E.F. Friedl. He dedicates chapters to this, and it's an excellent book.

      Thanks! I've heard great things about that book in the past but I never had a good reason to dive into it. Maybe it's finally time to pick it up.

Re: How do I optimize a regular expression?
by moritz (Cardinal) on Dec 07, 2009 at 16:29 UTC
    There are a few things you can try:

    (?:thing){0} seems to be a poor man's comment. I'd use an ordinary #...\n comment instead, or (?#...) comments.

    Second thing: split on whitespaces as far as possible, and then check the individual fields with anchored regexes

    You can also use (?>...) non-backtracking groups for things that don't have to backtrack. That will make regexes fail faster if they can't match.

    If you have the choice, use perl 5.10 or newer, it has a an awesome optimization for alternatives of literals.

      You're correct about my poor man's comments. I didn't know about (?# ... ) comments or I'd have used that. Thanks!

      To split on white space might be a good idea for this application, and I may try that, but I'm more interested in how to make expressions work faster.

      I don't really understand how backtracking works, so I don't know when (?> ... ) can be used. I probably ought to spend a day in perlre or something.

      I'm actually doing my development with Perl 5.10, but the machine it ultimately has to run on has only 5.8.8. Now I'm really wondering how hard it would be to upgrade it.

      Thanks for your help!

        but I'm more interested in how to make expressions work faster

        Then I'll try to give you a few general hints:

        • Learn about backtracking, and make sure you avoid it wherever possible
          • Anchor your regexes if possible
          • Try to avoid .*? and .*
          • Use backtracking-suppressing groups whenever possible
        • Try to use literal strings where possible. The regex engine is smart enough to anchor them automatically as an optimization (in certain cases)
        • Only capture (with (...)) when you actually need it

        Regexp::Assemble promises (among other things) to bring the power of trie optimizations to earlier perls, maybe it's worth a try (and less hassle than updating your perl version).

Re: How do I optimize a regular expression?
by BrowserUk (Pope) on Dec 07, 2009 at 16:38 UTC
    how to improve the performance of a regular expression

    It's hard work! If it is important enough for you to spend your time on, (and you're a masocist!), then try enabling use re 'debug';. This will tell you FMTYEWTK, about what is going on inside the regex engine.

    As for "Are there heuristics you've learned?", the most basic one is reScriptWithDebug > log & wc -l log. Lower numbers mean "more efficient"!

    Essentially, the less log your re generates, the faster it will run (without the logging). Which of course "begs the question", which despite the purist lamentations, I'm going to ask on your behalf. "

    How do you know which regex contructs will generate the least logging?

    And the answer is: suck it and see. (Paraphrase: benchmark!). But to do that, you need a range of possible candidate regexes that meet your actual needs--as opposed to either your stated needs, or those assumed by the pedants.

    Back around the 5.6.1 timeframe, Jeffery Friedl's book, Mastering regular expressions, would give you definitive answers to most regex questions--assuming that you could force yourself to read through, the rather dry, complete works. But, like life, Perl and it's regex engine, moved on. So now, the only way to optimise your regexes, is to benchmark them.

    One possible approach is to realise that the looser (more permissive; least specific), the regex, the faster it is likely to run.

    Another is to post a "please speed this up" post here at PM. But don your thick skin, cos ~70% of the replies will likely be "Optimisation is the route of all evil", knock offs! Without the proofs. Or understanding.

    So, if you have just posted a generic "how to" question, in place of a specific, "I really need to speed up this regex" question, repost the latter. Make it a challenge! Offering mega-qodos.

    Incite the competetive nature of the monks. Maybe, just maybe, you'll re-kindle the interest of some of those that used to make this place so much fun, but that have ceased to intereact here, because the pedants and PC-police have driven away those that made this place work in the first place.

    The japhys, and dwss and aristotles. Those with whom you could have a technical argument, without personal afront. Those who's ingrained desire to know, supplanted the group-speak invictive for control and conformity.

    We can but hope!


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://811525]
Approved by Old_Gray_Bear
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (6)
As of 2014-08-29 10:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (277 votes), past polls