
Speed of Perl Regex Engine

by Clovis_Sangrail (Beadle)
on Nov 28, 2012 at 16:04 UTC (#1006062=perlquestion)
Clovis_Sangrail has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I use Perl to generate daily audit reports from sets of Journal Log Files produced by GT.M, an implementation of the MUMPS database/language. Each Journal line of interest includes a Username, a Global Variable and a description of the transaction on it. The report just presents a listing and count of the Global Variable modifications, broken out by Username.

The customer wanted the capability to ignore some Globals that were not of interest. They can edit a file of such Globals, and I read that file and build an inclusive-OR (alternation) Regex that I pass to the Perl program as a command-line parameter. The program matches the Global Variable name from each Journal line against that Regex, and skips the line if it matches.

But I did not realize just how popular this capability would be! I figured there would only ever be a few such Globals to skip, but the Customer has entered 54 of them so far, and they say there will be more! The Regex that I give to the Perl program is now about 750 characters long, and some of the bigger banks being audited produce over a million lines of Journal each day.

The reports for those banks do take noticeably longer to produce than when the system first went online, and I don't have much knowledge of or feel for the performance of the Perl Regex engine. Is it linear, like will it take ten times as long to match against a 600-character Regex than against a 60-character one?

I realize that this is just the sort of thing that enterprising Perl students study via test programs, and I may do that sort of thing. But I also do want to be able to tell the folks who sign my check that I am asking around, too.

Re: Speed of Perl Regex Engine
by runrig (Abbot) on Nov 28, 2012 at 16:17 UTC
    Does it need to be a regex? If you can live with an exact match, then I'd use a hash:
    my %want_global = map { ($_ => 1) } qw( THIS THAT ANOTHER );
    ...
    if ( $want_global{$global} ) {
        ...
    }
    If you really need to have regexes, then put them in an array instead of a single regex; it will generally perform better than a joined single regex (and put the most likely things to match first, if possible):
    my @wanted_re = (
        qr/^AB/,
        qr/^CD/,
    );
    sub want_global {
        my $g = shift;
        for my $re (@wanted_re) {
            return 1 if $g =~ /$re/;
        }
        return;
    }
    Update: I see that maybe you want 'unwanted global' logic...whatever...the above still applies...adjust to suit needs.
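
For the "unwanted global" direction, the hash approach can be sketched roughly as follows. This assumes the customer's file holds one exact Global name per line, and that each Journal line has already been parsed into a user and a Global name (the real Journal format is not shown in the thread). The names `load_skip_list` and `count_by_user` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Build a lookup hash from skip-list lines (one Global name per line).
# Surrounding whitespace is trimmed and blank lines are ignored.
sub load_skip_list {
    my @lines = @_;
    my %skip;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '';
        $skip{$line} = 1;
    }
    return \%skip;
}

# Count modifications per user and Global, ignoring skipped Globals.
# Each record is an array ref of [ user, global ] (an assumed shape).
sub count_by_user {
    my ($skip, @records) = @_;
    my %count;
    for my $rec (@records) {
        my ($user, $global) = @$rec;
        next if $skip->{$global};    # one hash lookup per Journal line
        $count{$user}{$global}++;
    }
    return \%count;
}

# Made-up sample data for illustration only.
my $skip  = load_skip_list("^TMP\n", "  ^SCRATCH  ", "");
my $count = count_by_user(
    $skip,
    [ 'alice', '^ACCT' ],
    [ 'alice', '^TMP' ],
    [ 'bob',   '^ACCT' ],
    [ 'alice', '^ACCT' ],
);
```

The cost of each lookup does not grow with the number of names in the skip list, which is the main appeal when the list keeps getting longer.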
Re: Speed of Perl Regex Engine
by moritz (Cardinal) on Nov 28, 2012 at 16:23 UTC
    The reports for those banks do take noticeably longer to produce than when the system first went online

    That sounds as if lots of stuff might have been changed in between. Run a profiler over the script(s) and see where the time is actually spent.

    I don't have much knowledge of or feel for the performance of the Perl Regex engine. Is it linear, like will it take ten times as long to match against a 600-character Regex than against a 60-character one?

    In general, it doesn't depend much on the length of regex, but on the amount of backtracking and searching that the regex engine has to do.

    If it's just a big alternation of constant strings, and you use perl 5.10.0 or newer, the trie optimization in the regex engine should handle that case very well (sub-linear even). If your regex grows too big, try increasing ${^RE_TRIE_MAXBUF} -- but only if it's the regex that's actually slow.
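
A "big alternation of constant strings" could be built along these lines. This is only a sketch: the skip names are invented, and it assumes the names should match literally and in full, which may not be how the original program matches.

```perl
use strict;
use warnings;

# Hypothetical skip list; in practice this would come from the customer's file.
my @skip_names = ('^TMP', '^SCRATCH', '^LOG.OLD');

# quotemeta escapes regex metacharacters such as ^ and . so that each name
# is matched literally; \A and \z anchor the match to the whole string.
my $alternation = join '|', map { quotemeta($_) } @skip_names;
my $skip_re     = qr/\A(?:$alternation)\z/;

# Returns 1 if the Global name is on the skip list, 0 otherwise.
sub is_skipped {
    my ($global) = @_;
    return $global =~ $skip_re ? 1 : 0;
}
```

Because every branch is a plain literal, this is the shape of pattern the trie optimization mentioned above is meant to handle.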

    And as already mentioned, if you can solve your problem through a hash lookup, that would be even better.
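
To compare the two approaches on a given machine, a small test program using the core Benchmark module might look like the sketch below. The 54 skip names and 500 other names are synthetic stand-ins, so the timings it prints only hint at how the real Journal data would behave.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Synthetic data: 54 made-up Global names to skip, plus 500 names to keep.
my @skip_names = map { "^SKIP$_" } 1 .. 54;
my @candidates = ((map { "^KEEP$_" } 1 .. 500), @skip_names);

# The same skip list expressed as a hash and as one anchored alternation.
my %skip_hash = map { $_ => 1 } @skip_names;
my $alt       = join '|', map { quotemeta($_) } @skip_names;
my $skip_re   = qr/\A(?:$alt)\z/;

# Each sub counts how many candidates are on the skip list.
sub count_hash {
    my $n = 0;
    for my $name (@candidates) { $n++ if $skip_hash{$name} }
    return $n;
}

sub count_regex {
    my $n = 0;
    for my $name (@candidates) { $n++ if $name =~ $skip_re }
    return $n;
}

# A negative count asks Benchmark to run each variant for at least
# that many CPU seconds; the timings are printed to standard output.
my $results = timethese(-1, {
    hash  => \&count_hash,
    regex => \&count_regex,
});
```

This measures only the matching step. Profiling the whole report script, as suggested above, is still the way to find out whether the matching is where the time actually goes.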

Re: Speed of Perl Regex Engine
by Clovis_Sangrail (Beadle) on Nov 28, 2012 at 18:19 UTC

    Thanks runrig & moritz.

    I'm glad that you suggested the use of a hash. I had said the same to my managers, and now I can say independent experts recommend it too. (I'd never heard of "map" before...)

    It'd be interesting to run the profiler, I've never done that.

Re: Speed of Perl Regex Engine
by flexvault (Parson) on Nov 28, 2012 at 21:26 UTC

    Clovis_Sangrail,

    I'm not a big regex user, so my comments may not reflect what others have experienced.

    A few years ago (2003), we saw an explosion in spam on our email machines to more than 100K emails per day per machine. We were using MailScanner to process the email, and found that it couldn't keep up with the quantity we were receiving. So I wrote a preprocessor with Perl, and the quickest and dirtiest trick was to search on 'unique' phrases in the body of the email to identify email that was 'known' spam before passing the result to MailScanner. The original was about 300 lines of script. Since then it's grown to 5000++ lines and was split into 2 persistent scripts. The average email machine now processes more than 1,000,000 emails per day. I use 'Time::HiRes' to time the loop that tests for spam identified within the body of the email. The basic test is:

    use Time::HiRes qw(gettimeofday);

    my $stime = gettimeofday;
    $body = lc($body);    ## All whitespace and punctuation has been removed
    foreach my $var ( @BD_data ) {
        my $sz = index ( $body, $var );
        if ( $sz >= 0 ) {
            . . .
            last;
        }
    }
    my $looptime = gettimeofday - $stime;    ## This value is logged!
    In testing I tried to use a regex figuring I could include the 'lc' as part of the regex. All benchmarks showed the regex to be much slower than using 'lc' with 'index'.

    Why this is important to you is that the '$body' averaged 10KB and the '@BD_data' usually had more than 1K elements. And the clients on the email machines that had problems were banks and the '$looptime' rarely exceeded 100ms. '@BD_data' is ordered by the frequency of spam activity, so the most common 'spam' term is first.

    So my suggestion is to try using 'index' and see if it helps.

    Good Luck!

    "Well done is better than well said." - Benjamin Franklin

Re: Speed of Perl Regex Engine
by Jenda (Abbot) on Nov 29, 2012 at 16:43 UTC

    You said you build the regex and pass it to the program as a parameter ... how do you use it then?

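    # Version 1: plain interpolation
    my $regexp = $ARGV[1];
    ...
    next if ($global =~ /$regexp/);
    ...

    # Version 2: interpolation with the /o modifier
    my $regexp = $ARGV[1];
    ...
    next if ($global =~ /$regexp/o);
    ...

    # Version 3: precompiled with qr//
    my $regexp = qr/$ARGV[1]/;
    ...
    next if ($global =~ $regexp);
    ...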

    The first version compiles the regexp each time you use it, the other two just once. For a longer regexp this may make a big difference.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      I believe in later versions of Perl (>= 5.6), this is no longer true. As long as $regexp doesn't change, /$regexp/ isn't compiled again, making "/o" (mostly) obsolete.
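
Whichever way the recompilation question falls for a given Perl version, compiling the pattern once with qr// makes the intent explicit and avoids relying on it. A rough sketch follows; `make_skip_test` and the sample pattern are hypothetical, and the pattern string stands in for the real command-line parameter.

```perl
use strict;
use warnings;

# Compile a pattern string once and return a closure that tests names
# against it. Dies with a readable message if the pattern is not valid.
sub make_skip_test {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    die "Invalid skip pattern '$pattern': $@" unless defined $re;
    return sub { return $_[0] =~ $re ? 1 : 0 };
}

# Stand-in for the command-line parameter; a real script would take it
# from @ARGV instead of hard-coding it.
my $pattern = '^(?:\^TMP|\^SCRATCH)$';
my $skip    = make_skip_test($pattern);

for my $global ('^ACCT', '^TMP') {
    next if $skip->($global);
    print "keeping $global\n";
}
```

With the sample names above, the loop prints only "keeping ^ACCT". Wrapping the compilation in eval also catches a malformed pattern up front rather than at the first Journal line.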

Node Type: perlquestion [id://1006062]
Approved by herveus