
Re: Speed of Perl Regex Engine

by flexvault (Monsignor)
on Nov 28, 2012 at 21:26 UTC ( #1006107=note )

in reply to Speed of Perl Regex Engine


I'm not a big regex user, so my comments may not reflect what others have experienced.

A few years ago (2003), we saw an explosion of spam on our email machines, to more than 100K emails per day per machine. We were using MailScanner to process the email and found that it couldn't keep up with the volume we were receiving. So I wrote a preprocessor in Perl, and the quickest and dirtiest trick was to search for 'unique' phrases in the body of the email to identify 'known' spam before passing the result to MailScanner. The original was about 300 lines of script. Since then it has grown to 5000++ lines and been split into 2 persistent scripts. The average email machine now processes more than 1,000,000 emails per day. I use 'Time::HiRes' to time the loop that tests for spam identified within the body of the email. The basic test is:

    use Time::HiRes qw(gettimeofday);

    my $stime = gettimeofday;
    $body = lc($body);    ## All whitespace and punctuation has been removed
    foreach my $var ( @BD_data ) {
        my $sz = index( $body, $var );
        if ( $sz >= 0 ) {
            . . .
            last;
        }
    }
    my $looptime = gettimeofday - $stime;    ## This value is logged!

In testing, I tried a regex instead, figuring I could fold the 'lc' into the regex by matching case-insensitively. All my benchmarks showed the regex to be much slower than 'lc' with 'index'.
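
If you want to check this on your own data, something like the following rough sketch could be a starting point. It is not my production code: the phrase list, the sample body, and the sub names 'match_with_index' and 'match_with_regex' are all made up for illustration, and the numbers you get will depend on your Perl version and your data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: stand-ins for the real $body and @BD_data.
# 1000 phrases, and a body of roughly 10KB with one hit near the end of the list.
my @phrases = map { sprintf 'spamphrase%04d', $_ } 1 .. 1000;
my $body    = ( 'Lorem ipsum dolor sit amet ' x 400 ) . 'SpamPhrase0900';

# Lowercase once, then do a plain substring search per phrase.
# Returns the position of the first matching phrase in the list, or -1.
sub match_with_index {
    my ( $text, $list ) = @_;
    my $lower = lc $text;
    for my $i ( 0 .. $#$list ) {
        return $i if index( $lower, $list->[$i] ) >= 0;
    }
    return -1;
}

# Same search with a case-insensitive regex per phrase.
# \Q...\E quotes any regex metacharacters in the phrase.
sub match_with_regex {
    my ( $text, $list ) = @_;
    for my $i ( 0 .. $#$list ) {
        my $phrase = $list->[$i];
        return $i if $text =~ /\Q$phrase\E/i;
    }
    return -1;
}

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese( -1, {
    index => sub { match_with_index( $body, \@phrases ) },
    regex => sub { match_with_regex( $body, \@phrases ) },
} );
```

Both subs return the same list position for the same input, so the comparison only measures the search itself.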

Why this matters to you: '$body' averaged 10KB and '@BD_data' usually had more than 1K elements. The clients on the email machines that had problems were banks, and '$looptime' rarely exceeded 100ms. '@BD_data' is ordered by the frequency of spam activity, so the most common 'spam' term is first.
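
The frequency ordering could be kept up to date along these lines. This is only a sketch under my own assumptions: the '%hits' counter, the sample phrases, and the sub name 'find_spam_phrase' are hypothetical and not taken from the real scripts.

```perl
use strict;
use warnings;

# Hypothetical hit counts per phrase (how often each one has matched so far).
my %hits = (
    cheapmeds          => 120,
    winbigprize        => 4500,
    urgentwiretransfer => 37,
);

# Most frequently seen phrase first, so the loop usually exits early.
# Ties are broken alphabetically to keep the order stable.
my @BD_data = sort { $hits{$b} <=> $hits{$a} || $a cmp $b } keys %hits;

# Return the first phrase found in the body, or nothing if none matches.
# Each hit bumps the counter so the list can be re-sorted later.
sub find_spam_phrase {
    my ( $body, $phrases ) = @_;
    my $lower = lc $body;
    for my $phrase (@$phrases) {
        if ( index( $lower, $phrase ) >= 0 ) {
            $hits{$phrase}++;
            return $phrase;
        }
    }
    return;
}
```

Re-sorting does not need to happen on every email; doing it periodically from the logged counts would give the same early-exit benefit.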

So my suggestion is to try using 'index' and see if it helps.

Good Luck!

"Well done is better than well said." - Benjamin Franklin
