Beefy Boxes and Bandwidth Generously Provided by pair Networks Ovid
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: Speed of Perl Regex Engine

by flexvault (Vicar)
on Nov 28, 2012 at 21:26 UTC ( #1006107=note: print w/ replies, xml ) Need Help??


in reply to Speed of Perl Regex Engine

Clovis_Sangrail,

I'm not a big regex user, so my comments may not reflect what others have experienced.

A few years ago(2003), we saw an explosion in spam on our email machines to more than 100K emails per day per machine. We were using MailScanner to process the email, and found that it couldn't keep up with the quantity we were receiving. So I wrote a preprocessor with Perl and the quickest and dirties trick was to search on 'unique' phrases in the body of the email to identify email that was 'known' spam before passing the result to MailScanner. The original was about 300 lines of script. Since then it's grown to 5000++ lines and was split into 2 persistent scripts. The average email machines now process more than 1,000,000 emails per day. I use 'Time::HiRes' to time the 'while' loop that tests for spam identified within the body of the email. The basic test is:

my $stime = gettimeofday; $body = lc($body); ## All whitespace and punctuation has be +en removed foreach $var ( @BD_data ) { my $sz = index ( $body, $var ); if ( $sz >= 0 ) { . . . last; } } my $looptime = gettimeofday - $stime; ## This value is logged!
In testing I tried to use a regex figuring I could include the 'lc' as part of the regex. All benchmarks showed the regex to be much slower than using 'lc' with 'index'.

Why this is important to you is that the '$body' averaged 10KB and the '@BD_data' usually had more than 1K elements. And the clients on the email machines that had problems were banks and the '$looptime' rarely exceeded 100ms. '@BD_data' is ordered by the frequency of spam activity, so the most common 'spam' term is first.

So my suggestion is to try using 'index' and see if it helps.

Good Luck!

"Well done is better than well said." - Benjamin Franklin


Comment on Re: Speed of Perl Regex Engine
Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1006107]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (6)
As of 2014-04-17 22:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (458 votes), past polls