|laziness, impatience, and hubris|
Re: Speed of Perl Regex Engineby flexvault (Prior)
|on Nov 28, 2012 at 21:26 UTC||Need Help??|
I'm not a big regex user, so my comments may not reflect what others have experienced.
A few years ago(2003), we saw an explosion in spam on our email machines to more than 100K emails per day per machine. We were using MailScanner to process the email, and found that it couldn't keep up with the quantity we were receiving. So I wrote a preprocessor with Perl and the quickest and dirties trick was to search on 'unique' phrases in the body of the email to identify email that was 'known' spam before passing the result to MailScanner. The original was about 300 lines of script. Since then it's grown to 5000++ lines and was split into 2 persistent scripts. The average email machines now process more than 1,000,000 emails per day. I use 'Time::HiRes' to time the 'while' loop that tests for spam identified within the body of the email. The basic test is:
In testing I tried to use a regex figuring I could include the 'lc' as part of the regex. All benchmarks showed the regex to be much slower than using 'lc' with 'index'.
Why this is important to you is that the '$body' averaged 10KB and the '@BD_data' usually had more than 1K elements. And the clients on the email machines that had problems were banks and the '$looptime' rarely exceeded 100ms. '@BD_data' is ordered by the frequency of spam activity, so the most common 'spam' term is first.
So my suggestion is to try using 'index' and see if it helps.
"Well done is better than well said." - Benjamin Franklin