Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??


I'm not a big regex user, so my comments may not reflect what others have experienced.

A few years ago(2003), we saw an explosion in spam on our email machines to more than 100K emails per day per machine. We were using MailScanner to process the email, and found that it couldn't keep up with the quantity we were receiving. So I wrote a preprocessor with Perl and the quickest and dirties trick was to search on 'unique' phrases in the body of the email to identify email that was 'known' spam before passing the result to MailScanner. The original was about 300 lines of script. Since then it's grown to 5000++ lines and was split into 2 persistent scripts. The average email machines now process more than 1,000,000 emails per day. I use 'Time::HiRes' to time the 'while' loop that tests for spam identified within the body of the email. The basic test is:

my $stime = gettimeofday; $body = lc($body); ## All whitespace and punctuation has be +en removed foreach $var ( @BD_data ) { my $sz = index ( $body, $var ); if ( $sz >= 0 ) { . . . last; } } my $looptime = gettimeofday - $stime; ## This value is logged!
In testing I tried to use a regex figuring I could include the 'lc' as part of the regex. All benchmarks showed the regex to be much slower than using 'lc' with 'index'.

Why this is important to you is that the '$body' averaged 10KB and the '@BD_data' usually had more than 1K elements. And the clients on the email machines that had problems were banks and the '$looptime' rarely exceeded 100ms. '@BD_data' is ordered by the frequency of spam activity, so the most common 'spam' term is first.

So my suggestion is to try using 'index' and see if it helps.

Good Luck!

"Well done is better than well said." - Benjamin Franklin

In reply to Re: Speed of Perl Regex Engine by flexvault
in thread Speed of Perl Regex Engine by Clovis_Sangrail

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others pondering the Monastery: (5)
    As of 2018-05-20 18:34 GMT
    Find Nodes?
      Voting Booth?