PerlMonks  

Re^7: Looking for ideas on how to optimize this specialized grep

by furry_marmot (Pilgrim)
on Jan 28, 2011 at 00:55 UTC (#884700)


in reply to Re^6: Looking for ideas on how to optimize this specialized grep
in thread Looking for ideas on how to optimize this specialized grep

You're confusing a couple of things. A negated character class just means match something that is not in this class of chars. [^abc]+ means match one or more characters that aren't a, b, or c. It has nothing to do with backtracking.
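To make the negated class concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration; only the [^abc]+ class itself comes from the explanation above.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only to show where the match stops.
my $text = 'xyzabc123';

# [^abc]+ takes the longest run of characters that are NOT a, b, or c.
# Anchored at the start, it stops right before the first 'a'.
my ($run) = $text =~ /^([^abc]+)/;
print "leading run: $run\n";

# A common use of the same idea: capture everything between two quotes
# by matching "one or more characters that are not a double quote".
my $header = 'To: "Furry Marmot" <marmot@furrytorium.com>';
my ($name) = $header =~ /"([^"]+)"/;
print "display name: $name\n";
```

The first match captures xyz and the second captures Furry Marmot. In both cases the class simply describes which characters are allowed at that position.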

Read perlretut for sure, and see "Backtracking" in perlre. Generally it's not something you have to worry about unless you have a regex that's running really slowly.

With regard to the match above, it's coded to look at a specific pattern of email -- and it's not all-inclusive -- it just determines whether it matches a pattern that *I* say is or is not spam.
    1  To: "Furry Marmot" <marmot@furrytorium.com>
    2  To: "Mr.Furry Marmot" <marmot@furrytorium.com>
    3  To: "pharmacy" <marmot@furrytorium.com>
    4  To: <marmot@furrytorium.com>
    5  To: <test@furrytorium.com>

According to my admittedly arbitrary rules, if the display name ("Furry Marmot") is consistent with the local-part of the address (the part before the @ sign: "marmot"), this is a valid address. Also, just the email address without the display name is fine.

But the regex I wrote tests for something that doesn't match that pattern. It says, IF there is something between quotes, BUT that something doesn't include "Furry Marmot", AND the address is "<marmot@furrytorium.com>", THEN it's spam. So the regex matches my definition of spam, failing on not-spam.

So number 1 fails because the regex tests for 'not .+Furry Marmot' between quotes but finds 'Furry Marmot' followed by '<marmot@furrytorium.com>'. The match fails, so it is not spam.

Number 2 also fails because we're testing for 'not .+Furry Marmot' and 'Mr.Furry Marmot' actually is '.+Furry Marmot'. But 'pharmacy' is definitely 'not Furry Marmot'; it's followed by the marmot email address, so Number 3 is spam.

Number 4 and 5 fail because the regex is looking for To: "something between quotes".... There are no quotes at all, so it fails quickly, and failure equals not-spam, so they're both not spam.
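The walk-through above can be sketched as a small script. Note that the regex here is a hypothetical reconstruction written for this illustration, not necessarily the one from the earlier post: it uses a negative lookahead with [^"]* where the description says '.+', and the names $spam_re and is_spam are made up.

```perl
use strict;
use warnings;

# Hypothetical reconstruction (an assumption, not the original regex):
#   To: followed by a quoted display name that does NOT contain
#   "Furry Marmot", followed by the marmot address.
# A successful match means "spam"; a failed match means "not spam".
my $spam_re = qr/^To:\s*"(?![^"]*Furry Marmot)[^"]*"\s*<marmot\@furrytorium\.com>/;

# The five sample headers from the list above.
my @headers = (
    'To: "Furry Marmot" <marmot@furrytorium.com>',
    'To: "Mr.Furry Marmot" <marmot@furrytorium.com>',
    'To: "pharmacy" <marmot@furrytorium.com>',
    'To: <marmot@furrytorium.com>',
    'To: <test@furrytorium.com>',
);

# Returns 1 when the header matches the spam pattern, 0 otherwise.
sub is_spam { return $_[0] =~ $spam_re ? 1 : 0 }

my $n = 0;
for my $header (@headers) {
    $n++;
    printf "%d %-8s %s\n", $n, (is_spam($header) ? 'spam' : 'not spam'), $header;
}
```

Under these assumptions only header 3 is reported as spam. Headers 1 and 2 fail at the lookahead because the display name contains "Furry Marmot", and headers 4 and 5 fail as soon as the regex finds no opening quote after To:.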

Obviously one could come up with a much better regex than my off-the-cuff, narcissistic example. :-) I was thinking of a spam catcher I worked on a long time ago that included a series of patterns to try matching against a header block. One of the patterns I remember is emails addressed to "Online Pharmacy", but with my address. Another pattern was emails from me, to me, which wouldn't happen with those particular email accounts. But you get into all kinds of issues like, "Would I send an email to myself?" For a lot of people, the answer might be yes. And what about something sent to "Subscriber" <marmot@furrytorium.com>? Is that valid? And what do you do with "Funky Marmot" <marmot@furrytorium.com>? Oooooo, SpamAssassin starts looking good very quickly.


Re^8: Looking for ideas on how to optimize this specialized grep
by remiah (Hermit) on Jan 28, 2011 at 07:51 UTC

    >Number 4 and 5 fail because the regex is looking for To: 
    >"something between quotes".... There are no quotes at all, 
    >so it fails quickly, and failure equals not-spam, so they're 
    >both not spam.
    
    I get it. No. 5 is not spam for this regex filter. No. 4 and No. 5 fail simply because there are no double quotes! I was totally confused by your comment below.
    # The email address should be "Furry Marmot" <marmot@furrytorium.com>, or just
    # marmot@furrytorium.com. Anything else is spam.
    
    I was thinking that there must be some magic to judge "just <test@furrytorium.com>" as spam. But the test script says it is not spam...??? Now it's clear.

    Anyway, I should look at perlre too. Thanks for the kind explanation to a bad student. Regards.
