Re^7: Looking for ideas on how to optimize this specialized grep
by furry_marmot (Pilgrim) on Jan 28, 2011 at 00:55 UTC
You're confusing a couple of things. A negated character class just means "match a character that is not in this set." [^abc]+ means match one or more characters, none of which is a, b, or c. It has nothing to do with backtracking.
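A quick way to convince yourself is to run it. This is just a small sketch; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'xyzabc123';

# [^abc]+ grabs the longest run of characters that are not a, b, or c,
# starting at the leftmost position where such a run exists.
my ($run) = $text =~ /([^abc]+)/;

# With /g in list context we get every such run in the string.
my @runs = $text =~ /([^abc]+)/g;

# A negated class still has to consume a character, so it cannot
# match an empty string.
my $matches_empty = ( '' =~ /[^abc]/ ) ? 1 : 0;

print "first run: $run\n";
print "all runs:  @runs\n";
print "matches empty string: $matches_empty\n";
```

The runs are whatever lies between the a, b, and c characters; the class only describes which characters may be consumed.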
Read perlretut for sure, and see "Backtracking" in perlre. Generally it's not something you have to worry about unless you have a regex that's running really slowly.
With regard to the match above, it's coded to look at a specific pattern of email. It's not all-inclusive; it just determines whether it matches a pattern that *I* say is or is not spam.
According to my admittedly arbitrary rules, if the display name ("Furry Marmot") is consistent with the local-part of the address (the part before the @ sign: "marmot"), this is a valid address. Also, just the email address without the display name is fine.
But the regex I wrote tests for something that doesn't match that pattern. It says, IF there is something between quotes, BUT that something doesn't include "Furry Marmot", AND the address is "<firstname.lastname@example.org>", THEN it's spam. So the regex matches my definition of spam, failing on not-spam.
So number 1 fails because the regex tests for 'not .+Furry Marmot' between quotes but finds 'Furry Marmot' followed by '<email@example.com>'. The match fails, so it is not spam.
Number 2 also fails because we're testing for 'not .+Furry Marmot' and 'Mr.Furry Marmot' actually is '.+Furry Marmot'. But 'pharmacy' is definitely 'not Furry Marmot'; it's followed by the marmot email address, so Number 3 is spam.
Numbers 4 and 5 fail because the regex is looking for To: "something between quotes".... There are no quotes at all, so it fails quickly, and failure equals not-spam, so they're both not spam.
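To make the five cases concrete, here is a rough sketch of that kind of test. It is not the regex from earlier in the thread: the pattern, the sample To: lines, the address marmot@example.com, and the is_spam helper are all hypothetical stand-ins chosen to mirror the cases described above.

```perl
use strict;
use warnings;

# Hypothetical spam pattern: a quoted display name that does NOT contain
# "Furry Marmot" before the closing quote, followed by the made-up address.
my $spam_re = qr/^To:\s*"(?![^"]*Furry Marmot)[^"]*"\s*<marmot\@example\.com>/;

# A successful match means "spam"; a failed match means "not spam".
sub is_spam {
    my ($header) = @_;
    return $header =~ $spam_re ? 1 : 0;
}

# Made-up header lines standing in for cases 1 through 5.
my @headers = (
    'To: "Furry Marmot" <marmot@example.com>',
    'To: "Mr.Furry Marmot" <marmot@example.com>',
    'To: "pharmacy" <marmot@example.com>',
    'To: marmot@example.com',
    'To: <marmot@example.com>',
);

for my $i ( 0 .. $#headers ) {
    printf "%d: %-45s %s\n", $i + 1, $headers[$i],
        is_spam( $headers[$i] ) ? 'spam' : 'not spam';
}
```

With these sample lines, only the third header should come back as spam. The first two trip the negative lookahead, and the last two never get past the opening quote the pattern requires.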
Obviously one could come up with a much better regex than my off-the-cuff, narcissistic example. :-) I was thinking of a spam catcher I worked on a long time ago that included a series of patterns to try matching against a header block. One of the patterns I remember is emails addressed to "Online Pharmacy", but with my address. Another pattern was emails from me, to me, which wouldn't happen with those particular email accounts. But you get into all kinds of issues like, "Would I send an email to myself?" For a lot of people, the answer might be yes. And what about something sent to "Subscriber" <firstname.lastname@example.org>? Is that valid? And what do you do with "Funky Marmot" <email@example.com>? Oooooo, SpamAssassin starts looking good very quickly.