
Re^5: Looking for ideas on how to optimize this specialized grep

by furry_marmot (Pilgrim)
on Jan 26, 2011 at 18:30 UTC (#884398)

in reply to Re^4: Looking for ideas on how to optimize this specialized grep
in thread Looking for ideas on how to optimize this specialized grep

Heh. That was tricky to get to work right. But let me save you some reading. First, [^"]* doesn't suppress backtracking. It just matches zero or more of anything that isn't a quote. [^...] is a negated character class; the caret means don't match any of the characters between the square brackets.
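As a minimal sketch of what a negated class does (the sample string and variable names are made up for illustration), this pulls out everything between pairs of quotes:

```perl
use strict;
use warnings;

# Illustrative input, not from the thread.
my $line = 'say "hello" and "bye"';

# [^"]* consumes characters up to, but not including, the next quote.
# In list context, /g returns every capture from every match.
my @quoted = $line =~ /"([^"]*)"/g;

print join(', ', @quoted), "\n";    # hello, bye
```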

Now let me explain zero-width lookaheads so you know what that code is about. When you do a match against $string, Perl keeps track of the offset from the start of the string, which you can get (or set, actually) with pos($string). It makes more sense when you are doing multiple matches against the same string. Let's say I want to collect all the peppers in the following:

    $s = "I'm a pepper, he's a pepper, you're a pepper, she's a pepper...";
    while( $s =~ /(pepper)/g ) {
        push @peppers, $1;
    }
This will put 4 peppers in the array @peppers. As you probably know, the /g modifier tells the match to remember the position after the last match, and the next time through the loop, start looking for another match from that point. So...
    I'm a pepper, he's a pepper, you're a pepper, she's a pepper...
    ^           ^              ^                ^               ^
    0           1              2                3               4
...the first time through the loop, the search starts at offset 0 (position 0). After the first match, pos($s) is 12, at position 1. After the next match, the offset is at position 2, and so on until there are no more matches.
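To see those offsets directly, here is a small sketch that records pos($s) after each successful match. The @offsets array is my own addition for illustration.

```perl
use strict;
use warnings;

my $s = "I'm a pepper, he's a pepper, you're a pepper, she's a pepper...";
my @offsets;

# In scalar context, each /g match resumes where the previous one ended.
while ( $s =~ /pepper/g ) {
    push @offsets, pos($s);    # offset just past the current match
}

print "@offsets\n";    # 12 27 44 60
```

Once the match finally fails, Perl resets pos($s), so a later /g match on the same string starts from the beginning again.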

A zero-width lookahead means a) do the match and b) if successful, put the offset back where it was before you started. So, in this example...

    $s = "Wouldn't you like to be a pepper too?";
    #     ^                               ^
    #     0                               1
    $s =~ /^(?=.*pepper).*like/;

...the regex starts at the beginning of the string, and the lookahead scans forward and finds pepper. But it doesn't change the offset to position 1. Instead, it leaves it at position 0, and .*like searches forward from there to find 'like'. $s =~ /pepper.+like/ would have failed because after matching pepper, the offset would be at position 1, and searching forward won't find 'like'. Note that the lookahead needs its own .* here: a bare (?=pepper) only tests the text at the current offset, so it can succeed only at the spot where 'pepper' begins. The code is roughly the equivalent of $s =~ /pepper.*like|like.*pepper/. It's more useful when parsing complex phrases, like language, where a verb, for example, can be followed by more than one type of word or phrase.
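Here is a hedged sketch comparing three ways of asking the question against that string. The variable names are mine. One thing worth knowing is that a lookahead only tests the text at the current offset, so to ask "is pepper anywhere ahead?" it needs a .* of its own inside the parentheses.

```perl
use strict;
use warnings;

my $s = "Wouldn't you like to be a pepper too?";

# Consuming match: 'like' would have to come after 'pepper'.
my $consuming = $s =~ /pepper.+like/ ? 1 : 0;

# Lookahead at the start: confirms 'pepper' occurs somewhere ahead,
# consumes nothing, then .*like is matched from offset 0.
my $lookahead = $s =~ /^(?=.*pepper).*like/ ? 1 : 0;

# Bare (?=pepper) succeeds only where 'pepper' begins, and no 'like'
# follows that spot.
my $bare = $s =~ /(?=pepper).+like/ ? 1 : 0;

print "consuming=$consuming lookahead=$lookahead bare=$bare\n";
# consuming=0 lookahead=1 bare=0
```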

But getting back to your post:

    print "Spam!!!\n" if $text =~ /^To: \s* " (?!.*Furry\s+Marmot) [^"]* " \s* <marmot\@furrytorium\.com> /mx;

A negative lookahead is like the positive lookahead, above, but succeeds when the search term is not found. In the middle of the regex above, I want the match to succeed if there are two quotes, but they do not contain 'Furry Marmot'. (Because of the /x modifier, literal whitespace in the pattern is ignored, so the space in the name has to be written as \s+.) It won't work if I try to match "(?!.*Furry\s+Marmot)" because that says a) find a double-quote, b) don't find 'Furry Marmot' and then leave the offset just after the quote, and c) find the closing quote. This can only match "".

Instead, once we have determined that Furry Marmot is not after the first quote, match zero or more of anything that isn't another quote, up to the closing quote. Now we can check what's in the email address.
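A small sketch of why the [^"]* is needed after the negative lookahead. The two compiled patterns and the sample headers are my own illustration ("Online Pharmacy" is borrowed from later in this thread); the pattern names are hypothetical.

```perl
use strict;
use warnings;

# With [^"]*: the display name is consumed after the lookahead passes.
my $with = qr{ ^To: \s* " (?!.*Furry\s+Marmot) [^"]* " \s* <marmot\@furrytorium\.com> }mx;

# Without it: the closing quote must immediately follow the opening one.
my $without = qr{ ^To: \s* " (?!.*Furry\s+Marmot) " \s* <marmot\@furrytorium\.com> }mx;

my $pharmacy = 'To: "Online Pharmacy" <marmot@furrytorium.com>';
my $empty    = 'To: "" <marmot@furrytorium.com>';

my $r1 = $pharmacy =~ $with    ? 1 : 0;
my $r2 = $pharmacy =~ $without ? 1 : 0;
my $r3 = $empty    =~ $without ? 1 : 0;

print "with=$r1 without=$r2 empty-name-without=$r3\n";
# with=1 without=0 empty-name-without=1
```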

This is just a simplistic example, and probably would be overkill if you tried to accommodate multiple addressees, a CC: or BCC: line, etc., but I hope it helps you learn regexes. They are one of my favorite parts of Perl. :-) There's a very good description of backtracking in perldoc perlretut. That and perldoc perlre will shed a lot of light on this.


Re^6: Looking for ideas on how to optimize this specialized grep
by remiah (Hermit) on Jan 27, 2011 at 06:04 UTC
    >First, [^"]* doesn't suppress backtracking.

    For quoted things like "..." or <html>, I have sometimes seen a negated class used as if it were there to suppress backtracking.

    $a=q("aaa" and "bbb");
    $a =~ s/".*"/test/g;
    print "back tracked #$a#\n";

    $a=q("aaa" and "bbb");
    $a=~ s/"[^"]*"/test/g;
    print "with negator #$a#\n";

    $a=q("aaa" and "bbb");
    $a=~ s/".*?"/test/g;
    print "with backtrack suppress #$a#\n";
    Do you mean [^"]* backtracks internally?

    My question (what I don't get) is: where has the double quote gone when this regex is matched against To:<>? Below is my further simplified example.

    $text = <<'EOT';
    Message-ID: <ODM2bWFpbGVyLmRpZWJlYS40MjYyNjE2LjEyOTU1NDE2MTg=@out-p-h.>
    To: "Angie Morestead" <>
    EOT
    my @tos=(
         q(To:"Furry Marmot" <>)
        ,q(To:"Mr.Furry Marmot" <>)
        ,q(To:"pharmacy" <>)
        ,q(To:<>)
        ,q(To:<>)
    );
    foreach my $to( @tos){
        $text =~ s/To:.*/$to/;
        print "$text\n";

        ###two condition version
        if ($text =~ m{ ^(?:From|To): \s* ".*Furry\s{1}Marmot" \s* <marmot\@furrytorium\.com> }mx
         || $text =~ m{ ^(?:From|To): \s* <marmot\@furrytorium\.com> }mx ) {
            print "### with two cond, ok, matched=#$&#\n";
        }else {
            print "### with two cond, ng\n";
        }

        ###one condition
        if ($text =~ m{ ^(?:From|To):        #From: or To:
                \s*
                "                            #<---here Where are you gone?
                (?!.*Furry\s+Marmot)         #
                [^"]*                        #
                "                            #
                \s+                          #
                <marmot\@furrytorium\.com>
            }mx ){
            print "### Spam!! ,matched=#$&#\n";
        }else {
            print "### not Spam!!\n";
        }
        print "\n\n";
    }

    And why is "To:<>" not Spam...? I think I'm missing something... Anyway, I should print out perlretut. Thanks.

      You're confusing a couple of things. A negated character class just means match something that is not in this class of chars. [^abc]+ means match one or more of anything that isn't a, b, or c. It has nothing to do with backtracking.

      Read perlretut for sure, and see "Backtracking" in perlre. Generally it's not something you have to worry about unless you have a regex that's running really slowly.
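      One way to convince yourself that [^"]* still backtracks is to compare it with a possessive quantifier, which really does refuse to give characters back. This is only a sketch with a made-up test string, and it assumes Perl 5.10 or later, where the *+ form is available.

```perl
use strict;
use warnings;

# Illustrative input, not from the thread.
my $s = '"aaa"';

# Greedy [^"]* first takes 'aaa', then gives one 'a' back so that the
# literal 'a' in the pattern can match.
my $greedy = $s =~ /"[^"]*a"/ ? 1 : 0;

# Possessive [^"]*+ never gives anything back, so the literal 'a' has
# nothing left to match.
my $possessive = $s =~ /"[^"]*+a"/ ? 1 : 0;

print "greedy=$greedy possessive=$possessive\n";    # greedy=1 possessive=0
```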

      With regard to the match above, it's coded to look at a specific pattern of email -- and it's not all-inclusive -- it just determines whether it matches a pattern that *I* say is or is not spam.
      1 To: "Furry Marmot" <>
      2 To: "Mr.Furry Marmot" <>
      3 To: "pharmacy" <>
      4 To: <>
      5 To: <>

      According to my admittedly arbitrary rules, if the display name ("Furry Marmot") is consistent with the local-part of the address (the part before the @ sign: "marmot"), this is a valid address. Also, just the email address without the display name is fine.

      But the regex I wrote tests for something that doesn't match that pattern. It says, IF there is something between quotes, BUT that something doesn't include "Furry Marmot", AND the address is "<>", THEN it's spam. So the regex matches my definition of spam, failing on not-spam.

      So number 1 fails because the regex tests for 'not .*Furry\s+Marmot' between quotes but finds 'Furry Marmot' followed by '<>'. The match fails, so it is not spam.

      Number 2 also fails because we're testing for 'not .*Furry\s+Marmot' and 'Mr.Furry Marmot' does match '.*Furry\s+Marmot'. But 'pharmacy' is definitely 'not Furry Marmot'; it's followed by the marmot email address, so Number 3 is spam.

      Number 4 and 5 fail because the regex is looking for To: "something between quotes".... There are no quotes at all, so it fails quickly, and failure equals not-spam, so they're both not spam.
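      The five cases can be run through the one-condition regex as a quick check. Note the assumption: the addresses are elided in this thread, so the ones below are illustrative guesses (the marmot address for cases 1 to 4, and a hypothetical someone@example.com for case 5).

```perl
use strict;
use warnings;

# ASSUMPTION: addresses are elided in the thread; these are illustrative.
my @headers = (
    'To: "Furry Marmot" <marmot@furrytorium.com>',
    'To: "Mr.Furry Marmot" <marmot@furrytorium.com>',
    'To: "pharmacy" <marmot@furrytorium.com>',
    'To: <marmot@furrytorium.com>',
    'To: <someone@example.com>',
);

my $spam_re = qr{
    ^(?:From|To): \s*
    "                       # opening quote of the display name
    (?!.*Furry\s+Marmot)    # the name must not appear ahead
    [^"]*                   # consume the display name
    "                       # closing quote
    \s+
    <marmot\@furrytorium\.com>
}mx;

my @verdicts = map { $_ =~ $spam_re ? 'spam' : 'not spam' } @headers;
printf "%d %s\n", $_ + 1, $verdicts[$_] for 0 .. $#verdicts;
```

      Under those assumed addresses, only number 3 comes out as spam.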

      Obviously one could come up with a much better regex than my off-the-cuff, narcissistic example. :-) I was thinking of a spam catcher I worked on a long time ago, that included a series of patterns to try matching against a header block. One of the patterns I remember is emails addressed to "Online Pharmacy", but with my address. Another pattern was emails from me, to me, which wouldn't happen with those particular email accounts. But you get into all kinds of issues like, "Would I send an email to myself?" For a lot of people, the answer might be yes. And what about something sent to "Subscriber" <>? Is that valid? And what do you do with "Funky Marmot" <>? Oooooo, SpamAssassin starts looking good very quickly.

        >Number 4 and 5 fail because the regex is looking for To: 
        >"something between quotes".... There are no quotes at all, 
        >so it fails quickly, and failure equals not-spam, so they're 
        >both not spam.
        I get it. No.5 is not spam for this regex filter. No.4 and No.5 fail simply because the double quotes don't exist! I was totally confused by your comment below.
        # The email address should be "Furry Marmot" <>, or just
        # Anything else is spam.
        I was thinking that there must be some magic to judge "just <>" as spam. But the test script says it is not spam...??? Now it's clear.

        Anyway, I should look at perlre also. Thanks for the kind explanation to a bad student. Regards.
