Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Re: Looking for ideas on how to optimize this specialized grep

by furry_marmot (Pilgrim)
on Jan 22, 2011 at 00:39 UTC ( #883643=note: print w/ replies, xml ) Need Help??


in reply to Looking for ideas on how to optimize this specialized grep

I can think of a few things. First, read the header block all at once. That will reduce the number of reads to one per file.

Then anchor the regex on the beginning of a line. Specifically, only look on the To: line (I leave the Cc: line, as well as multiple addressees, as an excercise for later) and lose the /s modifier. The regex will fail immediately on any line that doesn't start with To:, only searching for email addresses on the rest of the To: line. This will reduce the number of matches to one per file.

Do a case-insensitive match, rather than lower casing the line. This will be one less operation per file.

#!/usr/bin/perl use strict; use warnings; use File::Find; use YAML::Syck; my %addresses; find(sub { return unless -f $_; open my $fh, '<', $_ or die; local $/ = ''; # "Paragraph" mode, reads a block of text to n +ext \n\n $_ = <$fh>; # Read Header block if (/To:.+\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/mi) { # /m to ancho +r, /i to ignore case my $addr = $1; # some addresses are in mailing list bounce format if ($addr =~ s/[=\#\%](?:3d)?/@/) { # No need for /xms her +e $addr =~ s/\@[^@]+$//; } $addresses{$addr}++; } close $fh; }, glob($ENV{HOME} . '/Maildir/.misc*')); print Dump \%addresses;
You're using the /x, /m, and /s modifiers already, but I'm not sure you understand how to use them. /x allows you to put in whitespace for clarity, requiring you to use \s to match whitespace.
/To:.+ \b ( andrew\+ [^\@\s]+ \@ [-a-z0-9\.]+ ) \b /mix

The \s is fine in your original code, but you're not actually using any whitespace for clarity, so /x is unneeded.

Same with /s, where . matches newlines as if they were whitespace. For example, in the text below, /upon.+little/s would find a match,

Once upon a time\n
there was a little
prince\n
named Lord Fancypants.\n

while /upon.+little/ would not because . won't match a newline without /s.

Finally, /m allows you to match ^ and $ on a "line" -- that is, after and before a newline embedded in a block of text. So /^there/ would not find a match, while /^there/m would.

Once upon a time\n
there was a little prince\n
named Lord Fancypants.\n

And /^there.+/m would match "there was a little prince", while /^there.+/ms would match

Once upon a time\n
there was a little prince\n
named Lord Fancypants.\n

Similarly, in your substitution,  $addr =~ s/[=\#\%](?:3d)?/@/xms  , you're not using whitespace for clarity, nor are you anchoring on the beginning or end of a line, nor should your match ever go over a newline. I'm not sure if it would make a performance difference, but you could just drop the /xms entirely, as I did above.

I hope this helps.

--marmot


Comment on Re: Looking for ideas on how to optimize this specialized grep
Select or Download Code
Re^2: Looking for ideas on how to optimize this specialized grep
by afresh1 (Hermit) on Jan 24, 2011 at 22:17 UTC

    So you think setting $/ = "\n\n" and ($headers) = <$fh> should be faster? Sounds very interesting.

    Then regex against $headers and anchor against specific headers. But, many addresses are "hidden" in the first received header because of mailing lists or other things, because of that I had looked at the entire header. Maybe /(?:^To:\s+|^CC:\s+|<)$address/ms?

    I lowercased the entire line because I thought that would be faster than a case insensitive regex.

    I generally use /xms on all my regexes as that is how I expect them to work, and if I add them, it doesn't hurt even if I don't use the feature. Is there a reason NOT to use /x and /s? Do they slow down the regex?

    Thank you for many things to try to figure out setting up a benchmarks for.

    l8rZ,
    --
    andrew

      So you think setting $/ = "\n\n" and ($headers) = <$fh> should be faster?

      It's $/ = '', not $/ = "\n\n". It's just the way it works. And setting it to $/ = undef will slurp in the whole file. Anyway, yes, I think one file read and one match will be a lot faster than a dozen or so reads and a dozen or so matches, depending on the particular header.

      Then regex against $headers and anchor against specific headers. But, many addresses are "hidden" in the first received header because of mailing lists or other things, because of that I had looked at the entire header. Maybe /(?:^To:\s+|^CC:\s+|<)$address/ms?

      Well, you have to adjust the regex to your needs, but as I said, it's one match that covers all the places the address could be...including the whole header block, if necessary, versus a bunch of reads and matches.

      I generally use /xms on all my regexes as that is how I expect them to work, and if I add them, it doesn't hurt even if I don't use the feature. Is there a reason NOT to use /x and /s? Do they slow down the regex?

      My feeling is that setting features you don't use as defaults is a bad practice. Programming is very much a thinking endeavor. Always setting /xms and can lead to some very nasty bugs when you forget what those options actually mean or that you have set them. For example, '+' is greedy. Once you have a basic match, like m/Start.+finish/s for example, this regex will search all the way to the end of the block of text and start working backwards to find 'finish'. Without the /s modifier, it only searches to the next newline to start working backwards.

      Similarly, /m just lets you match ^ and $ against embedded newlines. If you forget and search for m/^Something/m, you might get unexpected results. They are just tools. You can write code that always accommodates the use of those modifiers, but why? It's like deciding that you will always use a screwdriver, even when you don't need it. It's odd...

      Anyway, I stripped the code to its basics and benchmarked it. The one read/one match approach is about 30% faster than your approach, so there ya go. Also, I copied sub1 and sub3 as sub2 and sub4, and then changed the regex to use or not use /xms. Turns out sub4 runs about 2% more slowly than sub3, probably because the particular regex does a lot of backtracking. sub2, where I removed the /xms, runs about 10% more slowly! I've run it a few times, and it's consistent. I didn't expect that and don't understand it.

      Cheers!

      --marmot
      #!/usr/bin/perl use strict; use warnings; use File::Find; use Benchmark qw(:all) ; sub sub1 { my %addresses; open my $fh, '<test.eml' or die; while (<$fh>) { last if $_ eq "\n"; # only scan headers $_ = lc $_; if (/\b(johnqp\@mailserver\.com)/xms) { my $addr = $1; $addresses{$addr}++; } } close $fh; } sub sub2 { my %addresses; open my $fh, '<test.eml' or die; while (<$fh>) { last if $_ eq "\n"; # only scan headers $_ = lc $_; if (/\b(johnqp\@mailserver\.com)/) { my $addr = $1; $addresses{$addr}++; } } close $fh; } sub sub3 { my %addresses; open my $fh, '<test.eml' or die; local $/ = ''; $_ = <$fh>; if (/^(?:To|Cc):.+(johnqp\@mailserver\.com)/mi) { my $addr = $1; $addresses{$addr}++; } close $fh; } sub sub4 { my %addresses; open my $fh, '<test.eml' or die; local $/ = ''; $_ = <$fh>; if (/^(?:To|Cc):.+(johnqp\@mailserver\.com)/xsmi) { my $addr = $1; $addresses{$addr}++; } close $fh; } cmpthese(100000, { 'Linewise' => \&sub1, 'Line no /xms' => \&sub2, 'Blockwise' => \&sub3, 'Block /xms' => \&sub4, }); <STDIN>; __END__ Rate Line no /xms Linewise Block /xms Blockwis +e Linewise 3256/s 11% -- -22% -24 +% Line no /xms 2946/s -- -10% -30% -31 +% Blockwise 4282/s 45% 32% 2% - +- Block /xms 4198/s 43% 29% -- -2 +%

        I also tried a few different things and it seems that using or not /x, /m or /s depends entirely on whether you want them, the speed difference is negligible.

        I was amazed how much speed difference there was in /i! The amount of slowdown from using it instead of turning the entire string lowercase was amazing.

        What made the most difference was not repeating myself, reading the entire header at once and also matching the regex against the header just once. Grabbing only the one match out of the headers (no /g) also sped things up quite a bit and although it may miss something, in this case not very likely, Don't do more than you have to. I will use this in the future and just that made this entire thread worth it to me.

        As far as setting /xms by default, it makes perlcritic happy (and critic complains if a regex doesn't have them, nothing except reading the code tells me that they do) and I have had as much strangeness when I expected whitespace not to match but forgot to set /s, so errors can happen either way.

        One thing I was unsure of, when you talk of "anchoring" your examples seem like that means using ^ or $ when I would think any static string (like "andrew\+") would count as an anchor in a regex?

        I also included just reading the headers in both ways to show how much of the time was taken by file access and it was significant.

        
                         Rate Case_Ins No_X  Old No_M No_S No_XMS No_Strip 1_Match 1_Read 1_Read_1_Match Just_Read Just_1_Read
        Case_Ins       23.2/s       -- -44% -45% -45% -45%   -45%     -46%    -51%   -55%           -58%      -65%        -69%
        No_X           41.7/s      80%   --  -1%  -1%  -1%    -1%      -2%    -11%   -19%           -25%      -37%        -44%
        Old            42.0/s      81%   1%   --  -0%  -0%    -0%      -2%    -11%   -18%           -24%      -37%        -43%
        No_M           42.0/s      81%   1%   0%   --  -0%    -0%      -2%    -11%   -18%           -24%      -37%        -43%
        No_S           42.2/s      82%   1%   0%   0%   --    -0%      -1%    -10%   -18%           -24%      -37%        -43%
        No_XMS         42.2/s      82%   1%   0%   0%   0%     --      -1%    -10%   -18%           -24%      -37%        -43%
        No_Strip       42.7/s      84%   3%   2%   2%   1%     1%       --     -9%   -17%           -23%      -36%        -42%
        1_Match        46.9/s     102%  13%  12%  12%  11%    11%      10%      --    -8%           -15%      -30%        -37%
        1_Read         51.3/s     121%  23%  22%  22%  22%    22%      20%      9%     --            -7%      -23%        -31%
        1_Read_1_Match 55.2/s     138%  33%  31%  31%  31%    31%      29%     18%     8%             --      -17%        -25%
        Just_Read      66.7/s     187%  60%  59%  59%  58%    58%      56%     42%    30%            21%        --        -10%
        Just_1_Read    74.1/s     219%  78%  76%  76%  76%    76%      73%     58%    44%            34%       11%          --
        
        #!/usr/bin/perl use strict; use warnings; use File::Find; use Benchmark qw/ cmpthese /; my %ad; my @dirs = glob( $ENV{HOME} . '/Maildir/.misc.d*' ); cmpthese( 100, { Old => sub { %ad = (); find( \&old, @dirs ) +; }, Case_Ins => sub { %ad = (); find( \&case_insensitive, @dirs ) + }, '1_Read' => sub { %ad = (); find( \&one_read, @dirs ) + }, '1_Match' => sub { %ad = (); find( \&one_match, @dirs ) + }, '1_Read_1_Match' => sub { %ad = (); find( \&one_read_one_match, @dirs ) }, No_XMS => sub { %ad = (); find( \&no_xms, @dirs ) +}, No_X => sub { %ad = (); find( \&no_x, @dirs ) +}, No_M => sub { %ad = (); find( \&no_m, @dirs ) +}, No_S => sub { %ad = (); find( \&no_s, @dirs ) +}, No_Strip => sub { %ad = (); find( \&no_strip, @dirs ) +}, Just_Read => sub { %ad = (); find( \&just_read, @dirs ) +}, Just_1_Read => sub { %ad = (); find( \&just_one_read, @dirs ) +}, } ); sub old { return unless -f $_; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; $_ = lc $_; if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xms) { my $addr = $1; if ( $addr =~ s/[=\#\%](?:3d)?/@/xms ) { $addr =~ s/\@[^@]+$//xms; } $ad{$addr}++; } } close $fh; } sub no_xms { return unless -f $_; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; $_ = lc $_; if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/) { my $addr = $1; if ( $addr =~ s/[=\#\%](?:3d)?/@/ ) { $addr =~ s/\@[^@]+$//; } $ad{$addr}++; } } close $fh; } sub no_s { return unless -f $_; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; $_ = lc $_; if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xm) { my $addr = $1; if ( $addr =~ s/[=\#\%](?:3d)?/@/xm ) { $addr =~ s/\@[^@]+$//xm; } $ad{$addr}++; } } close $fh; } sub no_m { return unless -f $_; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; $_ = lc $_; if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xs) { my $addr = $1; if ( $addr =~ s/[=\#\%](?:3d)?/@/xs ) { $addr =~ s/\@[^@]+$//xs; } $ad{$addr}++; } } close $fh; } sub no_x { return unless -f $_; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; $_ = lc $_; if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/ms) { my $addr = $1; if ( $addr =~ s/[=\#\%](?:3d)?/@/ms ) { $addr =~ s/\@[^@]+$//ms; } $ad{$addr}++; } } close $fh; } sub no_strip { return unless -f $_; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; $_ = lc $_; if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xms) { $ad{$1}++; } } close $fh; } sub case_insensitive { return unless -f $_; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/ixms) { my $addr = $1; if ( $addr =~ s/[=\#\%](?:3[Dd])?/@/xms ) { $addr =~ s/\@[^@]+$//xms; } $ad{$addr}++; } } close $fh; } sub one_read { return unless -f $_; open my $fh, '<', $_ or die; local $/ = ''; $_ = lc <$fh>; close $fh; foreach my $addr (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/gxms) { if ( $addr =~ s/[=\#\%](?:3d)?/@/xms ) { $addr =~ s/\@[^@]+$//xms; } $ad{$addr}++; } } sub one_match { return unless -f $_; my $headers = ''; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; $headers .= lc $_; } close $fh; if ( $headers =~ /\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xms ) { my $addr = $1; if ( $addr =~ s/[=\#\%](?:3d)?/@/xms ) { $addr =~ s/\@[^@]+$//xms; } $ad{$addr}++; } } sub one_read_one_match { return unless -f $_; open my $fh, '<', $_ or die; local $/ = ''; $_ = lc <$fh>; close $fh; if ( my ($addr) = /\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xms ) { if ( $addr =~ s/[=\#\%](?:3d)?/@/xms ) { $addr =~ s/\@[^@]+$//xms; } $ad{$addr}++; } } sub just_read { return unless -f $_; open my $fh, '<', $_ or die; while (<$fh>) { last if $_ eq "\n"; } close $fh; } sub just_one_read { return unless -f $_; open my $fh, '<', $_ or die; local $/ = ''; $_ = <$fh>; close $fh; }
        l8rZ,
        --
        andrew
Re^2: Looking for ideas on how to optimize this specialized grep
by remiah (Hermit) on Jan 25, 2011 at 02:41 UTC
    I didn't understand s and m switch of regex until furry_marmot's explanation... man perlre says about /ms

    'let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string'

    I didn't think of example that needs this. Do you have any example case like 'little princess' example for /ms?

    As for block mode of this example, I saw this way in awk script. I first met this way($\='') in perl.

    People sometimes say regex is slow, so I tried to use index function insted of regex. But it seems not improving time. I simplified just to pick up From address in this example and index version needs utf8 treatment for index and substr.
    use strict; use warnings; use File::Find; use Data::Dumper; my %addresses; sub test1 { my ($from); find(sub { return unless -f $_; open my $fh, '<', $_ or die; local $/ = ''; # "Paragraph" mode, reads a block of t +ext to next \n\n $_ = <$fh>; # Read Header block ($from)= $_ =~ /^From:(.*)/m; # /m to anchor #print "$from\n"; close $fh; }, glob('./009_mailtest/*')); #print Dumper \%addresses; } sub test2{ binmode(STDOUT,":utf8"); my ($from,$bgn,$end,$len); find(sub { return unless -f $_; open my $fh, '<:utf8', $_ or die; local $/ = ''; # "Paragraph" mode, reads a block of t +ext to next \n\n $_ = <$fh>; # Read Header block $bgn=index($_,"From:",0) + length("From:"); $end=index($_,chr(10),$bgn+1); $len=$end - $bgn; $from=substr($_, $bgn, $len); #print "$from\n"; close $fh; }, glob('./009_mailtest/*')); } my($start,$end); $start=(times)[0]; &test1; $end=(times)[0]; print "with regex=" . ($end - $start) . "sec\n"; $start=(times)[0]; &test2; $end=(times)[0]; print "without regex=" . ($end - $start) . "sec\n";
    The result for my 319Mb test mail box was like this.
    with regex=0.296875sec
    without regex=0.34375sec
    

      >> I didn't understand s and m switch of regex until furry_marmot's explanation...

      Thanks. Actually, they confused me for a long time when I was first learning Perl. I finally got it when I read Jeffrey Friedl's Mastering Regular Expressions; but I've always found a good example goes a loooong way.

      Some things to remember:

      1. .+ and .* are greedy. They look as far forward as they can and then work backwards to find the largest match possible (see example below).
      2. .+? and .*? are not greedy. They search forward from the current string position to find the earliest match possible. These are slower (I forget by how much), but sometimes they are what you need.
      3. /s allows '.' to match newlines, so .+ will look all the way to the end of whatever you're searching, whether it's a few characters, or several Kb of text, and then starts working backwards. Without /s, it only looks to the next newline to start looking back.
      4. /m is shorthand for (though not quite identical to) anchoring on a newline, but it can be useful to think of embedded lines in a block of text instead of thinking of a bunch of text and newlines all jumbled together.

      >> Do you have any example case like 'little prince' example for /ms?

      Sure. Here's an email header I pulled out of my spam catcher, with a bunch of regexes to illustrate.
      $text = <<'EOT'; Message-ID: <ODM2bWFpbGVyLmRpZWJlYS40MjYyNjE2LjEyOTU1NDE2MTg=@out-p-h. +customernews.net> From: "GenericOnline Pharmacy" <marmot@furrytorium.com> To: "Angie Morestead" <marmot@furrytorium.com> Subject: Buy drugs online now! Date: Thu, 20 Jan 2011 18:40:18 +0200 Content-Type: multipart/related; boundary="----=_Weigard_drugs_CG_0" EOT $text =~ /^Subject:.+drugs/m; # Anchor just after \n, before Subject. # Matches 'Subject: Buy drugs' $text =~ /\nSubject:.+drugs/; # Equivalent $text =~ /^Subject:.+drugs/ms; # '.' matches newlines, all the way to # '..._Weigard_drugs', which is not wh +at we wanted. $text =~ /^Subject:.+?drugs/ms; # '.' matches newlines, but searches f +rom current string # position, stopping when it matches ' +Subject: Buy drugs'. # This is a little slower than the fir +st two, but # equivalent. /s is countered by the . ++?, but if 'drugs' # was not in the Subject line, the reg +ex would keep keep # on going. # Here are some fun ones. # The email address should be "Furry Marmot" <marmot@furrytorium.com>, + or just # marmot@furrytorium.com. Anything else is spam. print "Spam!!!\n" if $text =~ /^(?:From|To):\s*"(?!.+Furry Marmot)[^"]*" <marmot\@fu +rrytorium\.com>/m; # Regarding the [^"]*, if the regex finds Furry Marmot in quotes, it f +ails and this isn't # spam. But if it finds something else, we still have to match somethi +ng between the # quotes, and then match the email to determine if it is spam. # I should never see anything from me, to me. print "Spam!!!\n" if $text =~ /(?=^From:[^\n]+marmot\@furrytorium\.com).+^To:[^\n]+marm +ot\@furrytorium\.com/ms; # This starts at the beginning of header block, finds From: line with +my email address, # resets to start of block (because of zero-width lookahead assertion) +, then finds To: # line with my email address. It is the equivalent of... if ($text =~ /^From:.+marmot\@furrytorium\.com)/m && /^To:.+marmot\@fu +rrytorium\.com/m) { print "Spam!!!\n" } # ...but I can include the single pattern in a list of patterns that I + might want to match # against the string.

      >> People sometimes say regex is slow

      It depends on how it's used. The regex engine is actually pretty quick, but there are certain things that can really slow it down. It's been a while since I read Friedl's book, but basically the search engine looks for the start of a pattern, and then tries to find the rest. If the rest is not there, it backs out of what it was able to match and goes looking again.

      So just searching for /^From:.+marmot/m, it will first look for the beginning of the text, and then look at each character for a newline. Once it has that, it looks to see if the next character is an 'F'. If not, it backtracks and searches for the next newline. Once it finds 'From:', it looks again for a newline (because we're not using /s), and works back to see if it can find 'marmot'. If not, it backs out of the 'From:' it has matched so far and goes looking for another 'From:' line.

      More complex searches can cause it to backtrack up a storm. But a well-constructed regex can minimize that. Index is probably faster at searching for plaintext, but it can't search for patterns, which limits its usefulness.

      --marmot
        It took me long time to understand backtracking supress ? and regex like '[^"]*' to supress backtracking, so extended regex will take some time for me. Your example
        "Spam!!!\n" if $text =~ /^(?:From|To):\s*"(?!.+Furry Marmot)[^"]*" <marmot\@furryt +orium\.com>/m;
        is greek for me now, but sometime I will understand extend regex.
        print "Spam!!!\n" if $text =~ /(?=^From:[^\n]+marmot\@furrytorium\.com).+^To:[^\n]marmo +t\@furrytorium\.com/ms;
        this example of /ms and your explanation give me a clue for what is "zero width"

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://883643]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (9)
As of 2014-07-14 07:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (256 votes), past polls