Re^2: Looking for ideas on how to optimize this specialized grep

So you think setting $/ = "\n\n" and ($headers) = <$fh> should be faster?

It's $/ = '', not $/ = "\n\n". It's just the way it works. And setting it to $/ = undef will slurp in the whole file. Anyway, yes, I think one file read and one match will be a lot faster than a dozen or so reads and a dozen or so matches, depending on the particular header.

Then regex against $headers and anchor against specific headers. But, many addresses are "hidden" in the first received header because of mailing lists or other things, because of that I had looked at the entire header. Maybe /(?:^To:\s+|^CC:\s+|<)$address/ms?

Well, you have to adjust the regex to your needs, but as I said, it's one match that covers all the places the address could be...including the whole header block, if necessary, versus a bunch of reads and matches.

I generally use /xms on all my regexes as that is how I expect them to work, and if I add them, it doesn't hurt even if I don't use the feature. Is there a reason NOT to use /x and /s? Do they slow down the regex?

My feeling is that setting features you don't use as defaults is a bad practice. Programming is very much a thinking endeavor. Always setting /xms and can lead to some very nasty bugs when you forget what those options actually mean or that you have set them. For example, '+' is greedy. Once you have a basic match, like m/Start.+finish/s for example, this regex will search all the way to the end of the block of text and start working backwards to find 'finish'. Without the /s modifier, it only searches to the next newline to start working backwards.

Similarly, /m just lets you match ^ and $ against embedded newlines. If you forget and search for m/^Something/m, you might get unexpected results. They are just tools. You can write code that always accommodates the use of those modifiers, but why? It's like deciding that you will always use a screwdriver, even when you don't need it. It's odd...

Cheers!

#!/usr/bin/perl 
use strict;
use warnings;
use File::Find;
use Benchmark qw(:all) ;

sub sub1 {
    my %addresses;

    open my $fh, '<test.eml' or die;
    while (<$fh>) {
        last if $_ eq "\n"; # only scan headers
        $_ = lc $_;

        if (/\b(johnqp\@mailserver\.com)/xms) {
            my $addr = $1;
            $addresses{$addr}++;
        }

    }
    close $fh;
}


sub sub2 {
    my %addresses;

    open my $fh, '<test.eml' or die;
    while (<$fh>) {
        last if $_ eq "\n"; # only scan headers
        $_ = lc $_;

        if (/\b(johnqp\@mailserver\.com)/) {
            my $addr = $1;
            $addresses{$addr}++;
        }

    }
    close $fh;
}

sub sub3 {
    my %addresses;

    open my $fh, '<test.eml' or die;
    local $/ = '';
    $_ = <$fh>;
        if (/^(?:To|Cc):.+(johnqp\@mailserver\.com)/mi) {
        my $addr = $1;
        $addresses{$addr}++;
    }
    close $fh;
}


sub sub4 {
    my %addresses;

    open my $fh, '<test.eml' or die;
    local $/ = '';
    $_ = <$fh>;
        if (/^(?:To|Cc):.+(johnqp\@mailserver\.com)/xsmi) {
        my $addr = $1;
        $addresses{$addr}++;
    }
    close $fh;
}


cmpthese(100000, {
    'Linewise'  => \&sub1,
    'Line no /xms'  => \&sub2,
    'Blockwise' => \&sub3,
    'Block /xms' => \&sub4,
});

<STDIN>;

__END__

               Rate Line no /xms     Linewise   Block /xms    Blockwis
+e
Linewise     3256/s          11%           --         -22%         -24
+%
Line no /xms 2946/s           --         -10%         -30%         -31
+%
Blockwise    4282/s          45%          32%           2%           -
+-
Block /xms   4198/s          43%          29%           --          -2
+%
[download]

[reply]
[d/l]
[select]

I also tried a few different things and it seems that using or not /x, /m or /s depends entirely on whether you want them, the speed difference is negligible.

I was amazed how much speed difference there was in /i! The amount of slowdown from using it instead of turning the entire string lowercase was amazing.

What made the most difference was not repeating myself, reading the entire header at once and also matching the regex against the header just once. Grabbing only the one match out of the headers (no /g) also sped things up quite a bit and although it may miss something, in this case not very likely, Don't do more than you have to. I will use this in the future and just that made this entire thread worth it to me.

As far as setting /xms by default, it makes perlcritic happy (and critic complains if a regex doesn't have them, nothing except reading the code tells me that they do) and I have had as much strangeness when I expected whitespace not to match but forgot to set /s, so errors can happen either way.

One thing I was unsure of, when you talk of "anchoring" your examples seem like that means using ^ or $ when I would think any static string (like "andrew\+") would count as an anchor in a regex?

I also included just reading the headers in both ways to show how much of the time was taken by file access and it was significant.


                 Rate Case_Ins No_X  Old No_M No_S No_XMS No_Strip 1_Match 1_Read 1_Read_1_Match Just_Read Just_1_Read
Case_Ins       23.2/s       -- -44% -45% -45% -45%   -45%     -46%    -51%   -55%           -58%      -65%        -69%
No_X           41.7/s      80%   --  -1%  -1%  -1%    -1%      -2%    -11%   -19%           -25%      -37%        -44%
Old            42.0/s      81%   1%   --  -0%  -0%    -0%      -2%    -11%   -18%           -24%      -37%        -43%
No_M           42.0/s      81%   1%   0%   --  -0%    -0%      -2%    -11%   -18%           -24%      -37%        -43%
No_S           42.2/s      82%   1%   0%   0%   --    -0%      -1%    -10%   -18%           -24%      -37%        -43%
No_XMS         42.2/s      82%   1%   0%   0%   0%     --      -1%    -10%   -18%           -24%      -37%        -43%
No_Strip       42.7/s      84%   3%   2%   2%   1%     1%       --     -9%   -17%           -23%      -36%        -42%
1_Match        46.9/s     102%  13%  12%  12%  11%    11%      10%      --    -8%           -15%      -30%        -37%
1_Read         51.3/s     121%  23%  22%  22%  22%    22%      20%      9%     --            -7%      -23%        -31%
1_Read_1_Match 55.2/s     138%  33%  31%  31%  31%    31%      29%     18%     8%             --      -17%        -25%
Just_Read      66.7/s     187%  60%  59%  59%  58%    58%      56%     42%    30%            21%        --        -10%
Just_1_Read    74.1/s     219%  78%  76%  76%  76%    76%      73%     58%    44%            34%       11%          --

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use Benchmark qw/ cmpthese /;

my %ad;
my @dirs = glob( $ENV{HOME} . '/Maildir/.misc.d*' );

cmpthese(
    100,
    {   Old       => sub { %ad = (); find( \&old,              @dirs )
+; },
        Case_Ins  => sub { %ad = (); find( \&case_insensitive, @dirs )
+ },
        '1_Read'  => sub { %ad = (); find( \&one_read,         @dirs )
+ },
        '1_Match' => sub { %ad = (); find( \&one_match,        @dirs )
+ },
        '1_Read_1_Match' =>
            sub { %ad = (); find( \&one_read_one_match, @dirs ) },
        No_XMS      => sub { %ad = (); find( \&no_xms,        @dirs ) 
+},
        No_X        => sub { %ad = (); find( \&no_x,          @dirs ) 
+},
        No_M        => sub { %ad = (); find( \&no_m,          @dirs ) 
+},
        No_S        => sub { %ad = (); find( \&no_s,          @dirs ) 
+},
        No_Strip    => sub { %ad = (); find( \&no_strip,      @dirs ) 
+},
        Just_Read   => sub { %ad = (); find( \&just_read,     @dirs ) 
+},
        Just_1_Read => sub { %ad = (); find( \&just_one_read, @dirs ) 
+},
    }
);

sub old {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";
        $_ = lc $_;

        if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xms) {
            my $addr = $1;
            if ( $addr =~ s/[=\#\%](?:3d)?/@/xms ) {
                $addr =~ s/\@[^@]+$//xms;
            }
            $ad{$addr}++;
        }

    }
    close $fh;
}

sub no_xms {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";
        $_ = lc $_;

        if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/) {
            my $addr = $1;
            if ( $addr =~ s/[=\#\%](?:3d)?/@/ ) {
                $addr =~ s/\@[^@]+$//;
            }
            $ad{$addr}++;
        }

    }
    close $fh;
}

sub no_s {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";
        $_ = lc $_;

        if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xm) {
            my $addr = $1;
            if ( $addr =~ s/[=\#\%](?:3d)?/@/xm ) {
                $addr =~ s/\@[^@]+$//xm;
            }
            $ad{$addr}++;
        }

    }
    close $fh;
}

sub no_m {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";
        $_ = lc $_;

        if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xs) {
            my $addr = $1;
            if ( $addr =~ s/[=\#\%](?:3d)?/@/xs ) {
                $addr =~ s/\@[^@]+$//xs;
            }
            $ad{$addr}++;
        }

    }
    close $fh;
}

sub no_x {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";
        $_ = lc $_;

        if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/ms) {
            my $addr = $1;
            if ( $addr =~ s/[=\#\%](?:3d)?/@/ms ) {
                $addr =~ s/\@[^@]+$//ms;
            }
            $ad{$addr}++;
        }

    }
    close $fh;
}

sub no_strip {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";
        $_ = lc $_;

        if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xms) {
            $ad{$1}++;
        }

    }
    close $fh;
}

sub case_insensitive {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";

        if (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/ixms) {
            my $addr = $1;
            if ( $addr =~ s/[=\#\%](?:3[Dd])?/@/xms ) {
                $addr =~ s/\@[^@]+$//xms;
            }
            $ad{$addr}++;
        }

    }
    close $fh;
}

sub one_read {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    local $/ = '';
    $_ = lc <$fh>;
    close $fh;

    foreach my $addr (/\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/gxms) {
        if ( $addr =~ s/[=\#\%](?:3d)?/@/xms ) {
            $addr =~ s/\@[^@]+$//xms;
        }
        $ad{$addr}++;
    }
}

sub one_match {
    return unless -f $_;

    my $headers = '';
    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";
        $headers .= lc $_;
    }
    close $fh;

    if ( $headers =~ /\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xms ) {
        my $addr = $1;
        if ( $addr =~ s/[=\#\%](?:3d)?/@/xms ) {
            $addr =~ s/\@[^@]+$//xms;
        }
        $ad{$addr}++;
    }
}

sub one_read_one_match {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    local $/ = '';
    $_ = lc <$fh>;
    close $fh;

    if ( my ($addr) = /\b(andrew\+[^\@\s]+\@[-a-z0-9\.]+)\b/xms ) {
        if ( $addr =~ s/[=\#\%](?:3d)?/@/xms ) {
            $addr =~ s/\@[^@]+$//xms;
        }
        $ad{$addr}++;
    }
}

sub just_read {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    while (<$fh>) {
        last if $_ eq "\n";
    }
    close $fh;
}

sub just_one_read {
    return unless -f $_;

    open my $fh, '<', $_ or die;
    local $/ = '';
    $_ = <$fh>;
    close $fh;
}
[download]

l8rZ,
--
andrew

[reply]
[d/l]

>> One thing I was unsure of, when you talk of "anchoring" your examples seem like that means using ^ or $ when I would think any static string (like "andrew\+") would count as an anchor in a regex?

Yes, absolutely. ^ and $ are common, but anything solid that isn't a pattern, e.g. both 'a' and 'b' in /a.+b/, could be called anchors. Also \G is an anchor when combined with /g, in a loop.

[reply]
[d/l]
[select]


We don't bite newbies here... much
	PerlMonks