note
furry_marmot
<p><i>So you think setting $/ = "\n\n" and ($headers) = <$fh> should be faster? </i></p>
<p>It's <c>$/ = ''</c>, not <c>$/ = "\n\n"</c>. The empty string puts Perl in paragraph mode, where any run of consecutive blank lines counts as a single record terminator; <c>"\n\n"</c> matches exactly two newlines. Setting <c>$/ = undef</c> slurps the whole file. Anyway, yes, I think one file read and one match will be a lot faster than a dozen or so reads and a dozen or so matches, depending on the particular header block.</p>
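<p>A minimal sketch of the three <c>$/</c> settings, in case the difference is easier to see in code. The message text and the <c>records</c> helper are made up for illustration; it reads from an in-memory filehandle rather than a real .eml file.</p>
<c>
use strict;
use warnings;

# Hypothetical message: headers, TWO blank lines, then two body paragraphs.
my $msg = "To: a\@example.com\nCc: b\@example.com\n\n\nBody 1\n\nBody 2\n";

# Read every record from an in-memory copy of $msg using separator $sep.
sub records {
    my ($sep) = @_;
    open my $fh, '<', \$msg or die "Cannot open in-memory file: $!";
    local $/ = $sep;
    my @recs = <$fh>;
    close $fh;
    return @recs;
}

my @para  = records('');      # paragraph mode: runs of blank lines collapse
my @fixed = records("\n\n");  # literal separator: the extra newline is kept
my @slurp = records(undef);   # no separator: one record, the whole file

printf "paragraph mode, 2nd record starts with newline: %s\n",
    $para[1]  =~ /\A\n/ ? 'yes' : 'no';
printf "\"\\n\\n\" mode,    2nd record starts with newline: %s\n",
    $fixed[1] =~ /\A\n/ ? 'yes' : 'no';
printf "undef mode returned %d record(s)\n", scalar @slurp;
</c>
<p>For the header-only case in this thread, only the first record matters, so <c>''</c> and <c>"\n\n"</c> would give the same header block; they diverge on what comes after it.</p>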
<p><i>Then regex against $headers and anchor against specific headers. But, many addresses are "hidden" in the first received header because of mailing lists or other things, because of that I had looked at the entire header. Maybe /(?:^To:\s+|^CC:\s+|<)$address/ms?</i></p>
<p>Well, you have to adjust the regex to your needs, but as I said, it's one match that covers all the places the address could be...including the whole header block, if necessary, versus a bunch of reads and matches.</p>
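<p>As a rough sketch of "one match that covers all the places", here is the alternation from your question run once against a whole header block. The header text is invented, and the regex is only one possible adjustment, not a tested recipe for real mail.</p>
<c>
use strict;
use warnings;

# Hypothetical header block where the address appears only in Received:.
my $headers = "Received: from mx by host for <johnqp\@mailserver.com>\n"
            . "To: some-list\@example.org\n"
            . "Subject: hello\n\n";

# quotemeta escapes the '.' and '@' so the address is matched literally.
my $address = quotemeta 'johnqp@mailserver.com';

# One match: address right after To:/Cc: at a line start, or after '<' anywhere.
my ($hit) = $headers =~ /(?:^To:\s+|^Cc:\s+|<)($address)/mi;

print defined $hit ? "found $hit\n" : "not found\n";
</c>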
<p><i>I generally use /xms on all my regexes as that is how I expect them to work, and if I add them, it doesn't hurt even if I don't use the feature. Is there a reason NOT to use /x and /s? Do they slow down the regex?</i></p>
<p>My feeling is that setting features you don't use as defaults is a bad practice. Programming is very much a thinking endeavor. Always setting /xms can lead to some very nasty bugs when you forget what those options actually mean or that you have set them. For example, '+' is greedy. With <c>m/Start.+finish/s</c>, once 'Start' has matched, the <c>.+</c> runs all the way to the end of the block of text and then works backwards to find 'finish'. Without the /s modifier, it only runs to the next newline before working backwards.</p>
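<p>A small illustration of that greedy behaviour, using a made-up two-line string; the only difference between the two matches is the /s modifier.</p>
<c>
use strict;
use warnings;

# Hypothetical text with 'finish' on both lines.
my $text = "Start one finish\nsecond line finish\n";

# Without /s, '.' stops at the newline, so the match stays on line one.
my ($no_s)   = $text =~ /(Start.+finish)/;

# With /s, '.' crosses newlines, so the greedy '.+' reaches the LAST 'finish'.
my ($with_s) = $text =~ /(Start.+finish)/s;

printf "without /s: %d chars matched\n", length $no_s;
printf "with /s:    %d chars matched\n", length $with_s;
</c>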
<p>Similarly, /m just lets ^ and $ match at embedded newlines. If you forget that it's set and write <c>m/^Something/m</c>, it will match at the start of any line, not just the start of the string, and you might get unexpected results. They are just tools. You can write code that always accommodates the use of those modifiers, but why? It's like deciding that you will always use a screwdriver, even when you don't need it. It's odd...</p>
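<p>To make the /m point concrete, here is a tiny sketch with an invented two-line string; the same pattern fails or succeeds depending only on the modifier.</p>
<c>
use strict;
use warnings;

# Hypothetical headers: 'Subject' starts the second line, not the string.
my $headers = "From: someone\nSubject: Something else\n";

# Without /m, ^ anchors only at the very start of the string.
my $plain = ($headers =~ /^Subject/)  ? 1 : 0;

# With /m, ^ also anchors after every embedded newline.
my $multi = ($headers =~ /^Subject/m) ? 1 : 0;

print "without /m: $plain\n";
print "with /m:    $multi\n";
</c>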
<p>Anyway, I stripped the code to its basics and benchmarked it. The one read/one match approach (sub3, 'Blockwise') is about 30% faster than your approach (sub1, 'Linewise'), so there ya go. Also, I copied sub1 and sub3 as sub2 and sub4, and then changed the regex to use or not use /xms. Turns out sub4, with /xms added, runs about 2% more slowly than sub3, probably because the particular regex does a lot of backtracking. sub2, where I removed the /xms, runs about 10% more slowly than sub1! I've run it a few times, and it's consistent. I didn't expect that and don't understand it.</p>
<p>Cheers!</p>
--marmot
<c>
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Benchmark qw(:all) ;
# Linewise: read the headers one line at a time, match each line (/xms set).
sub sub1 {
    my %addresses;
    open my $fh, '<', 'test.eml' or die "Cannot open test.eml: $!";
    while (<$fh>) {
        last if $_ eq "\n";    # only scan headers
        $_ = lc $_;
        if (/\b(johnqp\@mailserver\.com)/xms) {
            my $addr = $1;
            $addresses{$addr}++;
        }
    }
    close $fh;
}
# Linewise again, same as sub1 but without the /xms modifiers.
sub sub2 {
    my %addresses;
    open my $fh, '<', 'test.eml' or die "Cannot open test.eml: $!";
    while (<$fh>) {
        last if $_ eq "\n";    # only scan headers
        $_ = lc $_;
        if (/\b(johnqp\@mailserver\.com)/) {
            my $addr = $1;
            $addresses{$addr}++;
        }
    }
    close $fh;
}
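# Blockwise: read the whole header block in one go, then match once.
sub sub3 {
    my %addresses;
    open my $fh, '<', 'test.eml' or die "Cannot open test.eml: $!";
    local $/ = '';             # paragraph mode: first record is the header block
    $_ = <$fh>;
    if (/^(?:To|Cc):.+(johnqp\@mailserver\.com)/mi) {
        my $addr = $1;
        $addresses{$addr}++;
    }
    close $fh;
}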
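# Blockwise again, same as sub3 but with /x and /s added.
sub sub4 {
    my %addresses;
    open my $fh, '<', 'test.eml' or die "Cannot open test.eml: $!";
    local $/ = '';             # paragraph mode: first record is the header block
    $_ = <$fh>;
    if (/^(?:To|Cc):.+(johnqp\@mailserver\.com)/xsmi) {
        my $addr = $1;
        $addresses{$addr}++;
    }
    close $fh;
}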
cmpthese(100000, {
    'Linewise'     => \&sub1,
    'Line no /xms' => \&sub2,
    'Blockwise'    => \&sub3,
    'Block /xms'   => \&sub4,
});
<STDIN>;
__END__
               Rate Line no /xms     Linewise   Block /xms    Blockwise
Linewise     3256/s          11%           --         -22%         -24%
Line no /xms 2946/s           --         -10%         -30%         -31%
Blockwise    4282/s          45%          32%           2%           --
Block /xms   4198/s          43%          29%           --          -2%
</c>
883603
884011