Hello Monks!
I have a case of inefficient regexp handling when matching strings in a large file:
Looking for ONE string takes about 2 seconds, while looking for TWO strings takes about 77 seconds. Here is the code:
use strict;
use warnings;
use Benchmark;

my $file = shift || 'no_file';

timethese(
    1,
    {
        'one_string' => sub { one_string() },
        'two_string' => sub { two_string() },
    }
);

# Collect every line that contains a single literal string.
sub one_string {
    my $filter = '00901808';
    my $re = qr/$filter/o;
    my @matched;
    open(my $FH, '<', $file) or die "Cannot open $file: $!";
    while (my $rec = <$FH>) {
        if ($rec =~ $re) {
            push @matched, $rec;
        }
    }
    close $FH;
}

# Collect every line that contains either of two literal strings.
sub two_string {
    my $filter = '00901808|87654321';
    my $re = qr/$filter/o;
    my @matched;
    open(my $FH, '<', $file) or die "Cannot open $file: $!";
    while (my $rec = <$FH>) {
        if ($rec =~ $re) {
            push @matched, $rec;
        }
    }
    close $FH;
}
__END__
The result says:
# perl bench_regexp 100000lines.92MB.file
Benchmark: timing 1 iterations of one_string, two_string...
one_string: 2 wallclock secs ( 1.68 usr + 0.42 sys = 2.10 CPU) @ 0.48/s (n=1)
(warning: too few iterations for a reliable count)
two_string: 77 wallclock secs (76.13 usr + 0.59 sys = 76.72 CPU) @ 0.01/s (n=1)
(warning: too few iterations for a reliable count)
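One way to narrow this down, offered as a sketch and not as a diagnosis: my guess (an assumption, not something I have verified on this Perl build) is that a single literal can be found with a fast fixed-substring scan, while the alternation forces the regex engine onto a slower path. Comparing the alternation against one plain index() call per string would show whether the regex engine is the bottleneck. The sample records and the sub names match_alternation and match_index below are made up for illustration; in the real test the records would come from the file.

```perl
use strict;
use warnings;

# Hypothetical sample records standing in for lines of the large file.
my @records = (
    "aaa 00901808 bbb\n",
    "nothing to see here\n",
    "ccc 87654321 ddd\n",
    "12345678\n",
);

# The two literal strings from the post.
my @needles = ('00901808', '87654321');

# Variant 1: the alternation from the post, compiled once.
sub match_alternation {
    my ($lines) = @_;
    my $re = qr/00901808|87654321/;
    return grep { $_ =~ $re } @$lines;
}

# Variant 2: one literal substring search per needle, no regex engine.
sub match_index {
    my ($lines) = @_;
    my @matched;
    LINE: for my $rec (@$lines) {
        for my $needle (@needles) {
            if (index($rec, $needle) >= 0) {
                push @matched, $rec;
                next LINE;    # one hit is enough for this record
            }
        }
    }
    return @matched;
}

my @by_regex = match_alternation(\@records);
my @by_index = match_index(\@records);
print scalar(@by_regex), " / ", scalar(@by_index), " records matched\n";
```

Both variants should select the same records, so either sub can be dropped into the timethese() call above in place of two_string() once it reads from the file. If the index() version runs close to the one_string time, that would point at the alternation itself.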