Re: You don't always have to use regexes
by kvale (Monsignor) on Feb 23, 2005 at 16:26 UTC
|
Using the simplest op that gets the job done is always good advice, both for speed and readability.
But for those that are addicted to regexes, the above situation won't bite speed too hard. The regex engine optimizes a fixed string to a Boyer-Moore match, which is a tad slower than string equality:
use Benchmark qw(:all) ;
my $value = 'FALSE';
my $count = 10_000_000;
cmpthese($count, {
'regex' => sub { $value =~ /^true$/i },
'eq' => sub { lc $value eq "true" },
});
yields
1048% perl boyer.pl
Benchmark: timing 10000000 iterations of eq, regex...
eq: 9 wallclock secs ( 8.98 usr + 0.00 sys = 8.98 CPU) @ 1113585.75/s (n=10000000)
regex: 16 wallclock secs (16.31 usr + 0.00 sys = 16.31 CPU) @ 613120.78/s (n=10000000)
Rate regex eq
regex 613121/s -- -45%
eq 1113586/s 82% --
Unless that match is inside a tight loop, program performance will not be degraded much.
Re: You don't always have to use regexes
by spurperl (Priest) on Feb 23, 2005 at 16:02 UTC
|
It's quite interesting to time this and see just how much performance is gained. Additionally, I'm curious whether the regex engine has, or is planned to have, optimizations for "static" expressions like this?
Additionally, usage of substr can save quite a few regular expressions here and there.
But the rule of thumb should be: use whatever seems more natural for the problem at hand, and optimize only if necessary.
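To make the substr point concrete, here is a minimal sketch of a fixed-prefix test written both ways. The helper names and sample strings are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: does $str start with the literal $prefix? (substr version)
sub starts_with_substr {
    my ($str, $prefix) = @_;
    return substr($str, 0, length $prefix) eq $prefix ? 1 : 0;
}

# Hypothetical helper: the same test as a regex.
sub starts_with_regex {
    my ($str, $prefix) = @_;
    # \Q...\E quotes any regex metacharacters in the literal prefix.
    return $str =~ /^\Q$prefix\E/ ? 1 : 0;
}

for my $s ('true story', 'untrue', 'tr') {
    printf "%-14s substr=%d regex=%d\n", "'$s'",
        starts_with_substr($s, 'true'),
        starts_with_regex($s, 'true');
}
```

Both versions should agree on every input; which one reads more naturally is a matter of taste.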
Re: You don't always have to use regexes
by VSarkiss (Monsignor) on Feb 23, 2005 at 16:22 UTC
|
Overuse of regexes is one of my favorite pet peeves also.
As you point out, eq will sometimes do everything you need. Other times all you need is index. For example, if your regex didn't have anchors:
if ( $value =~ /true/i )
You could write instead
if ( index( lc $value, "true" ) >= 0 )
|
I think that for the index case the situation is not so clear. Both the regex engine and index() will use the same Boyer-Moore routine and for me personally, the regex version is more readable. But as always, YMMV.
use Benchmark qw(:all) ;
my $value = 'FALSE';
my $count = 1_000_000;
cmpthese($count, {
'regex' => sub { $value =~ /^true$/i },
'eq' => sub { lc $value eq "true" },
'index' => sub { index( lc $value, "true" ) >= 0 },
});
yields
Benchmark: timing 1000000 iterations of eq, index, regex...
eq: 1 wallclock secs ( 0.89 usr + 0.00 sys = 0.89 CPU) @ 1123595.51/s (n=1000000)
index: 2 wallclock secs ( 1.65 usr + 0.00 sys = 1.65 CPU) @ 606060.61/s (n=1000000)
regex: 2 wallclock secs ( 1.63 usr + 0.00 sys = 1.63 CPU) @ 613496.93/s (n=1000000)
Rate index regex eq
index 606061/s -- -1% -46%
regex 613497/s 1% -- -45%
eq 1123596/s 85% 83% --
Update: As AM has pointed out (thank you!), the benchmark above has a bug. Using the tests
'regex' => sub { $value =~ /true/i },
'regex_anch' => sub { $value =~ /^true$/i },
'eq' => sub { lc $value eq "true" },
'index' => sub { index( lc $value, "true" ) >= 0 },
I get the results
Benchmark: timing 1000000 iterations of eq, index, regex, regex_anch...
eq: 1 wallclock secs ( 0.88 usr + 0.00 sys = 0.88 CPU) @ 1136363.64/s (n=1000000)
index: 0 wallclock secs ( 1.65 usr + 0.00 sys = 1.65 CPU) @ 606060.61/s (n=1000000)
regex: 0 wallclock secs ( 1.08 usr + 0.00 sys = 1.08 CPU) @ 925925.93/s (n=1000000)
regex_anch: 2 wallclock secs ( 1.59 usr + 0.00 sys = 1.59 CPU) @ 628930.82/s (n=1000000)
Rate index regex_anch regex eq
index 606061/s -- -4% -35% -47%
regex_anch 628931/s 4% -- -32% -45%
regex 925926/s 53% 47% -- -19%
eq 1136364/s 87% 81% 23% --
with the surprising result that the regex w/o the anchor is faster than the anchored version. Multiple runs yield similar results. As the AM says, one could try many different regex-value combos, but I expect the results to be not far different, precisely because both index and regex engine use the same BM function.
|
Your benchmark is significantly flawed for the question asked. The OR (original replier) wanted to compare index(lc $value,"true") with $value =~ /true/i. In addition, to benchmark fairly one should try multiple test cases: set $value to 'true', 'ashortstringthentrue', 'averylongstringthentrue', and to strings of different sizes without 'true' in them.
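A rough sketch of what such a multi-case benchmark could look like follows. The helper names, the exact sample strings, and the iteration count are assumptions chosen for illustration; no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical helpers wrapping the two expressions under test.
sub has_true_index { my ($v) = @_; return index(lc $v, 'true') >= 0 ? 1 : 0 }
sub has_true_regex { my ($v) = @_; return $v =~ /true/i ? 1 : 0 }

# Inputs along the lines suggested above; the strings themselves are arbitrary.
my @values = (
    'true',
    'ashortstringthentrue',
    ('x' x 200) . 'true',
    'short',
    'x' x 200,
);

# Arbitrary count; raise it if Benchmark warns about too few iterations.
my $count = 500_000;

for my $value (@values) {
    # The two tests must agree before comparing their speed makes sense.
    die "index and regex disagree on input of length " . length($value)
        unless has_true_index($value) == has_true_regex($value);

    printf "\nlength %d, contains 'true': %d\n", length $value, has_true_index($value);
    cmpthese($count, {
        index => sub { index(lc $value, 'true') >= 0 },
        regex => sub { $value =~ /true/i },
    });
}
```

Checking that the candidates agree on each input before timing them guards against the kind of mismatch pointed out above, where an anchored regex was compared against an unanchored index.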
|
I benchmarked this and it yields an interesting result. index() is (a bit) faster than a regex. If it's used in combination with lc(), as in your example, the regex with the i-modifier is faster.
use strict;
use warnings;
use Benchmark;
my $value = "somewhere here true is there!";
timethese
(
9000000,
{
'index' => sub { index( $value, "true" ) },
'regex' => sub { $value =~ /true/ },
}
);
timethese
(
9000000,
{
'index' => sub { index( lc $value, "true" ) },
'regex' => sub { $value =~ /true/i },
}
);
Benchmark: timing 9000000 iterations of index, regex...
index: 2 wallclock secs ( 2.02 usr + 0.00 sys = 2.02 CPU) @ 4446640.32/s (n=9000000)
regex: 4 wallclock secs ( 2.40 usr + -0.01 sys = 2.39 CPU) @ 3760969.49/s (n=9000000)
Benchmark: timing 9000000 iterations of index, regex...
index: 4 wallclock secs ( 4.55 usr + 0.00 sys = 4.55 CPU) @ 1979762.43/s (n=9000000)
regex: 3 wallclock secs ( 3.68 usr + 0.00 sys = 3.68 CPU) @ 2448313.38/s (n=9000000)
Update:
Ack. I really need to learn to type faster.
Re: You don't always have to use regexes
by Anonymous Monk on Feb 24, 2005 at 03:32 UTC
|
Code Smarter: Compulsory link to Japhy's node making the same suggestion, and more.
Edited by davido: fixed broken link.
Re: You don't always have to use regexes
by ysth (Canon) on Feb 24, 2005 at 19:34 UTC
|
A proper translation of if ( $value =~ /^true$/i ) would be:
if ( lc $value eq "true" || lc $value eq "true\n" )
(except that the former potentially sets $&, $`, and $' and the last-successful-regex).
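A small sketch of the difference, using helper names invented for this illustration: without /m, $ matches at the very end of the string or just before a final newline, while \z matches only at the very end.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub is_true_dollar { my ($v) = @_; return $v =~ /^true$/i  ? 1 : 0 }  # $ also matches before a final "\n"
sub is_true_z      { my ($v) = @_; return $v =~ /^true\z/i ? 1 : 0 }  # \z matches only at the very end
sub is_true_eq     { my ($v) = @_; return lc($v) eq 'true' ? 1 : 0 }  # plain string comparison

for my $v ("true", "TRUE\n", "true\n\n", "true ") {
    (my $shown = $v) =~ s/\n/\\n/g;    # make newlines visible in the output
    printf "%-12s dollar=%d z=%d eq=%d\n", qq{"$shown"},
        is_true_dollar($v), is_true_z($v), is_true_eq($v);
}
```

Only the "TRUE\n" input separates the three: the $ version accepts it, while the \z and eq versions reject it.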
|
Yes, but that check for "\n" is really irrelevant. It's required to be functionally identical, but not semantically.
Semantics are the real issue here. The regex is saying "Do you have a string that matches the beginning of the string, then t, r, u, e and then the end of the string", and the compare is saying "Is the string the word 'true'?"
"Is this the word I want" is the real intent.
|
My point was that that is not what the regex is saying. Just my own personal bonnet-bee, but people misinterpret $ way too often, and I feel it deserves publicity whenever it comes up.
Re: You don't always have to use regexes
by PetaMem (Priest) on Feb 24, 2005 at 21:57 UTC
|
I suppose the whole meaning of this example is to show how to program efficiently - not wasting system resources (here: CPU time).
If this is so, I'd like to put emphasis on the fact that NO ONE here seems to see a problem in the "true" expression. Please do not use interpolation if you do not need it. Try your benchmarks with 'true' again.
Update:
Of course I did the benchmarks before posting this node. The speed differences are not extraordinary, but consistently about 5%.
|
Actually, many of us saw it. But we also saw this: Re: To Single Quote or to Double Quote: a benchmark. The point is, the difference in speed is practically meaningless. In the grand scheme of the transition from $value =~ /true/i to lc $value eq "true", changing that to lc $value eq 'true' is going to have a demonstrably small effect.
|
And to support your point, an invariant string inside double-quotes gets compiled down to a single quoted string. Any time wasted is not wasted at run-time.
$ cat print.pl
print 'Hello';
print "Hello"; # compiles to 'Hello'
print "Hello $_";
$ perl -MO=Deparse print.pl
print 'Hello';
print 'Hello';
print "Hello $_";
print.pl syntax OK
5.005_03, 5.6.1 and 5.8.4 produce identical results.
|
|
I suppose the whole meaning of this example is to show how to program efficiently - not wasting system resources (here: CPU time).
Absolutely not. That has nothing to do with it. CPU efficiencies on the scale that we're talking about are irrelevant.
The point is to use the construct that most closely matches the semantics of what you're trying to achieve. If you're wondering if one string is the word "true", then that's not a pattern match, it's a string comparison.
|
If you're wondering if one string is the word "true", then that's not a pattern match, it's a string comparison.
Ok, I second that. Probably I was misled by the immediate popup of benchmarks in this thread.