I'd like it to be known that, on a string with many X's in it, doing
is not very fast. It tries to match at each X it finds. You'd probably be better off doing$str =~ s/X[^X]*$/X/;
Although in that case, it'd be sweet to use Regexp::Keep and say:$str =~ s/(.*X).*/$1/;
or do it the two-regex way (since \K isn't core Perl, it's not as fast):$str =~ s/.*X\K.*//;
Or you could reverse the string:$str =~ /.*X/g and $str =~ s/\G.*//;
Of course, for this task, a regex is probably the wrong tool. You can use string functions:($str = reverse $str) =~ s/^[^X]+//; $str = reverse $str;
Here's a benchmark of these methods (leaving out \K, because it's not worth showing):substr($str, rindex($str, "X")+1) = "";
Notice how much better-suited substr() is for this task.use Benchmark 'cmpthese'; my $str = "alphabet X alphabet" x 100 . "junk at the end" x 10; cmpthese(-5, { # is this the last X? last => sub { my $x = $str; $x =~ s/X[^X]*$/X/ }, # capture up to the last X and replace it with itself capt_repl => sub { my $x = $str; $x =~ s/(.*X).*/$1/ }, # match up to the last X, then remove everything after it rx_rx => sub { my $x = $str; $x =~ /.*X/g and $x =~ s/\G.*// }, # reverse, remove up to first X, reverse sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^X]+//; $x = reverse $x; }, # find the last X, remove everything after it substr => sub { my $x = $str; substr($x, rindex($x, "X")+1) = "" }, }); __END__ Rate last sexeger capt_repl rx_rx substr last 6398/s -- -84% -84% -90% -97% sexeger 39625/s 519% -- -4% -37% -83% capt_repl 41274/s 545% 4% -- -35% -82% rx_rx 63380/s 891% 60% 54% -- -73% substr 230796/s 3507% 482% 459% 264% --
You encounter a problem in speed, though, when X is not just a character, but a character class. First of all, our substr() approach fails immediately, because it uses index(), which looks for a substring, not one of a set of characters. Let's do the same benchmark, but change X to the character class of A-Z.
This slow-down is caused by the character class. Because we're not looking for a SINGLE character, we can't jump backwards (the regex engine knows how to handle /.*A/ quickly -- it can "jump" backward to an "A", instead of examining each character -- but it can't handle /.*[AB]/ as quickly).cmpthese(-5, { last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ }, capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ }, rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// }, sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x; }, }); __END__ Rate last capt_repl rx_rx sexeger last 5591/s -- -82% -86% -86% capt_repl 30599/s 447% -- -21% -24% rx_rx 38948/s 597% 27% -- -3% sexeger 40049/s 616% 31% 3% --
So what can we do? We can use String::Index, which gives us functions that act like C's strpbrk(), but can do even more. For those of you not familiar with strpbrk() (whose name I can't decipher), it takes a string to look at and a string of characters to find in that source string.
In Perl, it'd be like doing:/* C code */ printf("%s", strpbrk("breakfast", "aeiou")); /* prints: eakfast */
That is, it returns the earliest location in the source string of one of the characters in the second string. It's uncool that there's no standard C function that does this from the back of the string, or for all characters except those given...if ("breakfast" =~ /[aeiou]/) { $x = substr("breakfast", $-[0]); }
That's what the String::Index module was written to do! Let's apply it to this problem and run another benchmark:
We are restored. It's not as fast as the original substr() approach because it has to do more work, but it's faster than any other solution.# crindex is char-class-based rindex() use String::Index 'crindex'; cmpthese(-5, { last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ }, capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ }, rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// }, sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x; }, crindex => sub { my $x = $str; substr($x, crindex($x, "ABCDEFGHIJKLMNOPQRSTUVWXYZ")+1) = ""; }, }); __END__ Rate last capt_repl rx_rx sexeger crindex last 5581/s -- -82% -85% -86% -90% capt_repl 30606/s 448% -- -20% -23% -44% rx_rx 38194/s 584% 25% -- -4% -30% sexeger 39594/s 609% 29% 4% -- -28% crindex 54829/s 882% 79% 44% 38% --
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: Regexes are slow (or, why I advocate String::Index)
by revdiablo (Prior) on May 14, 2004 at 23:46 UTC | |
by Fletch (Bishop) on May 15, 2004 at 03:18 UTC | |
by jordanh (Chaplain) on May 15, 2004 at 19:46 UTC | |
Re: Regexes are slow (or, why I advocate String::Index)
by halley (Prior) on May 17, 2004 at 13:59 UTC | |
Re: Regexes are slow (or, why I advocate String::Index)
by William G. Davis (Friar) on May 21, 2004 at 04:31 UTC |