Regexes are slow (or, why I advocate String::Index)

This comes from Truncating Last Sentence. This is a discussion on why regexes might not be so hot in a particular arena, and why my module helps out.

I'd like it to be known that, on a string with many X's in it, doing

$str =~ s/X[^X]*$/X/;
[download]

is not very fast. It tries to match at each X it finds. You'd probably be better off doing

$str =~ s/(.*X).*/$1/;
[download]

Although in that case, it'd be sweet to use Regexp::Keep and say:

$str =~ s/.*X\K.*//;
[download]

or do it the two-regex way (since \K isn't core Perl, it's not as fast):

$str =~ /.*X/g and $str =~ s/\G.*//;
[download]

Or you could reverse the string:

($str = reverse $str) =~ s/^[^X]+//;
$str = reverse $str;
[download]

Of course, for this task, a regex is probably the wrong tool. You can use string functions:

substr($str, rindex($str, "X")+1) = "";
[download]

Here's a benchmark of these methods (leaving out \K, because it's not worth showing):

use Benchmark 'cmpthese';
my $str = "alphabet X alphabet" x 100 . "junk at the end" x 10;

cmpthese(-5, {
  # is this the last X?
  last => sub { my $x = $str; $x =~ s/X[^X]*$/X/ },

  # capture up to the last X and replace it with itself
  capt_repl => sub { my $x = $str; $x =~ s/(.*X).*/$1/ },

  # match up to the last X, then remove everything after it
  rx_rx => sub { my $x = $str; $x =~ /.*X/g and $x =~ s/\G.*// },

  # reverse, remove up to first X, reverse
  sexeger => sub {
    my $x = $str; ($x = reverse $x) =~ s/^[^X]+//; $x = reverse $x;
  },

  # find the last X, remove everything after it
  substr => sub { my $x = $str; substr($x, rindex($x, "X")+1) = "" },
});

__END__
               Rate      last   sexeger capt_repl     rx_rx    substr
last        6398/s        --      -84%      -84%      -90%      -97%
sexeger    39625/s      519%        --       -4%      -37%      -83%
capt_repl  41274/s      545%        4%        --      -35%      -82%
rx_rx      63380/s      891%       60%       54%        --      -73%
substr    230796/s     3507%      482%      459%      264%        --
[download]

Notice how much better-suited substr() is for this task.

You encounter a problem in speed, though, when X is not just a character, but a character class. First of all, our substr() approach fails immediately, because it uses index(), which looks for a substring, not one of a set of characters. Let's do the same benchmark, but change X to the character class of A-Z.

cmpthese(-5, {
  last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ },
  capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ },
  rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// },
  sexeger => sub {
    my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x;
  },
});

__END__
             Rate      last capt_repl     rx_rx   sexeger
last       5591/s        --      -82%      -86%      -86%
capt_repl 30599/s      447%        --      -21%      -24%
rx_rx     38948/s      597%       27%        --       -3%
sexeger   40049/s      616%       31%        3%        --
[download]

This slow-down is caused by the character class. Because we're not looking for a SINGLE character, we can't jump backwards (the regex engine knows how to handle /.*A/ quickly -- it can "jump" backward to an "A", instead of examining each character -- but it can't handle /.*[AB]/ as quickly).

So what can we do? We can use String::Index, which gives us functions that act like C's strpbrk(), but can do even more. For those of you not familiar with strpbrk() (whose name I can't decipher), it takes a string to look at and a string of characters to find in that source string.

/* C code */
printf("%s", strpbrk("breakfast", "aeiou"));
/* prints: eakfast */
[download]

In Perl, it'd be like doing:

if ("breakfast" =~ /[aeiou]/) {
  $x = substr("breakfast", $-[0]);
}
[download]

That is, it returns the earliest location in the source string of one of the characters in the second string. It's uncool that there's no standard C function that does this from the back of the string, or for all characters except those given...

That's what the String::Index module was written to do! Let's apply it to this problem and run another benchmark:

# crindex is char-class-based rindex()
use String::Index 'crindex';

cmpthese(-5, {
  last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ },
  capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ },
  rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// },
  sexeger => sub {
    my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x;
  },
  crindex => sub {
    my $x = $str;
    substr($x, crindex($x, "ABCDEFGHIJKLMNOPQRSTUVWXYZ")+1) = "";
  },
});

__END__
             Rate      last capt_repl     rx_rx   sexeger   crindex
last       5581/s        --      -82%      -85%      -86%      -90%
capt_repl 30606/s      448%        --      -20%      -23%      -44%
rx_rx     38194/s      584%       25%        --       -4%      -30%
sexeger   39594/s      609%       29%        4%        --      -28%
crindex   54829/s      882%       79%       44%       38%        --
[download]

We are restored. It's not as fast as the original substr() approach because it has to do more work, but it's faster than any other solution.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Comment on Regexes are slow (or, why I advocate String::Index) Select or Download Code

Replies are listed 'Best First'.
Re: Regexes are slow (or, why I advocate String::Index) by revdiablo (Prior) on May 14, 2004 at 23:46 UTC
`strpbrk()` (whose name I can't decipher) In case anyone still wonders, the function name has been demystified: `<japhy> string pointer break` Which still might not make much more sense (it doesn't to me), but at least we know what it stands for.	[reply] [d/l] [select]
Re: Re: Regexes are slow (or, why I advocate String::Index) by Fletch (Bishop) on May 15, 2004 at 03:18 UTC
`char strpbrk(const char s, const char charset);` It takes a string (char) and returns a pointer (char ) to the position to break out the portion of the first string at a position with a character from the charset string. OK, not much better. But it is more or less orthagonal to things such as compare no more than n characters from two str*ings (`strncmp()`).	[reply] [d/l] [select]
Re: Re: Re: Regexes are slow (or, why I advocate String::Index) by jordanh (Chaplain) on May 15, 2004 at 19:46 UTC
It takes a string (char) and returns a pointer (char ) to the position to break out the portion of the first string at a position with a character from the charset string. I think that this nomenclature goes back to SNOBOL pattern matching, which dates back to 1962. There are these two complementary pattern matching operators in SNOBOL: `SPAN() BREAK()` [download] SPAN matches a string containing any characters in its argument list, BREAK matches a string containing any characters not in its argument list. I think the implied mnemonic is to SPAN a substring containing a character set or BREAK (stop matching) on a substring containing anything not in the character set.	[reply] [d/l]
Re: Regexes are slow (or, why I advocate String::Index) by halley (Prior) on May 17, 2004 at 13:59 UTC
My first reaction is, deciding to go with `index()` for anything that even remotely smells like natural language processing development is probably premature optimization. NLP code is organic code; organic programming features like patterns or templates will benefit the developer. Sure, `index()` is faster than `s///`. But only for the things that `index()` can solve. With much of natural language processing, you're probably going to try a LOT of alternative forms, and grammars, and minor adjustments until you get it right. Regexen may be slower to run, but they're faster to develop in any but the simplest of cases. If you develop your code with regexen, and end up realizing that a few of your lines could "benefit" from a simple `index()` replacement, go ahead and replace it. I doubt that you'll replace 1% of your whole NLP code in a typical project, but you'll spend a lot of time hunting for it and verifying that the replacements didn't break anything. And if you later realize you need to tweak the NLP again, you might have to undo your little optimizations. -- `[ e d @ h a l l e y . c c ]`	[reply] [d/l] [select]
Re: Regexes are slow (or, why I advocate String::Index) by William G. Davis (Friar) on May 21, 2004 at 04:31 UTC
The problem with Perl's regular expressions is that they're so easy to use that once you learn them, it can be quite tempting to use them for everything. A lot of skilled Perl hackers actually shun the string manipulation functions as crutches of the novice Perl programmer. Anytime I need to find or extract a specific character or string from within another string, instead of resorting to the regex engine I use the index(), rindex(), and substr() functions. The real downside to doing this is that you run the risk of coming off looking like a novice to your peers by not using regexps. I guess if anyone ever calls me on it, I'll just say, "So what? Lincoln Stein does it that way too!" :) `#!/usr/bin/perl -w use strict; use Benchmark 'cmpthese'; my $string = "this is a string" x 300; cmpthese(10000000, { 'index' => sub { my $res; $res = index($string, "this", 0); $res = index($string, "string"); $res = rindex($string, "string"); }, regex => sub { my $res; $res = $string =~ /^this/; $res = $string =~ /string/; $res = $string =~ /string$/; } }); ...` [download] Benchmark: timing 10000000 iterations of index, regex... index: 10 wallclock secs (10.21 usr + 0.00 sys = 10.21 CPU) @ 979431.93/s (n=10000000) regex: 20 wallclock secs (20.05 usr + 0.00 sys = 20.05 CPU) @ 498753.12/s (n=10000000) Rate regex index regex 498753/s -- -49% index 979432/s 96% --	[reply] [d/l]

Back to Meditations