Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things

Regexes are slow (or, why I advocate String::Index)

by japhy (Canon)
on May 14, 2004 at 21:44 UTC ( #353505=perlmeditation: print w/replies, xml ) Need Help??

This comes from Truncating Last Sentence. This is a discussion on why regexes might not be so hot in a particular arena, and why my module helps out.

I'd like it to be known that, on a string with many X's in it, doing

$str =~ s/X[^X]*$/X/;
is not very fast. It tries to match at each X it finds. You'd probably be better off doing
$str =~ s/(.*X).*/$1/;
Although in that case, it'd be sweet to use Regexp::Keep and say:
$str =~ s/.*X\K.*//;
or do it the two-regex way (since \K isn't core Perl, it's not as fast):
$str =~ /.*X/g and $str =~ s/\G.*//;
Or you could reverse the string:
($str = reverse $str) =~ s/^[^X]+//; $str = reverse $str;
Of course, for this task, a regex is probably the wrong tool. You can use string functions:
substr($str, rindex($str, "X")+1) = "";
Here's a benchmark of these methods (leaving out \K, because it's not worth showing):
use Benchmark 'cmpthese'; my $str = "alphabet X alphabet" x 100 . "junk at the end" x 10; cmpthese(-5, { # is this the last X? last => sub { my $x = $str; $x =~ s/X[^X]*$/X/ }, # capture up to the last X and replace it with itself capt_repl => sub { my $x = $str; $x =~ s/(.*X).*/$1/ }, # match up to the last X, then remove everything after it rx_rx => sub { my $x = $str; $x =~ /.*X/g and $x =~ s/\G.*// }, # reverse, remove up to first X, reverse sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^X]+//; $x = reverse $x; }, # find the last X, remove everything after it substr => sub { my $x = $str; substr($x, rindex($x, "X")+1) = "" }, }); __END__ Rate last sexeger capt_repl rx_rx substr last 6398/s -- -84% -84% -90% -97% sexeger 39625/s 519% -- -4% -37% -83% capt_repl 41274/s 545% 4% -- -35% -82% rx_rx 63380/s 891% 60% 54% -- -73% substr 230796/s 3507% 482% 459% 264% --
Notice how much better-suited substr() is for this task.

You encounter a problem in speed, though, when X is not just a character, but a character class. First of all, our substr() approach fails immediately, because it uses index(), which looks for a substring, not one of a set of characters. Let's do the same benchmark, but change X to the character class of A-Z.

cmpthese(-5, { last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ }, capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ }, rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// }, sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x; }, }); __END__ Rate last capt_repl rx_rx sexeger last 5591/s -- -82% -86% -86% capt_repl 30599/s 447% -- -21% -24% rx_rx 38948/s 597% 27% -- -3% sexeger 40049/s 616% 31% 3% --
This slow-down is caused by the character class. Because we're not looking for a SINGLE character, we can't jump backwards (the regex engine knows how to handle /.*A/ quickly -- it can "jump" backward to an "A", instead of examining each character -- but it can't handle /.*[AB]/ as quickly).

So what can we do? We can use String::Index, which gives us functions that act like C's strpbrk(), but can do even more. For those of you not familiar with strpbrk() (whose name I can't decipher), it takes a string to look at and a string of characters to find in that source string.

/* C code */ printf("%s", strpbrk("breakfast", "aeiou")); /* prints: eakfast */
In Perl, it'd be like doing:
if ("breakfast" =~ /[aeiou]/) { $x = substr("breakfast", $-[0]); }
That is, it returns the earliest location in the source string of one of the characters in the second string. It's uncool that there's no standard C function that does this from the back of the string, or for all characters except those given...

That's what the String::Index module was written to do! Let's apply it to this problem and run another benchmark:

# crindex is char-class-based rindex() use String::Index 'crindex'; cmpthese(-5, { last => sub { my $x = $str; $x =~ s/([A-Z])[^A-Z]*$/$1/ }, capt_repl => sub { my $x = $str; $x =~ s/(.*[A-Z]).*/$1/ }, rx_rx => sub { my $x = $str; $x =~ /.*[A-Z]/g and $x =~ s/\G.*// }, sexeger => sub { my $x = $str; ($x = reverse $x) =~ s/^[^A-Z]+//; $x = reverse $x; }, crindex => sub { my $x = $str; substr($x, crindex($x, "ABCDEFGHIJKLMNOPQRSTUVWXYZ")+1) = ""; }, }); __END__ Rate last capt_repl rx_rx sexeger crindex last 5581/s -- -82% -85% -86% -90% capt_repl 30606/s 448% -- -20% -23% -44% rx_rx 38194/s 584% 25% -- -4% -30% sexeger 39594/s 609% 29% 4% -- -28% crindex 54829/s 882% 79% 44% 38% --
We are restored. It's not as fast as the original substr() approach because it has to do more work, but it's faster than any other solution.
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re: Regexes are slow (or, why I advocate String::Index)
by revdiablo (Prior) on May 14, 2004 at 23:46 UTC
    strpbrk() (whose name I can't decipher)

    In case anyone still wonders, the function name has been demystified:

        <japhy> string pointer break

    Which still might not make much more sense (it doesn't to me), but at least we know what it stands for.

      char *strpbrk(const char *s, const char *charset);

      It takes a string (char*) and returns a pointer (char *) to the position to break out the portion of the first string at a position with a character from the charset string.

      OK, not much better. But it is more or less orthagonal to things such as compare no more than n characters from two strings (strncmp()).

          It takes a string (char*) and returns a pointer (char *) to the position to break out the portion of the first string at a position with a character from the charset string.
        I think that this nomenclature goes back to SNOBOL pattern matching, which dates back to 1962.

        There are these two complementary pattern matching operators in SNOBOL:

        SPAN matches a string containing any characters in its argument list, BREAK matches a string containing any characters not in its argument list. I think the implied mnemonic is to SPAN a substring containing a character set or BREAK (stop matching) on a substring containing anything not in the character set.

Re: Regexes are slow (or, why I advocate String::Index)
by halley (Prior) on May 17, 2004 at 13:59 UTC
    My first reaction is, deciding to go with index() for anything that even remotely smells like natural language processing development is probably premature optimization. NLP code is organic code; organic programming features like patterns or templates will benefit the developer.

    Sure, index() is faster than s///. But only for the things that index() can solve.

    With much of natural language processing, you're probably going to try a LOT of alternative forms, and grammars, and minor adjustments until you get it right. Regexen may be slower to run, but they're faster to develop in any but the simplest of cases.

    If you develop your code with regexen, and end up realizing that a few of your lines could "benefit" from a simple index() replacement, go ahead and replace it. I doubt that you'll replace 1% of your whole NLP code in a typical project, but you'll spend a lot of time hunting for it and verifying that the replacements didn't break anything. And if you later realize you need to tweak the NLP again, you might have to undo your little optimizations.

    [ e d @ h a l l e y . c c ]

Re: Regexes are slow (or, why I advocate String::Index)
by William G. Davis (Friar) on May 21, 2004 at 04:31 UTC

    The problem with Perl's regular expressions is that they're so easy to use that once you learn them, it can be quite tempting to use them for everything. A lot of skilled Perl hackers actually shun the string manipulation functions as crutches of the novice Perl programmer.

    Anytime I need to find or extract a specific character or string from within another string, instead of resorting to the regex engine I use the index(), rindex(), and substr() functions. The real downside to doing this is that you run the risk of coming off looking like a novice to your peers by not using regexps. I guess if anyone ever calls me on it, I'll just say, "So what? Lincoln Stein does it that way too!" :)

    #!/usr/bin/perl -w use strict; use Benchmark 'cmpthese'; my $string = "this is a string" x 300; cmpthese(10000000, { 'index' => sub { my $res; $res = index($string, "this", 0); $res = index($string, "string"); $res = rindex($string, "string"); }, regex => sub { my $res; $res = $string =~ /^this/; $res = $string =~ /string/; $res = $string =~ /string$/; } }); ...
    Benchmark: timing 10000000 iterations of index, regex...
         index: 10 wallclock secs (10.21 usr +  0.00 sys = 10.21 CPU) @ 979431.93/s
         regex: 20 wallclock secs (20.05 usr +  0.00 sys = 20.05 CPU) @ 498753.12/s
              Rate regex index
    regex 498753/s    --  -49%
    index 979432/s   96%    --

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://353505]
Approved by Corion
Front-paged by BrowserUk
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (7)
As of 2017-03-25 21:10 GMT
Find Nodes?
    Voting Booth?
    Should Pluto Get Its Planethood Back?

    Results (313 votes). Check out past polls.