Regexes are slow (or, why I advocate String::Index)
by japhy (Canon) on May 14, 2004 at 21:44 UTC
This comes from Truncating Last Sentence. This is a discussion on why regexes might not be so hot in a particular arena, and why my module helps out.
I'd like it to be known that, on a string with many X's in it, doing
is not very fast. It tries to match at each X it finds. You'd probably be better off doing
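A minimal sketch of that contrast, assuming the task is to delete everything from the last X onward. The sample string and both patterns are my illustration, not necessarily the exact ones measured:

```perl
use strict;
use warnings;

# Assumed task: delete everything from the last "X" to the end of the string.
my $str = "aXbXcXdef";

# Slow on X-heavy input: the engine starts an attempt at every X,
# scans [^X]* forward, and fails at the next X until it reaches the last one.
(my $slow = $str) =~ s/X[^X]*$//;

# Greedy alternative: .* runs to the end, then backtracks to the last X.
(my $greedy = $str) =~ s/(.*)X.*/$1/s;

print "$slow\n$greedy\n";
```

Both substitutions leave the same string; only the amount of work the engine does to get there differs.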
Although in that case, it'd be sweet to use Regexp::Keep and say:
or do it the two-regex way (since \K isn't core Perl, it's not as fast):
Or you could reverse the string:
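One way the reversal trick might look, again assuming the goal is to drop everything from the last X onward (the variable names are hypothetical):

```perl
use strict;
use warnings;

my $str = "aXbXcXdef";

# Reverse the string so the last X becomes the first one,
# strip the leading run up to and including that X, then reverse back.
my $rev = reverse $str;
$rev =~ s/^[^X]*X//;
my $result = reverse $rev;

print "$result\n";
```

The pattern is anchored at the start, so the engine makes a single attempt instead of one per X.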
Of course, for this task, a regex is probably the wrong tool. You can use string functions:
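A sketch of the string-function version, assuming rindex() to find the last X and substr() to keep what precedes it (my reading of the approach, not a verbatim copy of the original snippet):

```perl
use strict;
use warnings;

my $str = "aXbXcXdef";

# rindex() returns the position of the last "X", or -1 if there is none.
my $pos = rindex($str, "X");

# Keep only the part before that position; leave the string alone if no X.
$str = substr($str, 0, $pos) if $pos >= 0;

print "$str\n";
```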
Here's a benchmark of these methods (leaving out \K, because it's not worth showing):
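A harness along these lines could be used to reproduce the comparison with the core Benchmark module. The input string and the method names are hypothetical, and no timings are claimed here; run it to get numbers for your own perl:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: many X's, so the per-X retry cost is visible.
my $input = join("", map { "some text X " } 1 .. 200) . "tail";

# Each method takes a string and returns it truncated at the last X.
my %method = (
    anchored => sub { my $s = shift; $s =~ s/X[^X]*$//; $s },
    greedy   => sub { my $s = shift; $s =~ s/(.*)X.*/$1/s; $s },
    reversed => sub {
        my $s = scalar reverse $_[0];
        $s =~ s/^[^X]*X//;
        return scalar reverse $s;
    },
    rindex   => sub {
        my $s = shift;
        my $p = rindex($s, "X");
        return $p >= 0 ? substr($s, 0, $p) : $s;
    },
);

# Wrap each method so it always runs against the same input.
my %bench;
for my $name (keys %method) {
    my $code = $method{$name};
    $bench{$name} = sub { $code->($input) };
}

# Run each entry for about one CPU second and print a comparison table.
cmpthese(-1, \%bench);
```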
Notice how much better-suited substr() is for this task.
You encounter a problem in speed, though, when X is not just a character, but a character class. First of all, our substr() approach fails immediately, because it uses index(), which looks for a substring, not one of a set of characters. Let's do the same benchmark, but change X to the character class of A-Z.
This slow-down is caused by the character class. Because we're not looking for a SINGLE character, we can't jump backwards (the regex engine knows how to handle /.*A/ quickly -- it can "jump" backward to an "A", instead of examining each character -- but it can't handle /.*[AB]/ as quickly).
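To make the change concrete, here is a sketch of the same approaches with X swapped for [A-Z]. The sample string is hypothetical, and the per-character rindex() loop is only my workaround to show why the plain index()-style approach no longer applies directly:

```perl
use strict;
use warnings;

my $str = "fooAbarBbazCqux";

# Regex forms: same shape as before, with a class in place of the literal X.
(my $anchored = $str) =~ s/[A-Z][^A-Z]*$//;
(my $greedy   = $str) =~ s/(.*)[A-Z].*/$1/s;

# rindex() takes a substring, not a set, so one call per candidate
# character is needed; keep the largest position found.
my $last = -1;
for my $c ('A' .. 'Z') {
    my $p = rindex($str, $c);
    $last = $p if $p > $last;
}
my $by_rindex = $last >= 0 ? substr($str, 0, $last) : $str;

print "$anchored\n$greedy\n$by_rindex\n";
```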
So what can we do? We can use String::Index, which gives us functions that act like C's strpbrk(), but can do even more. For those of you not familiar with strpbrk() (whose name I can't decipher), it takes a string to look at and a string of characters to find in that source string.
In Perl, it'd be like doing:
That is, it returns the earliest location in the source string of one of the characters in the second string. It's uncool that there's no standard C function that does this from the back of the string, or for all characters except those given...
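A pure-Perl sketch of that behaviour, returning an index rather than a pointer. The function name first_index_of_any is made up for illustration and is not part of any module:

```perl
use strict;
use warnings;

# Return the position of the first character in $str that appears in
# $chars, or -1 if none does (hypothetical helper, for illustration only).
sub first_index_of_any {
    my ($str, $chars) = @_;
    return -1 unless length $chars;

    # \Q...\E escapes characters such as ] ^ - that are special in a class.
    # $-[0] is the offset where the last successful match started.
    return $str =~ /[\Q$chars\E]/ ? $-[0] : -1;
}

print first_index_of_any("broadcast", "aeiouy"), "\n";   # 2
print first_index_of_any("rhythm", "aeiou"), "\n";       # -1
```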
That's what the String::Index module was written to do! Let's apply it to this problem and run another benchmark:
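As a hedged sketch of how that might look: the code below uses crindex() from String::Index (last position of any of the given characters, as I understand the module's documentation) when the module is installed, and otherwise falls back to a pure-Perl stand-in I wrote for illustration. The sample string and helper name are assumptions:

```perl
use strict;
use warnings;

# Pure-Perl stand-in, used only if String::Index is not installed.
sub last_index_of_any {
    my ($str, $chars) = @_;
    my $best = -1;
    for my $c (split //, $chars) {
        my $p = rindex($str, $c);
        $best = $p if $p > $best;
    }
    return $best;
}

# Prefer the module's crindex() when it can be loaded.
my $crindex = eval { require String::Index; String::Index->can('crindex') }
    || \&last_index_of_any;

my $str = "fooAbarBbazCqux";

# Find the last capital letter and keep everything before it.
my $pos = $crindex->($str, join("", 'A' .. 'Z'));
my $truncated = $pos >= 0 ? substr($str, 0, $pos) : $str;

print "$truncated\n";
```

The fallback makes one rindex() call per candidate character, so it is only meant to show the semantics, not to compete on speed.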
We are restored. It's not as fast as the original substr() approach because it has to do more work, but it's faster than any other solution.