Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister

is index faster than regexp for fixed text token?

by sflitman (Hermit)
on Jul 05, 2009 at 08:27 UTC ( #777323=perlquestion: print w/replies, xml ) Need Help??
sflitman has asked for the wisdom of the Perl Monks concerning the following question:

I was wondering whether it made sense to write things like if (index($options,'checkboxes')>-1) {...} instead of if ($options=~/checkboxes/) {...} for speed. So I whipped up this little benchmark:
#!/usr/bin/perl # regindbench - which is faster, regexp for fixed text patterns or ind +ex()? use strict; use Benchmark qw/:all/; sub random { int rand shift } my @letters=('A'..'Z'); my $nLetters=scalar @letters; my $text=join('',map { $letters[random($nLetters)] } 0..9999); # 10 +,000 chars my ($pat,@pats); map { do { $pat=join('',map { $letters[random($nLetters)] } 0..4); # 5 cha +r patterns } until index($text,$pat)==-1; # which never match push @pats,$pat; } 0..99; sub via_regexp { map { die "via_regexp $_ found\n" if $text=~/$_/ } @pats; } sub via_index { map { die "via_index $_ found\n" if index($text,$_)>-1 } @pats; } cmpthese(100000,{regexp=>\&via_regexp,index=>\&via_index}); exit; __END__ Perl 5.8.8 using 100000 iterations on a 3 GHz 64-bit Intel QuadCore running Fedora Core 7 (6000 bogomips per core) Rate index regexp index 370/s -- -47% regexp 695/s 88% --
As you can see, regexp is about twice as fast as using index for running 100 failed matches of fixed text against a single text. I am aware of Regexp::List but wanted to use something which tests unoptimized Perl for really single cases.

If I add study $text, there is a fairly impressive improvement, but then again I'm not going to be using study for every comparison:

study $text; # added before call to cmpthese above Rate index regexp index 364/s -- -71% regexp 1274/s 250% --
What do my fellow monks think? Shouldn't regexp have more overhead than a simple strstr? Or is this that Magic we keep hearing about? ;-)


Replies are listed 'Best First'.
Re: is index faster than regexp for fixed text token?
by roboticus (Chancellor) on Jul 05, 2009 at 13:12 UTC

    I'd suggest that there could be some "magic" involved. Are you familiar with the Boyer-Moore string search algorithm? I recall being surprised when I originally saw it. A typical index function might be implemented as a typical character-by-character search: e.g. Look for the first character in the target that matches the first character in the string to find, then do a string comparison at each point you find it. It seems pretty fast, and at first glance, it doesn't seem to be possible to improve it by much.

    But after you read about the Boyer-Moore algorithm, you see that it's quite possible for you to the search for the string while looking at only a fraction of the string. (In the linked Wikipedia article, you'll see that in their example search, the typical case is to skip eight characters for the next character to examine. In my mind, that nearly qualifies as magic!) With all the work involved in building the state machine for searching, it's a sure bet that they take advantage of all the good tricks to make the search faster.

    Now, I haven't evaluated the benchmark you created (I haven't drunk my morning coffee yet), but you might want to play with it a bit to vary the size of your search string, your buffer size, and the percentage of hits to see if you can figure out the "rules" that make one better than the other for (a) your machine, (b) the current version of perl, etc.

    Looking at degenerate cases, it's easy to see that the mundane index implementation is going to be fastest with tiny search strings and tiny buffers that you execute only once. After all, in this case there are only a few compares, and there's no overhead of building a state machine. But once you work with very large strings, and you can re-use your state machine, the Boyer-Moore algorithm is obviously better. The overhead of building the state machine is significant, but if you can look at only 10% of your buffer, you can save some serious cycles. But where the dividing line is between them shifts as the technology advances--I have no idea where the border is right now.

      Excellent thoughts, thank you. The question for me is in squeezing more performance out of a mod_perl web application which needs to check for named options in tiny search strings, so perhaps the mundane index implementation is better there. It is faster for me to type regexp matches, so I'll probably use them more, and regexp matches can have /i so I can make them case-insensitive which with index requires lc on both arguments.

      I studied Boyer-Moore in my first C Algorithms class under the very demanding Dr. Wong. I'll never forget him, he took points off my code for forgetting to comment, and I learned as much as 50% of my programming craft from him. Nearly failed that class! I was such a geek and I signed up for it as a freshman, when he taught it like a graduate course. I got the impression that he didn't like me...but that probably made me study harder! He also showed me the XOR trick for swapping two registers without an intermediary:

      $a^=$b; $b^=$a; $a^=$b; # swap a and b



        I'd definitely update the benchmark to test cases that are close to what you're using in your application, just to make sure you get the best result. Include the lc calls, as well, so you can see if they push you to the regex side.

        Oh, yeah, if you're going to do other matches as well, be sure to see if you can incorporate them into the regex to let it get even more performance for you.

        Regarding your class--I always found it best to take the course with the teachers other students considered "the hardest", as I found I usually learned more from them. And because of their reputation, I'd keep up on the notes & homework. It's amazing how easy a difficult class is if you keep up on the reading and homework.


        Update: Added second paragraph.

        "He also showed me the XOR trick for swapping two registers without an intermediary":
        $a^=$b; $b^=$a; $a^=$b; # swap a and b

        This "trick" is part of first level class in 'C'. It is shown to demonstrate power of XOR. It looks cool but it is inefficient and in general not a good idea even though it produces a correct result. XOR is "more expensive", meaning takes longer than other simple bitwise ops like OR or AND or NOT. Anyway this construct is just demonstrated to explain XOR, it is not practical is not used even in 'C'.

        In Perl, this is bad code! - just wrong.

        The Perl way: ($a,$b)=($b,$a);
        the above is practical and could be used.

Re: is index faster than regexp for fixed text token?
by ikegami (Pope) on Jul 05, 2009 at 14:42 UTC
    I get the same performance for both:
    Rate index regexp index 299/s -- -5% regexp 316/s 6% -- Rate regexp index regexp 294/s -- -12% index 332/s 13% -- Rate regexp index regexp 303/s -- -3% index 313/s 3% --

    That's what I'd expect, since index and regex matches use the same algorithm (Boyer-Moore) to find constant strings

      I don't know, I definitely get a big difference. This is with Perl 5.10.0
      Without study: Rate index regexp index 466/s -- -27% regexp 639/s 37% -- With study: Rate index regexp index 467/s -- -54% regexp 1019/s 118% --
      Doesn't it make sense that regexp has an advantage in 5.10 since it is building a trie?


        I got similar results...index faster on my machine!

        Rate regexp index regexp 848/s -- -74% index 3269/s 286% --
        I am unsure why the Ikegami machine does so well on regex. He does have a 64 bit vs 32 bit machine and I don't see why it should make so much difference, but there could be something about that 64 vs 32 bit that speeds things up a lot.

        I am running Perl 5.10 which is significantly better on regex than Perl 5.8. I suppose there could be other things related to the power of the Ikegami machine like more L2 cache.


        I ran with longer strings to be searched and result does appear to approach the same execution time for both cases:

        Rate index regexp index 470/s -- -5% regexp 494/s 5% --

        This is an interesting finding. I am not sure why this happens. What I'm guessing is that when the string to be searched is relatively "small", (some ~10+ thousands of chars), index() works better (caveat: in this "search for string X" situation!) because although "dumb" it is fast. But at some point some more "computationally expensive" regex algorithm "gains ground".

        This is an interesting question for which I have no general heuristic. I think it depends upon length of string to be searched, length of string that we are looking for, the data in each string and perhaps a lot more!

      I don't know, I definitely get a big difference. This is with Perl 5.10.0
      Without study: Rate index regexp index 466/s -- -27% regexp 639/s 37% -- With study: Rate index regexp index 467/s -- -54% regexp 1019/s 118% --
      Doesn't it make sense that regexp has an advantage in 5.10 since it is building a trie like Regexp::Assemble?


        There's no trie involved since there's no alternation.
Re: is index faster than regexp for fixed text token?
by JavaFan (Canon) on Jul 06, 2009 at 09:01 UTC
    Now you have only considered failed matches, with random strings. If you want to make an argument to go with either a regexp or index() you should also consider strings with matches, consider where the match happens (beginning, end of string), and what happens with near matches. (I.e. searching for 'abba' in 'ab' x 1000).

    Of course, in practice, most searches will be of small strings in small strings, neither of them very random.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://777323]
Approved by marto
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (7)
As of 2018-03-17 18:40 GMT
Find Nodes?
    Voting Booth?
    When I think of a mole I think of:

    Results (225 votes). Check out past polls.