http://www.perlmonks.org?node_id=432290

JollyJinx has asked for the wisdom of the Perl Monks concerning the following question:

Fellow believers, I've got a problem with the perl regex compiler. It seems that compliation of combined regexes ( or alternation whatever you call it ) is not optimized.

Using a /(foo|bar)/ regex on strings is slower than using a foreach loop doing the matching one after another. I've written a testprogramm and looked at the perl source to find out why. Now I know. It seems that DFA won't get optimised for the alternation.

As I have no time and knowledge and skill for optimising the perlregex compiler from scratch, what can I do. Programming such foreach loops gives me headaches - it such 'awk'ward.

I need to get those regexes fast as nowadays the strings I'm working on tend to get larger ( e.g. xml-files ) - any idea ?

Jolly

Here's the testprogram for those of you that don't think it's true:

#!/bin/perl use strict; use Digest::MD5 qw(md5 md5_hex md5_base64); use Time::HiRes qw(time ); #use re 'debug'  ; foreach my $regexcount (1,5,10) {         foreach my $regexlength (2,5,10,20)         {                 my @items       = map{ createRandomTextWithLength($reg +exlength); } (1..$regexcount);                 my $regexstr    = join('|',@items);                 my $regex               = qr /(?:$regexstr)/;                 foreach my $stringlength (100,1000,10000,100000)                 {                         print localtime()." Stringlength: $stringlengt +h Number of Regexes:$regexcount Length of each Regex:$regexlength\n";                         my $teststring = createRandomTextWithLength($s +tringlength);                         my $timer;                         {                                 my $test=$teststring;                                 $timer =time;                                 $test =~ s/$regex/foobar/g;                                 printf("ElapsedTime:%5.4f  %20s %20s\n",time-$timer,md5_hex($test),$regex);                         }                         {                                 my $test=$teststring;                                 $timer =time;                                 foreach my $oneregex (@items)                                 {                                         $test =~ s/$oneregex/foobar/g; +                                 }                                 printf("ElapsedTime:%5.4f  %20s %20s\n",time-$timer,md5_hex($test),' for loop over '.join(',',@items)) +;                         }                         print "\n";                 }         } } sub createRandomTextWithLength($) {         my($count) = (@_);         my $string;         for (1.. $count)         {                 $string.=chr(ord('a')+rand(20));         }         return $string; }
  • Comment on Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???  
  • Download Code

Replies are listed 'Best First'.
Re: Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???
by Roy Johnson (Monsignor) on Feb 18, 2005 at 13:18 UTC
    Using a /(foo|bar)/ regex on strings is slower than using a foreach loop doing the matching one after another.
    Not in my benchmark:
    use strict; use warnings; use Benchmark 'cmpthese'; undef $/; my $str = <DATA>; cmpthese(-2, { alternation => sub { my @m = $str =~ /(foo|bar)/g }, map => sub { my @m = map $str =~ /$_/g, qw(foo bar) }, loop => sub { my @m; push @m, $str =~ /$_/g for qw(foo bar) } +, }); cmpthese(-2, { alt_s => sub { my $s2 = $str; $s2 =~ s/(foo|bar)//g }, loop_s => sub { my $s2 = $str; $s2 =~ s/$_//g for qw(foo bar) } }); __DATA__ This string contains foo and bar for fools and bards and I pity the foo who bars the way
    Rate map loop alternation map 14977/s -- -6% -25% loop 15986/s 7% -- -20% alternation 20090/s 34% 26% -- Rate loop_s alt_s loop_s 17436/s -- -37% alt_s 27643/s 59% --
    Update:
    But I converted your example program to use Benchmark, and I do get better results with for:
    use strict; use Benchmark 'cmpthese'; foreach my $regexcount (10) { foreach my $regexlength (5,20) { my @items = map{ createRandomTextWithLength($regexlength) } + (1..$regexcount); my $regexstr = join('|',@items); my $regex = qr/(?:$regexstr)/; foreach my $stringlength (1000,100000) { print join "\n", "Stringlength: $stringlength", "Number of Regexes:$regexcount", "Length of each Regex +:$regexlength\n"; my $teststring = createRandomTextWithLength($stringlength) +; cmpthese(-2, { alt => sub { my $test=$teststring; $test =~ s/$regex/foobar/g; }, for => sub { my $test=$teststring; foreach my $oneregex (@items) { $test =~ s/$oneregex/foobar/g; } } }); } } } sub createRandomTextWithLength($) { my($count) = (@_); my $string; for (1.. $count) { $string.=chr(ord('a')+rand(20)) } return $string; }
    Results:
    Stringlength: 1000 Number of Regexes:10 Length of each Regex:5 Rate alt for alt 1296/s -- -70% for 4357/s 236% -- Stringlength: 100000 Number of Regexes:10 Length of each Regex:5 Rate alt for alt 12.7/s -- -97% for 379/s 2877% -- Stringlength: 1000 Number of Regexes:10 Length of each Regex:20 Rate alt for alt 1324/s -- -68% for 4144/s 213% -- Stringlength: 100000 Number of Regexes:10 Length of each Regex:20 Rate alt for alt 12.7/s -- -98% for 676/s 5209% --

    Caution: Contents may have been coded under pressure.

      The for and map implementations have to recompile the regex on each iteration, but the alternation doesn't have that limitation. Here's a modified version that fixes that:

      use strict; use warnings; use Benchmark 'cmpthese'; undef $/; my $str = <DATA>; my @regexen = ( qr/foo/, qr/bar/ ); cmpthese(-2, { alternation => sub { my @m = $str =~ /(foo|bar)/g }, map => sub { my @m = map $str =~ /$_/g, @regexen }, loop => sub { my @m; push @m, $str =~ /$_/g for @regexe +n }, }); cmpthese(-2, { alt_s => sub { my $s2 = $str; $s2 =~ s/(foo|bar)//g }, loop_s => sub { my $s2 = $str; $s2 =~ s/$_//g for @regexen} }); __DATA__ This string contains foo and bar for fools and bards and I pity the foo who bars the way

      I got results more consistant with the OP:

      Rate alternation map loop alternation 26962/s -- -61% -64% map 68918/s 156% -- -9% loop 75551/s 180% 10% -- Rate alt_s loop_s alt_s 32434/s -- -20% loop_s 40707/s 26% --

      Just for kicks, I added a study $str; right after reading the DATA filehandle. I know study is rarely useful, but it's supposed to help in situations where a single string is going to be matched against many different regexen, which is roughly what we have here. Results:

      Rate alternation loop map alternation 26569/s -- -56% -60% loop 60015/s 126% -- -10% map 66863/s 152% 11% -- Rate alt_s loop_s alt_s 31355/s -- -19% loop_s 38857/s 24% --

      The speed drops across the board, but alternation is only affected by 1.5% (which is insignificant noise). But loop gets hit hard.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        A problem with all of these benchmarks is that the different choices don't actually do the same things. See my response below.

Re: Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???  
by demerphq (Chancellor) on Feb 18, 2005 at 14:32 UTC

    Ok /(foo|bar)/ disables a bunch of optimizations that /foo/ || /bar/ can take advantage of. In fact the later can benefit from boyer-moore matching which means it may not have to read the full input stream to tell that the words arent present. However, this wont really scale up, slight modifications of the latter make them also disable the optimisations, and once you start looking at lots of patterns I think you will find that alternate techniques such as Regex::PreSuf, Regexp::Optimizer, Regexp::Assemble or Regexp::List will probably start becomming competitive.

    However its worth noting that its very possible that future versions of perl will have optimizations that will make /(foo|bar)/ faster than /foo/ || /bar/. Whatever performance tuning you do will be just that, tuning for your particular enviornment.

    ---
    demerphq

      Thanx for enlightment.

      Using index isn't an option here the program I delivered is an tailored down to verify the original problem.

      The original problem is that I have THOUSANDS of Regexes to match and replace (not just fixed strings like /foo/ ) in a string that is a few hundred k'Bytes long.

      Any further suggestions - Jolly

        If your regexes are more complex than simple fixed strings, then it would seem that optimization-destroying alternation would have less of an effect than on speed than the examples above. The regex engine optimises the search for fixed strings down to a fast Boyer Moore algorithm (really fast; I once spent a week trying to tune it with different versions of BM I know - it beat all comers!) but more general patterns need to use the full engine,

        But if you are so heavily rewriting a string, perhaps it is time to rethink your algorithm. One problem might be that the engine needs to search through the whole string for every search and replace pattern and that search tests the pattern at every position in the string. If you know that the string has a lot of structure and that a given pattern could only match at certain places, attempting a match at every position is a lot of wasted effort.

        You mentioned XML; if you are replacing particular attributes or elements, it may be a whole lot faster to parse the file using XML::Parser with the lightwieght SAX interface, replacing things as you go. That way the string is only traversed once and replacement becomes, e.g., a simple hash lookup.

        Update: added a phrase.

        -Mark

Re: Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???
by mlh2003 (Scribe) on Feb 18, 2005 at 12:52 UTC
    I believe using regex's - even for relatively simple matching - is expensive compared with the simpler (less elaborate) search builtins like index, or even comparison operators. As you know, index is an optimised function specifically created to find a substring within a string. But using a regex fires up the 'regex engine', which in itself is reasonably optimised, but can do a LOT more than the simpler functions and operators. As such it uses more processor resources.

    UPDATE Just realized you were comparing it with /foo/ and /bar/, which are still regex's. Well, it is possible that the regex engine can optimize a regex with a constant matching expression (like /foo/) to the point where it is much faster than alternation (/(foo|bar)/). To really be able to answer that question, you'd need to know a fair bit about the internals of the regex engine.

    UPDATE 2 See Roy Johnson's reply below. His benchmark verifies my guess :)

Re: Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???
by samy_kumar (Scribe) on Feb 18, 2005 at 13:08 UTC
    Hi, I think the problem is with this line.
    my $regex = qr /(?:$regexstr)/;
    because perl compiles this patern again & again in the foreach loop.
    foreach my $stringlength (100,1000,10000,100000)
    If you use this pattern
    my $regex = qr /(?:$regexstr)/o;
    instead of
    my $regex = qr /(?:$regexstr)/;
    It will be fast. Since it compiles the pattern only once.
      No, Perl compiles the regex once where it is defined with qr.

      Caution: Contents may have been coded under pressure.
Re: Regex combining /(foo|bar)/ slower than using foreach (/foo/,/bar/) ???
by kscaldef (Pilgrim) on Feb 19, 2005 at 05:42 UTC

    I think that your proposed optimization isn't as straightforward as you think. Consider:

    $string = "fooar" $string =~ s/(foo|bar)/b/g; print $string;

    vs.

    $string = "fooar" $string =~ s/$_/b/g for (qr/foo/, qr/bar/); $print $string;
      exactly. Problem for me is, that if I've got a soulution in the language I use it as it is hopefuly perfectly optimized.

      It feels like on those old VAX systems where you could program in assembler some operations faster with simpler ops. One of the reasons RISC was introduced.

      If Perl gives me the possiblity of using a 'nice' small solution (/foo|bar/) I always thought I should be able to program a faster solution with a perl PROGRAM ( the for loop ).

      But that's just propably my understanding of how things should work.

      Jolly - thanx for u'r time.

        If Perl gives me the possiblity of using a 'nice' small solution (/foo|bar/) I always thought I should be able to program a faster solution with a perl PROGRAM ( the for loop ).

        I wouldn't think that. At least, not if you're smart about language design. With good semantics, you provide the optimizer more information about what you're trying to do, thus allowing it to make better optimization. Unfortuately, Perl5 has hopelessly bad semantics, so this principle is mostly lost for us.

        "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.