JollyJinx has asked for the wisdom of the Perl Monks concerning the following question:
Using a /(foo|bar)/ regex on strings is slower than using a foreach loop that does the matching one pattern after another. I've written a test program and looked at the perl source to find out why. Now I know. It seems that the DFA doesn't get optimised for the alternation.
As I have neither the time nor the knowledge and skill to optimise the Perl regex compiler from scratch, what can I do? Programming such foreach loops gives me headaches; it's so 'awk'ward.
I need to get those regexes fast, as nowadays the strings I'm working on tend to get larger (e.g. XML files). Any ideas?
Jolly
Here's the test program for those of you who don't think it's true:
#!/bin/perl
use strict;
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Time::HiRes qw(time);
#use re 'debug';

foreach my $regexcount (1,5,10) {
    foreach my $regexlength (2,5,10,20) {
        my @items = map { createRandomTextWithLength($regexlength); } (1..$regexcount);
        my $regexstr = join('|',@items);
        my $regex = qr/(?:$regexstr)/;
        foreach my $stringlength (100,1000,10000,100000) {
            print localtime()." Stringlength: $stringlength Number of Regexes:$regexcount Length of each Regex:$regexlength\n";
            my $teststring = createRandomTextWithLength($stringlength);
            my $timer;
            {
                my $test=$teststring;
                $timer =time;
                $test =~ s/$regex/foobar/g;
                printf("ElapsedTime:%5.4f %20s %20s\n",time-$timer,md5_hex($test),$regex);
            }
            {
                my $test=$teststring;
                $timer =time;
                foreach my $oneregex (@items) {
                    $test =~ s/$oneregex/foobar/g;
                }
                printf("ElapsedTime:%5.4f %20s %20s\n",time-$timer,md5_hex($test),' for loop over '.join(',',@items));
            }
            print "\n";
        }
    }
}

sub createRandomTextWithLength($) {
    my($count) = (@_);
    my $string;
    for (1.. $count) {
        $string.=chr(ord('a')+rand(20));
    }
    return $string;
}
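For anyone who wants to reproduce the comparison with less scaffolding, here is a minimal sketch using the core Benchmark module. The helper names (random_text, build_alternation, replace_alternation, replace_loop), the fixed seed, and the sizes (5 patterns of 5 characters, a 10,000-character string) are my own assumptions for illustration, not taken from the program above. It only measures; it makes no claim about which variant wins on a given perl version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical helper: random lowercase text over the letters a..t,
# mirroring the alphabet used in the test program above.
sub random_text {
    my ($length) = @_;
    return join '', map { chr(ord('a') + int rand 20) } 1 .. $length;
}

# Compile one alternation regex from a list of literal strings.
sub build_alternation {
    my ($items) = @_;
    my $alternation = join '|', map { quotemeta } @$items;
    return qr/(?:$alternation)/;
}

# Variant 1: a single substitution pass with the precompiled alternation.
sub replace_alternation {
    my ($text, $regex) = @_;
    $text =~ s/$regex/foobar/g;
    return $text;
}

# Variant 2: one substitution pass per literal, applied in sequence.
sub replace_loop {
    my ($text, $items) = @_;
    for my $item (@$items) {
        $text =~ s/\Q$item\E/foobar/g;
    }
    return $text;
}

srand(42);    # assumed fixed seed so runs use the same generated input
my @items = map { random_text(5) } 1 .. 5;
my $text  = random_text(10_000);
my $regex = build_alternation(\@items);

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    alternation => sub { replace_alternation($text, $regex) },
    loop        => sub { replace_loop($text, \@items) },
});
```

One caveat when comparing the outputs: the two variants are not guaranteed to produce the same string. In the loop, a later pattern can match inside text inserted by an earlier substitution (for example, the pattern "ba" matches inside "foobar"), whereas the single alternation pass never rescans its own replacements. So differing MD5 sums between the two runs do not by themselves indicate a bug.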