Did the inefficiency of /i get fixed?

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Did the inefficiency of /i get fixed? by japhy (Canon) on May 19, 2004 at 04:56 UTC
All these posts have used regexes that are too simple. Here's my benchmark (on 5.8.3): use Benchmark 'cmpthese'; my $y = "HELLO world" x 100; cmpthese(-5, { cc_simple => sub { $y =~ /[hH][eE][lL][lL][oO]/; }, cc_complex => sub { $y =~ /\w+ [hH][eE][lL][lL][oO] \w+/; }, i_simple => sub { $y =~ /hello/i; }, i_complex => sub { $y =~ /\w+ hello \w+/i; }, simple => sub { $y =~ /HELLO/; }, complex => sub { $y =~ /\w+ HELLO \w+/; }, simple_fail => sub { $y =~ /hELLO/; }, complex_fail => sub { $y =~ /\w+ hELLO \w+/; }, }); __END__ Rate cc_complex i_complex complex complex_fail simpl +e_fail cc_simple i_simple simple cc_complex 13973/s -- -2% -82% -83% + -87% -99% -99% -99% i_complex 14271/s 2% -- -81% -83% + -87% -99% -99% -99% complex 77103/s 452% 440% -- -7% + -31% -93% -93% -94% complex_fail 82560/s 491% 479% 7% -- + -26% -92% -93% -94% simple_fail 111335/s 697% 680% 44% 35% + -- -90% -90% -92% cc_simple 1065985/s 7529% 7370% 1283% 1191% + 857% -- -5% -23% i_simple 1120558/s 7919% 7752% 1353% 1257% + 906% 5% -- -19% simple 1377591/s 9759% 9553% 1687% 1569% + 1137% 29% 23% -- [download] You can see that, for the very simple regex, the /i regex is faster than the charclass regex -- but this is because, even with case-sensitivity on, the regex `/hello/` is a simple Boyer-Moore search (compare "simple" and "i_simple"). Once you get into a complex regex -- one that requires actual pattern matching -- you can see that /i and charclasses operate about the same (compare "cc_complex" and "i_complex"). _____________________________________________________ Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area) `s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`	[reply] [d/l] [select]
Re: Re: Did the inefficiency of /i get fixed? by itub (Priest) on May 19, 2004 at 14:22 UTC
The inefficiency is true and very important if you are using Unicode strings, because lowercasing Unicode characters is slow. Consider this modified benchmark: `use utf8; use Benchmark 'cmpthese'; my $y = "HELLO ΑΙΝΣΪ" x 100; # make sure your script is encoded in UTF-8 when you save it! # ... the rest of the code is the same as in the parent node` [download] Results: Rate i_complex cc_complex complex complex_fail i_simple simple_fail simple cc_simple i_complex 672/s -- -68% -70% -71% -98% -100% -100% -100% cc_complex 2078/s 209% -- -9% -10% -94% -99% -100% -100% complex 2278/s 239% 10% -- -2% -94% -99% -100% -100% complex_fail 2317/s 245% 12% 2% -- -94% -99% -100% -100% i_simple 35860/s 5234% 1626% 1474% 1448% -- -79% -95% -95% simple_fail 168298/s 24935% 8001% 7289% 7164% 369% -- -76% -78% simple 703114/s 104489% 33744% 30771% 30248% 1861% 318% -- -10% cc_simple 780570/s 116011% 37472% 34172% 33591% 2077% 364% 11% -- Character classes are 3 times faster than /i for the complex case and 21 times faster for the simple case!	[reply] [d/l]
Re: Did the inefficiency of /i get fixed? by Belgarion (Chaplain) on May 19, 2004 at 04:09 UTC
To amplify what sgifford said: it would appear that the longer the string being matched, the bigger the difference between the two methods. For example, using sgifford's script, but with a `$s` of one character, the numbers on my machine are: `Rate brackets slash i brackets 719424/s -- -7% slash i 775194/s 8% --` [download] While, when `$s` is 16KB long, the difference is more pronounced, with the following: `Rate brackets slash i brackets 3508/s -- -66% slash i 10434/s 197% --` [download] Now the interesting question: sgifford's regex is very simple. What happens when a more complicated test is used. When using the following test code: `#!/usr/bin/perl -w use strict; use Benchmark 'cmpthese'; our $s = "HELLOWORLD" x 1024; cmpthese(1_000_000, { 'slash_i' => sub { $s =~ m/hello/i; $s =~ m/jello +/i }, 'brackets' => sub { $s =~ m/[Hh][Ee][Ll][Ll][Oo]/ +; $s =~ m/[Jj][Ee][Ll][Ll][Oo]/ +; }, });` [download] produce the following results: `Rate brackets slash_i brackets 5633/s -- -66% slash_i 16683/s 196% --` [download] It would appear that the length of the string amplifies the difference more than the complexity of the regular expression. (At least when comparing /[Tt]/ with /t/i.)	[reply] [d/l] [select]
Re: Did the inefficiency of /i get fixed? by sgifford (Prior) on May 19, 2004 at 03:40 UTC
slash-i now seems to be faster: `#!/usr/bin/perl -w use strict; use Benchmark; our $s = "A" x 1024; timethese(1_000_000, { 'slash i' => sub { $s =~ /a/i; $s =~ /b/i; }, 'brackets' => sub { $s =~ /[Aa]/; $s =~ /[Bb]/; }, });` [download] produces: `Benchmark: timing 1000000 iterations of brackets, slash i... brackets: 24 wallclock secs (24.20 usr + 0.00 sys = 24.20 CPU) @ 41322.31/s (n=1000000) slash i: 14 wallclock secs (13.47 usr + 0.00 sys = 13.47 CPU) @ 74239.05/s (n=1000000)` [download]	[reply] [d/l] [select]
Re: Did the inefficiency of /i get fixed? by pbeckingham (Parson) on May 19, 2004 at 03:47 UTC
I'm seeing a 9% improvement in caseless compare over character class. `#!/usr/bin/perl use Benchmark qw(cmpthese); my @samples = ( 'this is a test', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', '_______________T________', '_______________t________'); cmpthese -1 => { class => 'grep (/[Tt]/, @samples)', caseless => 'grep (/t/i, @samples)'}; Rate class caseless class 2299509/s -- -8% caseless 2502283/s 9% --` [download]	[reply] [d/l]
Re: Did the inefficiency of /i get fixed? by davido (Cardinal) on May 19, 2004 at 04:26 UTC
The issue is chronicled in Mastering Regular Expressions (The Owls Book) by Jeoffrey Friedl. I still only have the first edition of MRE, but I dug it out to refresh my memory on the issue. According to Friedl, /i causes a temporary copy to be made of the entire target string. This is in addition to any copies made by the dirty variables like $&. The /i copy is done prior to any match occurring, whereas $& only makes a copy after a successful match. After the copy is made, the RE engine takes a second pass over the string, converting upper case letters to lower case, even if the original was already all lowercase. As the RE itself is compiled, it too has all of its literals converted to lowercase. So you have extra copies being made, target string lowercase conversion, a lowercasing of the RE itself during the compilation phase of the RE, and that all adds up to, as Friedl puts it, "one of the most gratuitous inefficiencies in all of Perl." (He is a RE guy though.) In Friedl's testing, he found worst case scenarios that took over four orders of magnitude with regards to using the /i modifier on strings of 1192395 bytes (1.2MB). He calculated that due to the "needless copying", Perl shuffled more than 647,585MB around inside his CPU. That was a worst case test. In what he considered more of a real-world test, he found that `m/\b[wW][hH][iI][lL][eE]\b/g` resulted in a testrun of fifty times faster than `m/\bwhile\b/gi` on a huge string. His conclusion was, "don't use /i unless you really have to." I've done a quick skim through the various perldelta documents and haven't seen any mention since 5.005 of an improved and more efficient /i modifier. That doesn't mean I couldn't have missed it; there's a lot to skim. Someone may correct me, but it looks like at least for now that part of the RE engine hasn't changed. Dave	[reply] [d/l] [select]
Re: Re: Did the inefficiency of /i get fixed? by Fuzzy Frog (Sexton) on May 19, 2004 at 05:49 UTC
I do have a copy of the second edition. At the top of page 255 Friedl says Somewhat related, users of Perl that read the first edition of this book may sometimes write something like `^[Ff][Rr][Oo][Mm]:` instead of a case-insensitive use of `^from:`. Old versions of Perl were very ineffecient with their case-insensitive matching, so I recommended the use of classes like this in some cases. That recomendation has been lifted, as the case-insensitive inefficiency has been fixed for some years now. I have not read the first edition of the book, so i don't know whether it's worthwhile to upgrade. I can say that Mastering Regular Expressions is one of the most useful and readable reference manuals I have ever seen. It has my highest recomentation. -- Fuzzy Frog My "highest recomendation" may not be much, but it's all I have to give.	[reply] [d/l]
Re: Re: Did the inefficiency of /i get fixed? by diotalevi (Canon) on May 19, 2004 at 12:57 UTC
I wonder sometimes whether anyone things to go read the source to see what the answer is. /i appears to change such nodes as `EXACT` to `EXACTF` which for non-unicode is a "find the first character of the match, then call util.c:Perl_ibcmp which is a dainty little function which relies on a built-in case folding table. There's no implicit lc() and copy to a buffer going on here. If you have a utf8 regular expression then it does what Friedl's book suggests. If you merely have utf8 data then see utf8.c:Perl_ibcmp_utf8 which is long enough to deter me from wanting to read it at the moment but still doesn't appear to make a separate lc()'d copy of the data. This is all from looking at 5.8.3, btw.	[reply]
Re: Did the inefficiency of /i get fixed? by benrwebb (Scribe) on May 19, 2004 at 14:54 UTC
While I am impressed with everyone's research into the relative performance between `m/foo/i` and `m/[fF][oO][oO]/` , Isn't it most important to keep your code readable and maintainable? Unless you are experiencing a specific performance problem, I wouldn't worry too much over it. And if you ARE experiencing a performance problem, I would make sure that the comments were properly verbose: `# This was changed in the interest of performance, the # regex are equivilent in function # /new/i /[nN][eE][wW]/` [download] I'm generally against sacrificing maintainablilty for performance without strong justification, but there are of course those who disagree (and in many situations, they're right)	[reply] [d/l] [select]
Re: Re: Did the inefficiency of /i get fixed? by itub (Priest) on May 19, 2004 at 15:06 UTC
I agree that including comments is a good idea, but the performance difference I mentioned above is really important IMO for any large problem. I once had to apply some regexes a 1 GB UTF-8 file in a case-insensitive way, and that time jumping through hoops in the interest of performance really mattered.	[reply]
Re: Re: Re: Did the inefficiency of /i get fixed? by benrwebb (Scribe) on May 19, 2004 at 17:26 UTC
Exactly my point. If I'm just, oh for instance parsing a command line arg, or some interactive user input, I'm going for readability every time. If I'm in a situation where I'm doing a data transformation between two systems (say an OS390 to UNIX pipe, not that I've ever had to do that :-) ), I'll look at performance more closely (although I'll admit I still write it the quick, easy way first and then optimize after the fact).	[reply]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks