Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer

Did the inefficiency of /i get fixed?

by Cody Pendant (Prior)
on May 19, 2004 at 03:21 UTC ( #354506=perlquestion: print w/replies, xml ) Need Help??
Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm just reading the doco for HTML::Template
Q: Why do you use /[Tt]/ instead of /t/i? It's so ugly!

A: Simple - the case-insensitive match switch is very inefficient. According to _Mastering_Regular_Expressions_ from O'Reilly Press, /[Tt]/ is faster and more space efficient than /t/i - by as much as double against long strings. //i essentially does a lc() on the string and keeps a temporary copy in memory. When this changes, and it is in the 5.6 development series, I will gladly use //i. Believe me, I realize /[Tt]/ is hideously ugly.

What's the story with that whole thing? I certainly read that in the Regex book ("The Owl Book"?) ages ago, but have been proceding on the assumption it got fixed, but I realise now that I've never seen it discussed.

Has it been fixed? If not, why not?

If it has, shouldn't modules and a bunch of other stuff have undergone a revision as a result?

=~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: Did the inefficiency of /i get fixed?
by japhy (Canon) on May 19, 2004 at 04:56 UTC
    All these posts have used regexes that are too simple. Here's my benchmark (on 5.8.3):
    use Benchmark 'cmpthese'; my $y = "HELLO world" x 100; cmpthese(-5, { cc_simple => sub { $y =~ /[hH][eE][lL][lL][oO]/; }, cc_complex => sub { $y =~ /\w+ [hH][eE][lL][lL][oO] \w+/; }, i_simple => sub { $y =~ /hello/i; }, i_complex => sub { $y =~ /\w+ hello \w+/i; }, simple => sub { $y =~ /HELLO/; }, complex => sub { $y =~ /\w+ HELLO \w+/; }, simple_fail => sub { $y =~ /hELLO/; }, complex_fail => sub { $y =~ /\w+ hELLO \w+/; }, }); __END__ Rate cc_complex i_complex complex complex_fail simpl +e_fail cc_simple i_simple simple cc_complex 13973/s -- -2% -82% -83% + -87% -99% -99% -99% i_complex 14271/s 2% -- -81% -83% + -87% -99% -99% -99% complex 77103/s 452% 440% -- -7% + -31% -93% -93% -94% complex_fail 82560/s 491% 479% 7% -- + -26% -92% -93% -94% simple_fail 111335/s 697% 680% 44% 35% + -- -90% -90% -92% cc_simple 1065985/s 7529% 7370% 1283% 1191% + 857% -- -5% -23% i_simple 1120558/s 7919% 7752% 1353% 1257% + 906% 5% -- -19% simple 1377591/s 9759% 9553% 1687% 1569% + 1137% 29% 23% --
    You can see that, for the very simple regex, the /i regex is faster than the charclass regex -- but this is because, even with case-sensitivity on, the regex /hello/ is a simple Boyer-Moore search (compare "simple" and "i_simple"). Once you get into a complex regex -- one that requires actual pattern matching -- you can see that /i and charclasses operate about the same (compare "cc_complex" and "i_complex").
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      The inefficiency is true and very important if you are using Unicode strings, because lowercasing Unicode characters is slow. Consider this modified benchmark:

      use utf8; use Benchmark 'cmpthese'; my $y = "HELLO ΑΙΝΣΪ" x 100; # make sure your script is encoded in UTF-8 when you save it! # ... the rest of the code is the same as in the parent node


                       Rate i_complex cc_complex complex complex_fail i_simple simple_fail simple cc_simple
      i_complex       672/s        --       -68%    -70%         -71%     -98%       -100%  -100%     -100%
      cc_complex     2078/s      209%         --     -9%         -10%     -94%        -99%  -100%     -100%
      complex        2278/s      239%        10%      --          -2%     -94%        -99%  -100%     -100%
      complex_fail   2317/s      245%        12%      2%           --     -94%        -99%  -100%     -100%
      i_simple      35860/s     5234%      1626%   1474%        1448%       --        -79%   -95%      -95%
      simple_fail  168298/s    24935%      8001%   7289%        7164%     369%          --   -76%      -78%
      simple       703114/s   104489%     33744%  30771%       30248%    1861%        318%     --      -10%
      cc_simple    780570/s   116011%     37472%  34172%       33591%    2077%        364%    11%        --

      Character classes are 3 times faster than /i for the complex case and 21 times faster for the simple case!

Re: Did the inefficiency of /i get fixed?
by Belgarion (Chaplain) on May 19, 2004 at 04:09 UTC

    To amplify what sgifford said: it would appear that the longer the string being matched, the bigger the difference between the two methods. For example, using sgifford's script, but with a $s of one character, the numbers on my machine are:

    Rate brackets slash i brackets 719424/s -- -7% slash i 775194/s 8% --

    While, when $s is 16KB long, the difference is more pronounced, with the following:

    Rate brackets slash i brackets 3508/s -- -66% slash i 10434/s 197% --

    Now the interesting question: sgifford's regex is very simple. What happens when a more complicated test is used. When using the following test code:

    #!/usr/bin/perl -w use strict; use Benchmark 'cmpthese'; our $s = "HELLOWORLD" x 1024; cmpthese(1_000_000, { 'slash_i' => sub { $s =~ m/hello/i; $s =~ m/jello +/i }, 'brackets' => sub { $s =~ m/[Hh][Ee][Ll][Ll][Oo]/ +; $s =~ m/[Jj][Ee][Ll][Ll][Oo]/ +; }, });

    produce the following results:

    Rate brackets slash_i brackets 5633/s -- -66% slash_i 16683/s 196% --

    It would appear that the length of the string amplifies the difference more than the complexity of the regular expression. (At least when comparing /[Tt]/ with /t/i.)

Re: Did the inefficiency of /i get fixed?
by sgifford (Prior) on May 19, 2004 at 03:40 UTC
    slash-i now seems to be faster:
    #!/usr/bin/perl -w use strict; use Benchmark; our $s = "A" x 1024; timethese(1_000_000, { 'slash i' => sub { $s =~ /a/i; $s =~ /b/i; }, 'brackets' => sub { $s =~ /[Aa]/; $s =~ /[Bb]/; }, });
    Benchmark: timing 1000000 iterations of brackets, slash i... brackets: 24 wallclock secs (24.20 usr + 0.00 sys = 24.20 CPU) @ 41322.31/s (n=1000000) slash i: 14 wallclock secs (13.47 usr + 0.00 sys = 13.47 CPU) @ 74239.05/s (n=1000000)
Re: Did the inefficiency of /i get fixed?
by pbeckingham (Parson) on May 19, 2004 at 03:47 UTC

    I'm seeing a 9% improvement in caseless compare over character class.

    #!/usr/bin/perl use Benchmark qw(cmpthese); my @samples = ( 'this is a test', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', '_______________T________', '_______________t________'); cmpthese -1 => { class => 'grep (/[Tt]/, @samples)', caseless => 'grep (/t/i, @samples)'}; Rate class caseless class 2299509/s -- -8% caseless 2502283/s 9% --

Re: Did the inefficiency of /i get fixed?
by davido (Archbishop) on May 19, 2004 at 04:26 UTC
    The issue is chronicled in Mastering Regular Expressions (The Owls Book) by Jeoffrey Friedl. I still only have the first edition of MRE, but I dug it out to refresh my memory on the issue.

    According to Friedl, /i causes a temporary copy to be made of the entire target string. This is in addition to any copies made by the dirty variables like $&. The /i copy is done prior to any match occurring, whereas $& only makes a copy after a successful match.

    After the copy is made, the RE engine takes a second pass over the string, converting upper case letters to lower case, even if the original was already all lowercase.

    As the RE itself is compiled, it too has all of its literals converted to lowercase.

    So you have extra copies being made, target string lowercase conversion, a lowercasing of the RE itself during the compilation phase of the RE, and that all adds up to, as Friedl puts it, "one of the most gratuitous inefficiencies in all of Perl." (He is a RE guy though.)

    In Friedl's testing, he found worst case scenarios that took over four orders of magnitude with regards to using the /i modifier on strings of 1192395 bytes (1.2MB). He calculated that due to the "needless copying", Perl shuffled more than 647,585MB around inside his CPU.

    That was a worst case test. In what he considered more of a real-world test, he found that m/\b[wW][hH][iI][lL][eE]\b/g resulted in a testrun of fifty times faster than m/\bwhile\b/gi on a huge string.

    His conclusion was, "don't use /i unless you really have to."

    I've done a quick skim through the various perldelta documents and haven't seen any mention since 5.005 of an improved and more efficient /i modifier. That doesn't mean I couldn't have missed it; there's a lot to skim. Someone may correct me, but it looks like at least for now that part of the RE engine hasn't changed.


      I do have a copy of the second edition. At the top of page 255 Friedl says

      Somewhat related, users of Perl that read the first edition of this book may sometimes write something like ^[Ff][Rr][Oo][Mm]: instead of a case-insensitive use of ^from:. Old versions of Perl were very ineffecient with their case-insensitive matching, so I recommended the use of classes like this in some cases. That recomendation has been lifted, as the case-insensitive inefficiency has been fixed for some years now.

      I have not read the first edition of the book, so i don't know whether it's worthwhile to upgrade. I can say that Mastering Regular Expressions is one of the most useful and readable reference manuals I have ever seen. It has my highest recomentation.

      -- Fuzzy Frog
      My "highest recomendation" may not be much, but it's all I have to give.

      I wonder sometimes whether anyone things to go read the source to see what the answer is. /i appears to change such nodes as `EXACT` to `EXACTF` which for non-unicode is a "find the first character of the match, then call util.c:Perl_ibcmp which is a dainty little function which relies on a built-in case folding table. There's no implicit lc() and copy to a buffer going on here.

      If you have a utf8 regular expression then it does what Friedl's book suggests.

      If you merely have utf8 data then see utf8.c:Perl_ibcmp_utf8 which is long enough to deter me from wanting to read it at the moment but still doesn't appear to make a separate lc()'d copy of the data.

      This is all from looking at 5.8.3, btw.

Re: Did the inefficiency of /i get fixed?
by benrwebb (Scribe) on May 19, 2004 at 14:54 UTC
    While I am impressed with everyone's research into the relative performance between m/foo/i and m/[fF][oO][oO]/ , Isn't it most important to keep your code readable and maintainable? Unless you are experiencing a specific performance problem, I wouldn't worry too much over it. And if you ARE experiencing a performance problem, I would make sure that the comments were properly verbose:

    # This was changed in the interest of performance, the # regex are equivilent in function # /new/i /[nN][eE][wW]/

    I'm generally against sacrificing maintainablilty for performance without strong justification, but there are of course those who disagree (and in many situations, they're right)

      I agree that including comments is a good idea, but the performance difference I mentioned above is really important IMO for any large problem. I once had to apply some regexes a 1 GB UTF-8 file in a case-insensitive way, and that time jumping through hoops in the interest of performance really mattered.

        Exactly my point. If I'm just, oh for instance parsing a command line arg, or some interactive user input, I'm going for readability every time. If I'm in a situation where I'm doing a data transformation between two systems (say an OS390 to UNIX pipe, not that I've ever had to do that :-) ), I'll look at performance more closely (although I'll admit I still write it the quick, easy way first and then optimize after the fact).

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://354506]
Approved by pjf
Front-paged by pbeckingham
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (2)
As of 2018-05-21 04:14 GMT
Find Nodes?
    Voting Booth?