
Case in regular expressions

by RolandGunslinger (Curate)
on Sep 12, 2003 at 17:25 UTC ( [id://291140] : perlquestion )

RolandGunslinger has asked for the wisdom of the Perl Monks concerning the following question:

I am still quite new to Perl, so please forgive my ignorance. Is it possible to write a regular expression such that the case of the search string and of the strings being searched doesn't matter? I want the user to be able to specify something to search for and find it regardless of case. The only way I know is to force everything to upper case. Thanks for any advice.

Replies are listed 'Best First'.
Re: Case in regular expressions
by hardburn (Abbot) on Sep 12, 2003 at 17:31 UTC

    Use the i modifier.

    my $foo = 'BaR';
    $foo =~ /bar/i;

    See also perlre.
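
    Since the question is about a search term supplied by the user, here is a minimal sketch of how /i might be combined with \Q...\E so that regex metacharacters in the input are matched literally. The helper name contains_ci is made up for illustration; it is not from perlre or any module.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# true if $text contains $term, ignoring case.
sub contains_ci {
    my ($text, $term) = @_;

    # Guard against an empty term: an empty pattern has special
    # meaning in Perl (it reuses the last successful pattern).
    return 0 unless defined $term && length $term;

    # \Q...\E quotes metacharacters such as '.' or '$' in the input;
    # /i makes the match case-insensitive.
    return $text =~ /\Q$term\E/i ? 1 : 0;
}

for my $pair ( [ 'Pathologically Eclectic', 'ECLECTIC' ],
               [ 'price: $5.00 (sale)',     '$5.00'    ],
               [ 'abc',                     'a.c'      ] ) {
    my ($text, $term) = @$pair;
    printf "%-8s in %-25s %s\n", $term, "'$text'",
        contains_ci($text, $term) ? 'found' : 'not found';
}
```

    Without \Q...\E, a term such as a.c would be treated as a pattern and would match abc.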

    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    Note: All code is untested, unless otherwise stated

Re: Case in regular expressions
by davido (Cardinal) on Sep 12, 2003 at 18:00 UTC
    The obvious answer is to use the /i modifier at the end of the regular expression to cause the regexp to behave in a "case insensitive" way. You can do this as follows:

    my $string = "This needs to match!";
    if ( $string =~ /mAtCh/i ) {
        print $string, " matched.\n";
    }

    In many situations this is the right way to do it.

    But as Friedl points out in Mastering Regular Expressions (the Owl book), the /i modifier can be extremely costly if you're scanning through a lot of text. See the section called "Perl Efficiency Issues" for details.

    The net result can be minimal on just a line or two of text, but as Friedl illustrates, in searching case insensitively for while m/./gi in a 1MB file (read as a single line), the match with /i took a day and a half to complete, whereas without /i, the match completed in 12 seconds.

    Again, paraphrasing Friedl...
    This was obviously a worst-case scenario. But even in the comparison of m/\bwhile\b/gi against m/\b[Ww][Hh][Ii][Ll][Ee]\b/g, the non-/i version was 50 times faster (though a lot uglier and still pretty inefficient because it prevents Perl's fixed string check and renders study useless).

    My suggestion is that if you have to match on a small string of text in a case-insensitive way, use /i. But if the string is likely to be quite large, and efficiency matters to you, find an alternative to the /i modifier.

    Here is one possible alternative:

    my $string = "Here is some TEXT.";
    {
        my $teststr = lc $string;
        if ( $teststr =~ m/text/ ) {
            print "$string matched.\n";
        }
    }

    Admittedly this method makes a copy of $string. You could avoid that if you didn't mind converting $string itself to lower or upper case with lc or uc. But the point is that the /i modifier can actually cause multiple copies of the same string to be made and later discarded. In a worst-case scenario, a 1MB string that Friedl used had over 600MB of data being copied around by the regexp engine as it tried to match while applying the /i modifier. In a real-world case, the penalty of using /i is much smaller. But just as we take notice any time $&, $`, and $' are used, take notice whenever you use /i.
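
    When the thing being searched for is a plain fixed string rather than a pattern, the same lowercase-a-copy idea can be done without the regexp engine at all, using the built-in index and lc functions. This is a different technique from the one above, shown only as a sketch; the helper name contains_ci_index is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# case-insensitive substring test using index() instead of a regex.
sub contains_ci_index {
    my ($haystack, $needle) = @_;

    # Lowercase copies of both strings, then a plain substring search.
    # index() returns -1 when the substring is not found.
    return index( lc($haystack), lc($needle) ) >= 0 ? 1 : 0;
}

my $string = "Here is some TEXT.";
if ( contains_ci_index( $string, 'text' ) ) {
    print "$string matched.\n";
}
```

    Like the lc example above, this still copies the string, and it only handles literal substrings, not patterns.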

    Thanks to Friedl's Mastering Regular Expressions book, we don't all have to test the /i switch on huge files to verify its efficiency; we can take his word for it. He's done all the research on the subject we need.

    /i is a tool, and is there to be used, just like $&, $`, and $'. Clearly its use is not deprecated. But it is a tool that comes at perhaps a higher efficiency cost than unsuspecting users might imagine. Understand the ramifications, and then plan your code accordingly.


    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

      In the second edition of his book, Friedl drily announces that readers of the first edition need not worry any longer about the /i modifier, since the issue has been already fixed.

      Here is a test, showing that the difference, if any, is rather small. Getting less than 0.20 sec difference with ten thousand iterations on a one-million-char string, I would choose the /i modifier any time.

      It all depends on your version of Perl and your machine speed, but if you have a recent release of both, you can safely use the /i modifier without losing much sleep.

      #!/usr/bin/perl -w
      use strict;
      use Benchmark qw(timethese);

      for my $size (10_000, 100_000, 1_000_000) {
          my $string = 'a' x $size . ' While ';
          print "string size $size\n";
          my $x;
          timethese(10_000, {
              i_modifier => sub {
                  $x = 1 if $string =~ m/\bwhile\b/i;
              },
              char_class => sub {
                  $x = 1 if $string =~ m/\b[Ww][Hh][Ii][Ll][Ee]\b/;
              }
          });
      }
      __END__

      Perl 5.6.1
      ==========
      string size 10000
      Benchmark: timing 10000 iterations of char_class, i_modifier...
      char_class:  1 wallclock secs ( 0.62 usr +  0.00 sys =  0.62 CPU)
      i_modifier:  1 wallclock secs ( 0.62 usr +  0.00 sys =  0.62 CPU)
      string size 100000
      Benchmark: timing 10000 iterations of char_class, i_modifier...
      char_class:  6 wallclock secs ( 6.05 usr +  0.01 sys =  6.06 CPU)
      i_modifier:  6 wallclock secs ( 6.05 usr +  0.01 sys =  6.06 CPU)
      string size 1000000
      Benchmark: timing 10000 iterations of char_class, i_modifier...
      char_class: 64 wallclock secs (62.17 usr +  0.31 sys = 62.48 CPU)
      i_modifier: 63 wallclock secs (61.87 usr +  0.25 sys = 62.12 CPU)

      ActiveState Perl 5.8.0 (optimized for Pentium architecture)
      ===========================================================
      string size 10000
      Benchmark: timing 10000 iterations of char_class, i_modifier...
      char_class:  1 wallclock secs ( 0.51 usr +  0.00 sys =  0.51 CPU)
      i_modifier:  1 wallclock secs ( 0.52 usr +  0.00 sys =  0.52 CPU)
      string size 100000
      Benchmark: timing 10000 iterations of char_class, i_modifier...
      char_class:  5 wallclock secs ( 5.12 usr +  0.02 sys =  5.14 CPU)
      i_modifier:  5 wallclock secs ( 5.23 usr +  0.02 sys =  5.25 CPU)
      string size 1000000
      Benchmark: timing 10000 iterations of char_class, i_modifier...
      char_class: 51 wallclock secs (50.74 usr +  0.15 sys = 50.89 CPU)
      i_modifier: 51 wallclock secs (50.91 usr +  0.15 sys = 51.06 CPU)

        Thank you so much for the update. Having upgraded my version of Perl, and my computer, several times since first acquiring MRE, it looks like it's time to acquire the updated book too. ;)

        Your example may not produce as much of a "Worst Case Scenario" as Friedl's. He scanned a portion of the source code of his version of the 'C' compiler, which at the time was about a 1.1MB file, and certainly 'while' appeared earlier than the last position in the string, and probably appeared multiple times, amid a lot of other line noise and false starts, so to speak. But your example definitely does show that the efficiency cost of /i has been greatly reduced. Your effort paid off.

        I do know that the perldocs suggest that the $& penalty has been reduced in its scope and its severity to the point that it's a lot safer to use it. For one thing, it only affects the current regexp. $` and $' apparently are still much more costly.

        I was thinking about the issue again last night. It seems to me that under the older implementations, where /i was significantly more costly, its cost grew roughly exponentially with the size of the string it was being used on. Frankly, I have no idea what the actual big "O" notation would be for /i under the old implementations. But if I'm roughly accurate in asserting that the efficiency penalty grew exponentially as a string grew in size, it makes sense to split strings up into smaller components. If scanning a 1.1MB string took 1.5 days, I'll bet that scanning eleven 100KB strings would take only a fraction of that time, since the regexp engine simply wouldn't have as much to keep track of in each scan... it wouldn't get as bogged down in its own churning.

        I believe that concept can be applied more generally to regular expressions. It is probably nearly always quicker to match 1MB as ten smaller strings than as one 1MB string, even with the additional overhead of cranking up the engine ten times. This is all just personal theory, as I have yet to benchmark it. But when I do, I'll post my findings. Obviously there has to be some point at which it's just not beneficial to make the string any smaller. And at some point you also have to say, this is Perl, not hand-optimized machine code. Move on.
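
        For anyone who wants to try that theory, here is one way the comparison might be set up with the Benchmark module. It is only a sketch under assumed test data (a made-up 1MB string with 'While' near the end of each 100,000-character block), and the names count_whole and count_chunks are invented for illustration. No timings are claimed here; results will depend on the Perl version and the machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: ten blocks of 100_000 characters each,
# with ' While ' occupying the last 7 characters of every block.
my $chunk_size = 100_000;
my $block      = ( 'a' x ( $chunk_size - 7 ) ) . ' While ';
my $text       = $block x 10;

# Split the big string into fixed-size pieces. A match that straddles
# a chunk boundary would be missed; this data is built to avoid that.
my @chunks;
for ( my $pos = 0; $pos < length $text; $pos += $chunk_size ) {
    push @chunks, substr( $text, $pos, $chunk_size );
}

# Count matches in the single large string.
sub count_whole {
    my $n = () = $text =~ /\bwhile\b/gi;
    return $n;
}

# Count matches chunk by chunk and add them up.
sub count_chunks {
    my $n = 0;
    for my $chunk (@chunks) {
        my $k = () = $chunk =~ /\bwhile\b/gi;
        $n += $k;
    }
    return $n;
}

# Run each version for at least one CPU second and print a comparison.
cmpthese( -1, {
    whole  => \&count_whole,
    chunks => \&count_chunks,
} );
```

        Real data would need some handling for matches that cross a chunk boundary, for example by splitting on line breaks instead of at fixed offsets.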

        In the first edition of MRE, Friedl did suggest that he had no idea why the /i modifier had to be so costly. It was apparent to him that there was copying (of the string being scanned for matches) that simply didn't need to be there.

        Also missing from the first edition are some of the newer, more experimental Regexp components, such as (?> .... ). I had to turn to the perldocs to figure out what it meant when I saw Abigail II use it the other day in a post.

        Generally, it is safe to refer to the perldocs as the most up-to-date authority. The problem with respect to regular expressions is that Friedl's book is so much better than any of the online documentation that it is tempting to refer to it instead, and this time it tripped me up.

        Thanks again for the update.


        "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

      I read Friedl on that memory issue with /i a couple of years ago and somehow in the back of my mind I thought it would have been fixed by now.

      It's not fixed in any Perl 5 version? Will it be fixed in Perl 6?

      ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print
      You are simply amazing, Dave. I am in awe.
Re: Case in regular expressions
by tcf22 (Priest) on Sep 12, 2003 at 17:32 UTC
    Use the /i modifier.
    $var =~ /PaTtErN/i;

    - Tom