PerlMonks  

international case insensitive searched with Perl

by mwhiting (Beadle)
on Oct 03, 2011 at 18:31 UTC (#929400, perlquestion)
mwhiting has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks - how do I get my Perl regex to recognize international characters as the same in upper or lowercase?

For example if I try to match léger and LÉGER in a case insensitive match, Perl doesn't consider them to match. How do I do this?

Thanks!

Re: international case insensitive searched with Perl
by pvaldes (Chaplain) on Oct 03, 2011 at 19:47 UTC

    A quick solution could be to use a character class, []:

    $search =~ /AnyPatt[éÉ]rn/i;
Re: international case insensitive searched with Perl
by Anonymous Monk on Oct 03, 2011 at 19:53 UTC
    #!perl
    use 5.014;
    use warnings;

    my $string = 'LÉGER';
    print $string =~ /léger/i ? 'matched' : 'no match';

    Prints 'matched' for me.

    ... of course if you're using an older version of perl, you need to put a little more work into enforcing a match under Unicode rules.

    TJD

      More info:

      use feature 'unicode_strings' is the magic that works in 5.14.

      Check out 'The "Unicode Bug"' in perldoc perlunicode.
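
      A minimal sketch of that, assuming perl 5.14 or later. The string and pattern are written with \x escapes (an assumption made here so the result does not depend on how the source file is encoded):

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # available from 5.12; affects regex matching from 5.14

# "\xC9" is É and "\xE9" is é, written as escapes rather than literals.
my $string = "L\xC9GER";
my $result = $string =~ /l\xE9ger/i ? 'matched' : 'no match';

print "$result\n";
```

      With the feature in scope this should report a match; on a perl older than 5.12 the use feature line itself fails, so it is no help there.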

      TJD

      What kind of extra work is involved to make it match under Unicode rules? I'm guessing it's a fair bit, so I might end up doing a customer-specific workaround to deal with it for them.
        use feature 'unicode_strings' tells perl to apply Unicode rules to all string operations in its scope, whatever the internal encoding of the string.

        use utf8 tells perl that the current source file is encoded in UTF-8, so string literals in it are treated as Unicode characters.

        Opening input files with an :encoding layer tells perl to decode the data as it is read, so the resulting strings are Unicode.

        use Encode provides subs (such as decode and encode) that can be used to convert strings from any other source to or from Unicode.
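
        As a rough illustration of the Encode route, which should also work on older perls (Encode has shipped with perl since 5.8): the sketch below assumes the data arrives as raw UTF-8 bytes, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes, as they might arrive from a file or form that was not decoded.
my $bytes_upper = "L\xC3\x89GER";   # LÉGER
my $bytes_lower = "l\xC3\xA9ger";   # léger

# Matching the undecoded bytes compares \x89 with \xA9, which are unrelated.
my $raw_match = $bytes_upper =~ /\Q$bytes_lower\E/i ? 1 : 0;

# After decoding, each string holds characters, and /i folds É and é together.
my $upper = decode('UTF-8', $bytes_upper);
my $lower = decode('UTF-8', $bytes_lower);
my $decoded_match = $upper =~ /\Q$lower\E/i ? 1 : 0;

print "raw bytes: ",  ($raw_match     ? 'matched' : 'no match'), "\n";
print "decoded:   ",  ($decoded_match ? 'matched' : 'no match'), "\n";
```

        If the data is in another encoding (ISO-8859-1, say), pass that name to decode instead of 'UTF-8'.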

        Hope this helps

        TJD

Re: international case insensitive searched with Perl
by CountZero (Bishop) on Oct 04, 2011 at 05:56 UTC
    In my Strawberry Perl 5, version 12, subversion 1 (v5.12.1), it works, but only if I use utf8;.

    use Modern::Perl;
    use utf8;

    my $string = 'LÉGER';
    print $string =~ /léger/i ? 'matched' : 'no match';

    Prints "matched".

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      In fact it works with 5.10 just using:

      use strict;
      use warnings;
      use utf8;

      my $string = 'LÉGER';
      print $string =~ /léger/i ? 'matched' : 'no match';

      Prints:

      matched
      True laziness is hard work
Re: international case insensitive searched with Perl
by Khen1950fx (Canon) on Oct 04, 2011 at 08:57 UTC
    On my Linux system, I get "no match". Evidently, Perl doesn't consider them a match because, character for character, they aren't one: É is \311 and é is \351. For example:

    #!perl -sl
    use strict;
    use warnings;
    binmode STDOUT, ':encoding(utf8)';
    print my $str1 = "L\311GER";
    print my $str2 = "l\351ger";

    That's on 5.8.8; however, on 5.14.x and above, Anonymous Monk's suggestion about use feature 'unicode_strings' worked for me.

    Update: fixed typo.
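
    For a perl too old for unicode_strings, one possible workaround (a sketch, not something tested on 5.8.8 in this thread) is to upgrade the target string to perl's internal UTF-8 form, which switches the match to Unicode case rules. It reuses the two strings above; utf8::upgrade is built in and does not need use utf8.

```perl
use strict;
use warnings;

my $str1 = "L\311GER";   # LÉGER as a native 8-bit string
my $pat  = "l\351ger";   # léger as a native 8-bit string

# Native 8-bit string and pattern: characters above 127 have no case mapping.
my $native = $str1 =~ /$pat/i ? 'matched' : 'no match';

# Same characters, stored internally as UTF-8: Unicode rules now apply.
my $upgraded = $str1;
utf8::upgrade($upgraded);
my $after = $upgraded =~ /$pat/i ? 'matched' : 'no match';

print "native:   $native\n";
print "upgraded: $after\n";
```

    This assumes neither a locale nor unicode_strings is in effect; with either of those the first match can behave differently.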

Re: international case insensitive searched with Perl
by mwhiting (Beadle) on Oct 04, 2011 at 15:58 UTC
    Thanks for all your suggestions. The problem I'm running into is that this is a client's ISP and their version of Perl, which must be older (see my response to Anonymous Monk above). I'm not sure what version it is, but the 'use UTF8' and other suggestions don't work; they just give errors. I have sent an email to ask if they would upgrade, but I'm not hopeful, knowing how ISPs are.
      If you're using the OS-installed perl, then upgrading it would be unwise. If the ISP has a second install of perl for customer use, great. Otherwise, does your customer have enough space in their account to install their own copy of perl? This could be a great solution. Just don't install it in a directory that is directly accessible by the web server and its clients.

      TJD

Reaped: Re: international case insensitive searched with Perl
by NodeReaper (Curate) on Oct 05, 2011 at 13:04 UTC
Re: international case insensitive searched with Perl
by DrHyde (Prior) on Oct 06, 2011 at 09:42 UTC
    Perhaps you could give an example of what "international characters" are, and tell us what is a non-international character.

      Usually, people call symbols outside the set of the ASCII characters 32 to 126 "international characters". In the pre-Unicode times, the narrow definition was all characters with a code between 128 and 255, in one or more of the several ASCII extensions (ISO Latin-XXX, machine-specific character sets). Now, the narrow definition is all Unicode characters except those also defined as ASCII 32 to 126. The wider definition is and was always "every character used somewhere at some point in time".

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Usually? No they don't. They usually call them "non-ASCII characters" or "Unicode characters" - the latter being inaccurate because the ASCII characters are also in Unicode.

        And even if people did usually call non-ASCII characters "international characters", it's still inaccurate and therefore not helpful, because "d" is both an ASCII and an international character. You can see how international it is by referring to a French, German, English, Spanish, Vietnamese, Polish, etc. dictionary.

Node Type: perlquestion [id://929400]
Approved by toolic
Front-paged by Corion