http://www.perlmonks.org?node_id=416513

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I was wondering if anyone could suggest an elegant way for regular expressions to match all accented forms of letters in a search term. One solution would be to replace any vowels in the search term with all of that character's accented forms. But I was wondering if there was a built-in directive in the regular expressions language that could handle this. Thanks, Jason

Re: Regular expressions and accents
by davido (Cardinal) on Dec 21, 2004 at 17:12 UTC

    If you're using locales (which you probably are, if you're dealing with accented characters) and have the "use locale" pragma in effect, Perl's regular expression engine will usually include those accented characters in the \w metacharacter class. You can use this to your advantage. Here's what you need to match:

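    Any word character that is not a-z, A-Z, a digit, or an underscore. Written as a single negated class, that becomes: any character that is not a nonword character, not a-z or A-Z, not a digit, and not an underscore. That's a mouthful, but here's how it's written: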

    print "$character\n" if $character =~ m/[^\Wa-zA-Z\d_]/;

    That looks a little ugly, so here's a POSIX version that looks cleaner:

    print "$character\n" if $character =~ m/[^[:^alpha:]a-zA-Z]/;

    These solutions are not thoroughly tested, as I'm currently sitting at an older operating system that doesn't have much in the way of locale support.


    Dave

      If you're using locales
      Or have marked your data as unicode:
      $ perl -we'$x = "\xff"; print 0 + $x =~ /\w/'
      0
      $ perl -we'$x = "\xff"; utf8::upgrade($x); print 0 + $x =~ /\w/'
      1
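
      To tie the two ideas together, here is a minimal sketch that applies the negated class from above to a string that has been upgraded to Perl's internal Unicode form, so no locale is needed. The helper name is_non_ascii_word_char is made up for illustration. Note that the class accepts any non-ASCII word character (Greek letters, for instance), not only accented Latin vowels.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $char is a single word character other than
# ASCII a-z, A-Z, a digit, or underscore.
sub is_non_ascii_word_char {
    my ($char) = @_;
    utf8::upgrade($char);    # force Unicode semantics for code points 128..255
    return $char =~ /\A[^\Wa-zA-Z\d_]\z/ ? 1 : 0;
}

# Example: "\xe9" is e-acute in Latin-1 / Unicode.
for my $char ("e", "\xe9", "5", "_") {
    printf "U+%04X: %d\n", ord $char, is_non_ascii_word_char($char);
}
```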
Re: Regular expressions and accents
by Roy Johnson (Monsignor) on Dec 21, 2004 at 17:16 UTC
    Text::Unaccent might be of some help, depending on what your goal is.
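
    If installing a CPAN module is not an option, a similar effect can be sketched with the core Unicode::Normalize module: decompose each character and drop the combining marks. This is a different technique from Text::Unaccent, not its API, and the helper names strip_accents and matches_unaccented are hypothetical. It only handles accents that have a canonical decomposition, so letters such as "ø" pass through unchanged.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: canonical decomposition, then remove the
# nonspacing combining marks, so e-acute becomes plain "e".
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

# Hypothetical helper: substring search with accents ignored on both sides.
sub matches_unaccented {
    my ($term, $candidate) = @_;
    return index(strip_accents($candidate), strip_accents($term)) >= 0 ? 1 : 0;
}

print strip_accents("caf\x{e9}"), "\n";
```

    Stripping both the search term and the text being searched keeps the regex (or index call) itself trivial, at the cost of losing the original accented text in the match.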

    Caution: Contents may have been coded under pressure.
Re: Regular expressions and accents
by graff (Chancellor) on Dec 22, 2004 at 04:31 UTC
    Larry Wall recently posted this nifty little script on the perl-unicode mail list -- here it is, pretty much verbatim (I added the "S" on the shebang line, to make STDIN/STDOUT/STDERR be utf8):
    #!/usr/bin/perl -CS

    $pat = shift;
    if (ord $pat > 256) {
        $pat = sprintf("%04x", ord $pat);
    }
    elsif (ord $pat > 128) {    # arg in sneaky UTF-8
        $pat = sprintf("%04x", unpack("U0U",$pat));
    }

    @names = split /^/, do 'unicore/Name.pl';
    for (@names) {
        if (/$pat/io) {
            $hex = hex($_);
            print chr($hex),"\t",$_;
        }
    }
    The idea is to output a list of Unicode code points (if any) whose names match the expression you put into $ARGV[0] -- here's a relevant command-line usage example (Larry had this script in a file named "uni"):
    uni "latin (?:small|capital) letter A with"
    (update: if you try this, you'll want to be running in a terminal window that handles utf8 characters!)

    So, all you need for what you want is the part that assigns the output of "unicore/Name.pl" to an array -- this gives you the table of Unicode character names -- and grep through the array to get the set of vowels you want. Then convert the four-digit hex code point at the start of each matching element into the character itself, and put those characters into a character-class expression. Something like:

    my @names = split/^/, do 'unicore/Name.pl';
    #...
    my @vowelsets;
    for my $v ( qw/A E I O U/ ) {
        # \b keeps "LETTER A" from also matching names like "LETTER AE"
        push( @vowelsets, join( '', map { chr hex( substr $_, 0, 4 ) }
                                    grep /LATIN (?:SMALL|CAPITAL) LETTER $v\b/, @names ));
    }
    # now you can use each element of @vowelsets as a character class
    # (similarly for consonants...)
    (updated this snippet: changed the map block from a regex to substr; updated a second time to use "chr hex()" in the map block -- each element of @names begins with a four-digit hex code-point value, which needs to be converted to a character.)

    Still a bit cumbersome, I suppose, but quite manageable and not that bulky.
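
    To show how such vowel sets could feed back into the original question, here is a rough sketch that turns a plain search term into an accent-tolerant regex. To keep it self-contained it does not read unicore/Name.pl. Instead it assumes the accented forms can be collected from U+00C0..U+024F by canonical decomposition with the core Unicode::Normalize module, which is a swapped-in technique. The helper name accent_tolerant_regex is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Assumption: accented vowels are gathered from Latin-1 Supplement through
# Latin Extended-B (U+00C0..U+024F) by checking the base letter of each
# character's canonical decomposition.
my %accented;    # lowercase base vowel => string of accented variants
for my $cp (0xC0 .. 0x24F) {
    my $char = chr $cp;
    my $base = substr NFD($char), 0, 1;
    $accented{ lc $base } .= $char if $base =~ /\A[AEIOUaeiou]\z/;
}

# Hypothetical helper: replace each vowel in the term with a character
# class of the vowel plus its accented forms; escape everything else.
sub accent_tolerant_regex {
    my ($term) = @_;
    my $pattern = join '', map {
        my $variants = $accented{ lc $_ };
        defined $variants ? "[$_$variants]" : quotemeta $_;
    } split //, $term;
    return qr/$pattern/i;
}

my $re = accent_tolerant_regex("resume");
print "matched\n" if "r\x{e9}sum\x{e9}" =~ $re;
```

    The text being searched is left untouched, so this suits cases where you need the original accented match back.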
