PerlMonks  

Unaccenting characters

by mwhiting (Beadle)
on Aug 28, 2013 at 16:31 UTC
mwhiting has asked for the wisdom of the Perl Monks concerning the following question:

Hi - I need some help figuring out something which should be fairly simple, in my mind.

I stole some code from a previous post (http://www.perlmonks.org/?node_id=609166) about how to unaccent characters in a string. I don't want to use the Text::Unaccent module, I want to just put in the simplified code suggested by salva in the above article. Here's the code I'm using, modified from his:

    my %table = (
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
        'Ç' => 'C',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
        'Ñ' => 'N',
        'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O',
        'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U',
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
        'ç' => 'c',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
        'ñ' => 'n',
        'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
        'ß' => 'ss',
        'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
        'ÿ' => 'y'
    );
    $str = "Les Misérables";
    $str =~ s/([^\x00-\x7F])/$table{'$1'} || '?'/ge;
    print "str:$str<br>";
Output is:
str:Les Mis?rables
I eliminated the subroutine and the 'shift' command that he had in his code. The code seems to notice that the character is in the right hex range, but it doesn't find the character to replace with.

A second problem: when I use this function on a string coming from the datafile content I will actually be using it on, it replaces it with two question marks, as in: Les Mis??rables. I have seen it convert the accented e to two characters with other methods I have been attempting to use too. Is this something about unicode conversions, using more than one byte to represent something?

Thanks! Michael

Re: Unaccenting characters
by choroba (Abbot) on Aug 28, 2013 at 16:59 UTC
    I noticed several problems:
    1. Single quotes do not interpolate. Use $table{"$1"} or even no quotes at all: $table{$1}.
    2. Tell Perl what encoding your script uses. It should be UTF-8 and you should therefore use utf8;.
    3. If you are reading the data from a file, set the input encoding. You can use either
      open my $IN, '<:utf8', $filename or die $!;

      or

      open my $IN, '<', $filename or die $!; binmode $IN, ':utf8';

      Set the output encoding to UTF-8, too, if you plan to output any accented characters.
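    Points 1 and 2 together might look like the sketch below. The three-entry %table and the sub name unaccent_table are illustrative stand-ins, not from the original post; the full table from the question would go in their place.

```perl
use strict;
use warnings;
use utf8;                              # point 2: this source file is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';    # output encoding, in case an accented character survives

# Abbreviated, hypothetical table; the full one from the question would go here.
my %table = ( 'é' => 'e', 'ç' => 'c', 'ß' => 'ss' );

sub unaccent_table {
    my ($str) = @_;
    # Point 1: $table{$1}, not $table{'$1'}. Single quotes look up the literal key '$1'.
    $str =~ s{([^\x00-\x7F])}{ exists $table{$1} ? $table{$1} : '?' }ge;
    return $str;
}

print unaccent_table("Les Misérables"), "\n";    # Les Miserables
print unaccent_table("Ærø"), "\n";               # ?r? (neither letter is in the short table)
```

    The exists test replaces the original || so that a table entry mapping to '0' or the empty string would not fall through to '?'.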

    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Hmmm, but I don't know what kind of input I'm getting. I have the 'guess' function running just before this part of the script to determine if I need to encode into UTF8 first or not. Will setting the input encoding to be UTF8 change the input into UTF8, or just tell the server to expect UTF8?
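      On the layer question: the input layer does not rewrite the file. It decodes the bytes as they are read, so the program sees one character per accented letter instead of two bytes. That would also account for the doubled ?? in the original question. A small sketch, using Encode::decode on a hard-coded byte string to stand in for what the layer does on read (the sample bytes are an assumption about the data file):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "Les Mis\xC3\xA9rables";      # UTF-8 bytes, as read without any layer
my $chars = decode('UTF-8', $bytes);      # decoded characters, as an input layer would deliver

printf "bytes: %d, chars: %d\n", length $bytes, length $chars;   # bytes: 15, chars: 14

# The same substitution sees two non-ASCII units before decoding, one after.
(my $undecoded = $bytes) =~ s/[^\x00-\x7F]/?/g;
(my $decoded   = $chars) =~ s/[^\x00-\x7F]/?/g;

print "$undecoded\n";    # Les Mis??rables
print "$decoded\n";      # Les Mis?rables
```

      If the input could also arrive as Latin-1, it would have to be decoded with that encoding instead; decoding only says how to interpret the bytes, so it has to match what the file really contains.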
Re: Unaccenting characters
by moritz (Cardinal) on Aug 28, 2013 at 19:06 UTC

    I'm not a big fan of such big tables, so instead I'd propose this:

    use 5.010;
    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize qw/NFKD/;

    sub unaccent {
        my $s = NFKD shift;
        $s =~ s/\pM//g;
        return $s;
    }

    say unaccent "Les Misérables";
    __END__
    Output: Les Miserables

    The NFKD normalization form (like NFD, plus compatibility decompositions) splits the base character and the accent into two separate characters, and the substitution then removes all the marks (\pM).

    (And Unicode::Normalize is a core module since perl 5.8, and you really, really don't want to use anything older than that for Unicode stuff).
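    To make the decomposition concrete, here is a small sketch that prints the code points before and after NFD, and shows one limitation: letters such as ß or ø have no decomposition, so stripping marks leaves them in place. The helper name codepoints and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFKD);

binmode STDOUT, ':encoding(UTF-8)';

# List the code points of a string, e.g. "U+0065 U+0301".
sub codepoints {
    my ($s) = @_;
    return join ' ', map { sprintf('U+%04X', ord $_) } split //, $s;
}

my $e_acute = "\x{E9}";                      # é as one precomposed character
print codepoints($e_acute), "\n";            # U+00E9
print codepoints(NFD($e_acute)), "\n";       # U+0065 U+0301 (e + combining acute)

# Stripping marks after NFKD handles é, but ß and ø pass through unchanged.
(my $stripped = NFKD("caf\x{E9} Stra\x{DF}e \x{F8}l")) =~ s/\pM//g;
print "$stripped\n";                         # cafe Straße øl
```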

      Thanks, I will try that. What is the 'shift' supposed to do in the code? I know what it does in general, but it was in the original code, and now here, and I don't quite see how it fits in.
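      On the shift question: inside a sub, a bare shift removes and returns the first element of @_, that is, the first argument. So "my $s = NFKD shift;" reads as NFKD(shift(@_)). A tiny sketch with a made-up sub name:

```perl
use strict;
use warnings;

sub shout {
    my $s = uc shift;      # same as: my $s = uc(shift(@_));
    return "$s!";
}

print shout("les miserables"), "\n";    # LES MISERABLES!
```

      Outside a sub, shift works on @ARGV instead, so when the subroutine wrapper is removed the shift should be replaced by the variable that holds the string.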

Re: Unaccenting characters
by Corion (Pope) on Aug 28, 2013 at 20:53 UTC

    Also consider Text::Unidecode. It has the property of also transliterating Chinese characters. Whether that is wanted is a different question.

Re: Unaccenting characters
by Laurent_R (Vicar) on Aug 28, 2013 at 22:17 UTC

    $str =~ s/([^\x00-\x7F])/$table{'$1'} || '?'/ge;

    Are you planning to write a line like that for each of your accented letters? But then, what is the point of the %table hash? You might as well hard code everything (this is not what I am recommending).

    Depending on how your file is really encoded, the tr/// function might be much easier to use and probably faster. Something like this (to be completed):

    $str =~ tr/àâäçéèêë/aaaceeee/;

    There are a number of cases where this simple tr/// function works well. If not, well, then the Unicode modules described by others.

    There is a last point, though, which I can see as a problem, and which is in fact the main reason why I am posting here. "Unaccenting" letters may be less trivial than you may think. In French, all letters with an accent can be "unaccented" by just taking the same letter without the accent; this is the common way of doing things when accents are not available. But in German, ä is usually rendered by ae, ö by oe and ü by ue. Similarly, I would tend to believe that the Scandinavian or other languages which have an 'å' probably don't translate it into an 'a'. This is just a warning that, depending on what exactly you are trying to do, your contemplated solution might be a bit simplistic.
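    One way to accommodate such language-specific conventions, sketched here under the assumption that German rules are wanted, is to apply an explicit mapping first and only then fall back to generic mark stripping. The %german table and the sub name unaccent_german are illustrative, not part of any module.

```perl
use strict;
use warnings;
use utf8;                                  # the literals below are UTF-8 in the source
use Unicode::Normalize qw(NFKD);

# Hypothetical German-specific transliterations, applied before the generic step.
my %german = (
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
    'ß' => 'ss',
);

sub unaccent_german {
    my ($s) = @_;
    $s =~ s/([äöüÄÖÜß])/$german{$1}/g;     # language-specific pairs first
    $s = NFKD($s);                         # then decompose whatever is left
    $s =~ s/\pM//g;                        # and drop the combining marks
    return $s;
}

print unaccent_german("Müller, Straße, café"), "\n";    # Mueller, Strasse, cafe
```

    The order matters: running the generic step first would already have turned ü into u, leaving nothing for the German table to match.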

    Update: crossed out the first paragraph following choroba's comment. I had misread the regex.

      Are you planning to write a line like that for each of your accented letters?
      Meditate about the substitution a bit more. Remember negative character classes. There is nothing more to be written.
      لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Right, I looked at it too quickly and missed the point.

           Depending on how your file is really encoded, the tr/// function might be much easier to use and probably faster. Something like this (to be completed):

           $str =~ tr/àâäçéèêë/aaaceeee/;

      I originally used the tr function like you suggested. Why it didn't work so well, I'm not sure. I had:

      $_ =~ tr/àáâãäçèéêëìíîïñòóôõöùúûüÿ/aaaaaceeeeiiiinooooouuuuy/;
      but it produced a two-character result, similar to the ?? in my original question above. However, the first output character was an 'a' regardless of what the result was supposed to be, followed by an unprintable character placeholder which I can't duplicate here. Like the above code, it recognized which character to replace, but then didn't do it right.
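      A possible explanation, offered as a guess since neither the script's nor the file's encoding is shown: without use utf8 and without decoding the input, tr/// works on bytes. Every character in that list is two bytes in UTF-8, all starting with \xC3, so the lead byte is mapped to the first replacement ('a') and the second byte is either mapped to some unrelated letter or left as a stray byte. The sketch below reproduces the symptom with a hard-coded byte string (an assumption about the input), then shows tr/// working once the source is declared UTF-8 and the data is decoded.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode);

my $raw = "Les Mis\xC3\xA9rables";          # undecoded UTF-8 bytes (assumed input)

# Byte-level effect: only the lead byte \xC3 is translated.
(my $broken = $raw) =~ tr/\xC3/a/;
printf "%vX\n", substr($broken, 7, 2);      # 61.A9 ('a' plus a stray byte)

# Character-level: decode first, then translate whole characters.
(my $fixed = decode('UTF-8', $raw)) =~ tr/àâäçéèêë/aaaceeee/;
print "$fixed\n";                           # Les Miserables
```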

      Thanks for the tip on German, Scandinavian, etc. languages where they don't translate 1:1, although I'm not so worried about that, at least not at the moment. :)
