kepler has asked for the wisdom of the Perl Monks concerning the following question:


I'm having a hard time with this subject: I'm trying to transform a string that might contain accented or other special characters, but without any luck. I've tried using Unicode::Normalize:

use Unicode::Normalize;
use Encode;
$str = "";
$string = decode("ISO-8859-1", $str); #windows-1250
$string = NFD($string);
$string =~ s/\pM//og;

It works on my laptop, but not on my web server. I've installed the module (I got no errors). In the substitution of "porto de ms", for instance, I get "porto de mas". The accented characters are always replaced by an "a"... Any ideas? I've also tried Text::Unidecode, but I get even weirder characters...

Kind regards, Kepler

Re: Accented characters and others...
by Corion (Pope) on Oct 25, 2015 at 07:32 UTC

    Most likely, what you have is not encoded in the character set you think, then.

    I found Text::Unidecode to work pretty well, provided that its input is correctly encoded UTF-8.

    I recommend that you start with explicitly encoded strings and see if these work for you:

    #!perl
    use strict;
    use Text::Unidecode;
    my $str = "\N{LATIN SMALL LETTER A WITH ACUTE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{LATIN SMALL LETTER O WITH GRAVE}";
    print unidecode($str); # aco

    If that works for you, you can now take slow steps in the direction of reading data and properly calling Encode::decode() on it, while trying to find the appropriate character set(s) that the data is provided in.
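
    As a rough sketch of that next step, the snippet below decodes raw bytes with an explicit charset and then strips combining marks, as in the original question. The helper name strip_accents and the sample word are made up for illustration, and the input is built from escapes so the encoding of the script file itself plays no part. It also shows what can happen when UTF-8 bytes are decoded as ISO-8859-1: a stray "A" appears where the accented letter was, which resembles the symptom described in the question, though that is only a guess at the cause.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not from the thread): decode raw bytes with the
# given charset, decompose, and drop the combining marks.
sub strip_accents {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);   # bytes -> Perl characters
    $text = NFD($text);                    # a-acute -> "a" + U+0301
    $text =~ s/\p{M}//g;                   # remove combining marks
    return $text;
}

# Sample input: "m", a-acute, "s" as UTF-8 bytes (6d c3 a1 73).
my $utf8_bytes = encode('UTF-8', "m\x{E1}s");

# Decoded with the charset the bytes are really in.
print strip_accents($utf8_bytes, 'UTF-8'), "\n";   # mas

# Decoded with the wrong charset: c3 becomes A-tilde, which NFD splits
# into "A" + U+0303, and a1 becomes U+00A1. The result is "mA\x{A1}s".
print encode('UTF-8', strip_accents($utf8_bytes, 'ISO-8859-1')), "\n";
```

    If the second line looks like what the web server produces, the data is probably arriving in a different charset than the one passed to decode().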

Re: Accented characters and others...
by stevieb (Canon) on Oct 25, 2015 at 04:14 UTC
    I'm not a unicode person, but if this script outputs differently on separate systems, I'd ask you to post the platforms, along with the relevant parts of perl -V on each one. Those who can help may find this version info helpful.
Re: Accented characters and others...
by Anonymous Monk on Oct 25, 2015 at 15:39 UTC
    Use a monitoring tool, maybe Wireshark, that can show you the ACTUAL bytes that are being exchanged between client and server ... as in "hexadecimal."

      And then what? What if the bytes correspond exactly to the charset and what is shown, which is what's going to happen? Some code to diagnose? Some ideas for what to expect? Why Wireshark when all the modern browsers have trustworthy dev panels, or xxd|od + curl|wget give you easier, faster access to the information? The exact hex codes you predict will match? YOU just can't stop yourself. Probably it was another lost session. You will get slightly more upvotes this way. It will contribute to thinking this is a conspiracy. That'll be wrong.
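
For the "code to diagnose" and "what to expect" questions raised above, one possible sketch is to dump the bytes from inside Perl itself before any decode() call. The helper name hex_bytes and the sample word are assumptions made for illustration. The point is only that the same accented letter shows up as one byte in ISO-8859-1 and as two bytes in UTF-8, so the hex output hints at which charset the data is really in.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): render a byte string as
# space-separated two-digit hex values.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The same word, "m" + a-acute + "s", as it would arrive in two encodings.
my $latin1 = "m\xE1s";       # ISO-8859-1: a-acute is the single byte e1
my $utf8   = "m\xC3\xA1s";   # UTF-8: a-acute is the byte pair c3 a1

print hex_bytes($latin1), "\n";   # 6d e1 73
print hex_bytes($utf8),   "\n";   # 6d c3 a1 73
```

Running the same dump on the laptop and on the web server, on the string exactly as the script receives it, should show whether the two machines are being fed different bytes.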