http://www.perlmonks.org?node_id=1019024

spspspsp has asked for the wisdom of the Perl Monks concerning the following question:

Hi, how do I make sure accented characters are interpreted correctly? For example, when I view the URL http://freegeoip.net/xml/202.168.253.162 in my browser, the accented characters display just fine, e.g. <RegionName>Dhāka</RegionName>

However, when I fetch this URL in my Perl script, the accented character comes out garbled. In fact, when I fetch the URL with curl on a Linux shell prompt, it still comes out garbled.

So the question is how do I make sure that I see the accented characters properly in my script? Thank you very much!

Re: accented characters are garbled
by Anonymous Monk on Feb 16, 2013 at 09:35 UTC
Re: accented characters are garbled
by Anonymous Monk on Feb 16, 2013 at 12:05 UTC

    I'd wager your first problem is getting your terminal encoding correct. Since the file is UTF-8, and you can't "view" it with curl, your terminal encoding most probably isn't UTF-8. Which terminal are you using? A Linux one or something like PuTTY? Poke around the options a bit.

    (ā is a tricky character: it can't be found in the usual legacy latin encodings. That means you can't translate it to a latin encoding -- your best bet is to get UTF-8 working correctly and forget about playing with other character encodings.)
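A minimal sketch of the point above, assuming only the core Encode module: "ā" (U+0101) encodes to two bytes in UTF-8 but has no representation in ISO-8859-1. The variable names are illustrative, and the character is written as an escape so the snippet does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0101 LATIN SMALL LETTER A WITH MACRON, written as an escape so the
# example does not depend on the encoding of this source file.
my $region = "Dh\x{101}ka";

# UTF-8 can represent any code point; "ā" becomes the two bytes C4 81.
my $utf8_bytes = encode('UTF-8', $region);

# ISO-8859-1 has no slot for U+0101. With FB_CROAK, encode() dies instead
# of silently substituting a placeholder. A copy is passed because a CHECK
# argument allows encode() to modify its source string.
my $latin1_ok = eval {
    my $copy = $region;
    encode('ISO-8859-1', $copy, Encode::FB_CROAK());
    1;
};

printf "UTF-8 bytes: %s\n",
    join(' ', map { sprintf('%02X', ord($_)) } split(//, $utf8_bytes));
print "Latin-1 encoding ", ($latin1_ok ? "succeeded" : "failed"), "\n";
```

On a correctly configured system this should report the bytes 44 68 C4 81 6B 61 and a failed Latin-1 encoding, which is why getting UTF-8 working end to end is the simpler route.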

      Yes, it was the terminal setting. Changing vt100 to xterm shows the characters fine. Now, how do I replace accented characters with ASCII? E.g. ó to o. I tried the following, but it did not work:

      $city = "Sprîngfíèld";
      use utf8;
      utf8::upgrade($city);
      utf8::encode($city);
      print $city;

        So if you got the display working, why do you now want to strip the diacritics?

        Anyway, Text::Unidecode. And while I'm at it, here's the boilerplate code for getting Perl reasonably UTF-8-aware:

        use utf8;    # tells perl that this source file (and its string literals) is UTF-8
        my $city = "Sprîngfíèld";
        binmode(STDOUT, ":encoding(utf-8)");
        print $city, "\n";

        # use decode_utf8() when reading from e.g. a file
        # alternatively, see the binmode() call above
        use Encode 'decode_utf8';
        my $input_raw = <STDIN>;
        my $input = decode_utf8($input_raw);
        print $input, "\n";
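For the "ó to o" question, here is a hedged sketch that does not use Text::Unidecode. It swaps in a different technique, using the core Unicode::Normalize module: decompose each character (NFD), then delete the combining marks. The helper name strip_diacritics is hypothetical, and the city name is written with escapes so the snippet does not depend on the source file encoding.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split each accented letter into base letter plus
# combining mark(s), then remove the marks (Unicode category Mn).
sub strip_diacritics {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;
    return $decomposed;
}

# "Sprîngfíèld" spelled with code point escapes (î = EE, í = ED, è = E8).
my $city = "Spr\x{EE}ngf\x{ED}\x{E8}ld";

print strip_diacritics($city), "\n";
```

This only handles letters that decompose into a base letter plus marks. Characters such as "ø" or "ß" have no such decomposition and pass through unchanged, which is one reason a transliteration module like Text::Unidecode may still be the better fit.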
Re: accented characters are garbled
by 2teez (Vicar) on Feb 16, 2013 at 10:50 UTC

      You can check 'use utf8', or binmode.

      1. 'use utf8' lets you write UTF-8 characters in your source code while you are writing your program; it does not affect UTF-8 text read from an outside source.
      2. binmode() is for reading binary data, i.e. data that consists of single bytes; and binmode turns off newline conversions. The OP doesn't want to read binary data; the OP wants to read UTF-8 characters, which can be multiple bytes long, and there is no reason for the OP to turn off newline conversions.

        binmode can do more than turning off newline conversion. It can set I/O layers, like :utf8 and :encoding(utf8). See binmode for more details, and the difference between the two layers.

        Of course, it is also possible to set up the required I/O layers directly in open, using the three-argument form. See open.
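Both ways of setting a layer can be sketched as follows, assuming a scratch file created with the core File::Temp module (the file and variable names are illustrative). The file is written through a layer given to three-argument open, then read back once as raw bytes and once with the layer applied afterwards by binmode.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $region = "Dh\x{101}ka";    # "Dhāka": 5 characters

# Scratch file for the demonstration; removed automatically at exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Layer given directly in three-argument open: characters are encoded
# to UTF-8 bytes on output.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $region;
close($out) or die "Cannot close $path: $!";

# Reading with :raw yields the undecoded bytes.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close($raw);

# Layer applied after opening, via binmode: bytes are decoded to characters.
open(my $in, '<', $path) or die "Cannot read $path: $!";
binmode($in, ':encoding(UTF-8)');
my $chars = do { local $/; <$in> };
close($in);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

The byte count is one higher than the character count because "ā" occupies two bytes in UTF-8; without a decoding layer, Perl would treat those two bytes as two separate characters.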

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)