
accented characters are garbled

by spspspsp (Initiate)
on Feb 16, 2013 at 09:19 UTC (#1019024)
spspspsp has asked for the wisdom of the Perl Monks concerning the following question:

Hi, How do I make sure the accented characters are interpreted correctly? For example, when I view this URL in my browser, I see accented characters just fine, e.g. <RegionName>Dhāka</RegionName>

However, when I fetch this URL in my Perl script, the accented character comes out garbled. In fact, when I fetch the URL with curl on a Linux shell prompt, it still comes out garbled.

So the question is how do I make sure that I see the accented characters properly in my script? Thank you very much!

Re: accented characters are garbled
by Anonymous Monk on Feb 16, 2013 at 09:35 UTC
Re: accented characters are garbled
by Anonymous Monk on Feb 16, 2013 at 12:05 UTC

    I'd wager your first problem is getting your terminal encoding correct. Since the file is UTF-8, and you can't "view" it with curl, your terminal encoding most probably isn't UTF-8. Which terminal are you using? A Linux one or something like PuTTY? Poke around the options a bit.

    (ā is a tricky character: it can't be found in the usual legacy Latin encodings such as Latin-1 (ISO-8859-1) or Windows-1252. That means you can't translate it to one of those encodings -- your best bet is to get UTF-8 working correctly and forget about playing with other character encodings.)
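To illustrate the point about legacy encodings, here is a minimal sketch using the core Encode module. The helper name `fits_in_latin1` is made up for this example; it simply asks Encode to convert a string to ISO-8859-1 and reports whether that succeeded.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if every character of $text can be
# represented in ISO-8859-1 (Latin-1), false otherwise.
sub fits_in_latin1 {
    my ($text) = @_;
    my $copy = $text;    # encode() may modify its argument when a CHECK value is given
    # FB_CROAK makes encode() die on the first unmappable character.
    return eval { encode('iso-8859-1', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# e-acute (U+00E9) is part of Latin-1; a-macron (U+0101) is not.
print fits_in_latin1("caf\x{e9}")   ? "fits\n" : "does not fit\n";
print fits_in_latin1("Dh\x{101}ka") ? "fits\n" : "does not fit\n";
```

The strings are written with `\x{...}` escapes so the sketch does not depend on how the source file itself is encoded.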

      Yes, it was the terminal setting. Changing vt100 to xterm shows the characters fine. Now, how do I replace accented characters with ASCII? E.g. an accented o to a plain o. I tried the following, but it did not work:
      $city = "Sprngfld";
      use utf8;
      utf8::upgrade($city);
      utf8::encode($city);
      print $city;

        So if you got the display working, why do you now want to strip the diacritics?

        Anyway, Text::Unidecode. And while I'm at it, here's the boilerplate code for getting Perl reasonably UTF-8:

        use utf8; # tells Perl that this source file is UTF-8
        my $city = "Sprngfld";
        binmode(STDOUT, ":encoding(utf-8)");
        print $city, "\n";

        # use decode_utf8() when reading from e.g. a file
        # alternatively, see the binmode() call above
        use Encode 'decode_utf8';
        my $input_raw = <STDIN>;
        my $input = decode_utf8($input_raw);
        print $input, "\n";
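Text::Unidecode is a CPAN module and may not be installed everywhere. As a rough alternative that needs only core modules, the sketch below decomposes the string with Unicode::Normalize and drops the combining marks. This is a different technique from Text::Unidecode, not a replacement for it: it only handles letters that decompose into a base letter plus accent, so characters such as ø or ß pass through unchanged. The helper name `strip_marks` is made up for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split accented letters into base letter plus
# combining mark (NFD), then delete the combining marks (category Mn).
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

binmode(STDOUT, ':encoding(UTF-8)');
print strip_marks("Dh\x{101}ka"), "\n";    # a-macron becomes a plain a
```

The input must already be a decoded character string (for example, after decode_utf8()); running this on raw UTF-8 bytes will not do what you want.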
Re: accented characters are garbled
by 2teez (Priest) on Feb 16, 2013 at 10:50 UTC

      You can check 'use utf8', or binmode.

      1. 'use utf8' enables you to use UTF-8 characters while you are writing your program, i.e. in your source code. It does not let you read UTF-8 text from an outside source.
      2. binmode() is for reading binary data, i.e. data that consists of single bytes; and binmode turns off newline conversions. The OP doesn't want to read binary data, the OP wants to read UTF-8 characters, which can be multiple bytes long; and there is no reason for the OP to turn off newline conversions.

        binmode can do more than turn off newline conversion. It can set I/O layers, like :utf8 and :encoding(utf8). See binmode for more details, and the difference between the two layers.

        Of course, it is also possible to set up the required I/O layers directly in open, using the three-argument form. See open.
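A minimal sketch of the three-argument open with an encoding layer, assuming the data is UTF-8. The helper name `roundtrip_utf8` and the use of a scratch file from File::Temp are only for illustration; in a real script the layer would go on whatever handle you read the fetched data from.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to a scratch file and read it back,
# with a UTF-8 layer on both handles.
sub roundtrip_utf8 {
    my ($text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # The layer is given in the mode argument of the three-argument open.
    open(my $out, '>:encoding(UTF-8)', $path) or die "open for write: $!";
    print {$out} $text;
    close $out or die "close: $!";

    open(my $in, '<:encoding(UTF-8)', $path) or die "open for read: $!";
    my $back = do { local $/; <$in> };    # slurp the whole file
    close $in;
    return $back;
}

my $city = roundtrip_utf8("Dh\x{101}ka");

# binmode sets the same kind of layer on a handle that is already open.
binmode(STDOUT, ':encoding(UTF-8)');
print length($city), " characters: $city\n";
```

With the layer in place, length() counts characters rather than bytes, which is a quick way to check that the input was actually decoded.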


        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
