
accented characters are garbled

by spspspsp (Initiate)
on Feb 16, 2013 at 09:19 UTC (#1019024=perlquestion)
spspspsp has asked for the wisdom of the Perl Monks concerning the following question:

Hi, how do I make sure accented characters are interpreted correctly? For example, when I view this URL in my browser, the accented characters display just fine, e.g. <RegionName>Dhāka</RegionName>

However, when I fetch this URL in my Perl script, the accented character comes out garbled. In fact, when I fetch the URL with curl on a Linux shell prompt, it still comes out garbled.

So the question is how do I make sure that I see the accented characters properly in my script? Thank you very much!

Re: accented characters are garbled
by Anonymous Monk on Feb 16, 2013 at 09:35 UTC
Re: accented characters are garbled
by Anonymous Monk on Feb 16, 2013 at 12:05 UTC

    I'd wager your first problem is getting your terminal encoding correct. Since the file is UTF-8, and you can't "view" it with curl, your terminal encoding most probably isn't UTF-8. Which terminal are you using? A Linux one or something like PuTTY? Poke around the options a bit.

    (ā is a tricky character: it can't be found in the usual legacy latin encodings. That means you can't translate it to a latin encoding -- your best bet is to get UTF-8 working correctly and forget about playing with other character encodings.)
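
To illustrate the point about legacy encodings, here is a minimal sketch using the core Encode module. The helper name representable_in is hypothetical (it is not part of Encode); it simply asks Encode to fail loudly when a character has no mapping in the target encoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if every character of $text can be
# encoded in $encoding, 0 otherwise.
sub representable_in {
    my ($text, $encoding) = @_;
    # FB_CROAK makes encode() die on an unmappable character;
    # LEAVE_SRC stops encode() from modifying $text in place.
    my $ok = eval {
        Encode::encode($encoding, $text, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

my $a_macron = "\x{101}";   # LATIN SMALL LETTER A WITH MACRON
my $e_acute  = "\x{E9}";    # LATIN SMALL LETTER E WITH ACUTE

for my $encoding (qw(iso-8859-1 cp1252 UTF-8)) {
    printf "%-10s  a-macron: %d  e-acute: %d\n",
        $encoding,
        representable_in($a_macron, $encoding),
        representable_in($e_acute,  $encoding);
}
```

With ISO-8859-1 and cp1252 the a-macron column should come out as 0 while the e-acute column is 1; UTF-8 can represent both.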

      Yes, it was the terminal setting. Changing vt100 to xterm shows the characters fine. Now, how do I replace accented characters with their ASCII equivalents, e.g. an accented o with a plain o? I tried the following, but it did not work:

      $city = "Sprngfld";
      use utf8;
      utf8::upgrade($city);
      utf8::encode($city);
      print $city;

        So if you got the display working, why do you now want to strip the diacritics?

        Anyway, Text::Unidecode. And while I'm at it, here's the boilerplate code for getting Perl reasonably UTF-8:

        use utf8; # upgrades your strings
        my $city = "Sprngfld";
        binmode(STDOUT, ":encoding(utf-8)");
        print $city, "\n";

        # use decode_utf8() when reading from e.g. a file
        # alternatively, see the binmode() call above
        use Encode 'decode_utf8';
        my $input_raw = <STDIN>;
        my $input = decode_utf8($input_raw);
        print $input, "\n";
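
For stripping diacritics without installing anything from CPAN, a rough alternative to Text::Unidecode is to decompose the string with the core Unicode::Normalize module and delete the combining marks. This is a different technique from what Text::Unidecode does, and the helper name strip_diacritics is made up for this sketch.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);   # core module

# Hypothetical helper: decompose, then drop the combining marks.
sub strip_diacritics {
    my ($text) = @_;
    my $decomposed = NFD($text);  # "\x{101}" becomes "a" followed by U+0304
    $decomposed =~ s/\p{Mn}//g;   # remove nonspacing (combining) marks
    return $decomposed;
}

# Characters are written as escapes so the example does not depend on
# the encoding of the source file.
my $city = "Dh\x{101}ka";         # "Dhāka"

binmode(STDOUT, ':encoding(UTF-8)');
print strip_diacritics($city), "\n";   # prints "Dhaka"
```

This only handles letters that have a canonical decomposition. Characters such as ø or ß pass through unchanged, which is one reason to prefer Text::Unidecode when a pure-ASCII result is required.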
Re: accented characters are garbled
by 2teez (Vicar) on Feb 16, 2013 at 10:50 UTC

      You can check 'use utf8', or binmode.

      1. 'use utf8' enables you to use utf8 characters while you are writing your program, i.e. in your source code--not to read utf8 text from an outside source.
      2. binmode() is for reading binary data, i.e. data that consists of single bytes; and binmode turns off newline conversions. The OP doesn't want to read binary data, the OP wants to read utf8 characters, which can be multiple bytes long; and there is no reason for the OP to turn off newline conversions.

        binmode can do more than turning off newline conversion. It can set I/O layers, like :utf8 and :encoding(utf8). See binmode for more details, and the difference between the two layers.

        Of course, it is also possible to set up the required I/O layers directly in open, using the three-argument form. See open.
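
A small sketch of both approaches, assuming the data is UTF-8. It writes a temporary file through an :encoding(UTF-8) layer given to three-argument open, then reads it back once with the layer and once as raw bytes to show the difference. The temporary file and variable names are only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $city = "Dh\x{101}ka";   # "Dhāka": 5 characters, 6 bytes in UTF-8

# Create a scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Writing: the layer encodes characters into UTF-8 bytes.
open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
print {$out} $city, "\n";
close $out or die "Cannot close $path: $!";

# Reading with the layer: bytes are decoded back into characters.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot read $path: $!";
my $decoded = <$in>;
chomp $decoded;
close $in;

# Reading without decoding: the same data as raw bytes.
open(my $raw, '<:raw', $path) or die "Cannot read $path: $!";
my $bytes = <$raw>;
chomp $bytes;
close $raw;

printf "decoded length: %d, raw length: %d\n", length $decoded, length $bytes;

# binmode sets the layer on a handle that is already open, such as STDOUT.
binmode(STDOUT, ':encoding(UTF-8)');
print $decoded, "\n";
```

The decoded string has length 5 and the raw one length 6, because ā occupies two bytes in UTF-8.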


        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Node Type: perlquestion [id://1019024]
Approved by Corion