JSON, Data::Dumper and accented chars in utf-8

by silentius (Scribe)
on Jan 21, 2012 at 20:39 UTC (#949179)
silentius has asked for the wisdom of the Perl Monks concerning the following question:

Venerable Monks,

Please enlighten me, for I am still in the dark after a few hours of reading the modules' documentation and experimenting.

I have this code:

    while ($line = <IN>) {
        $strut = from_json($line);
        print Dumper($strut) . "\n";
    }

The file I am reading from contains JSON records and is a text/plain; charset=utf-8 text file.

The problem is that some records in that file have keys containing accented chars, namely "Particípio Passado", and when I output those records through the Data::Dumper module they come out as:

"Partic\x{c3}\x{ad}pio Passado" => [ 'abasinado' ],

If I do a normal print of the keys, they come right, they come as "Particípio Passado", without the \x{c3}\x{ad} encoding.

Also, with this code: print $strut->{'Particípio Passado'}->[0]."\n"; I can access the records' contents, so the keys seem properly encoded.

But when I output through Data::Dumper they come as "Partic\x{c3}\x{ad}pio Passado", so, venerable monks, my question is: how do I make the records output through Data::Dumper come out as "Particípio Passado" instead of "Partic\x{c3}\x{ad}pio Passado"?

Gratefully, humbly and respectfully I appreciate and thank any enlightenment on this.

Re: JSON, Data::Dumper and accented chars in utf-8
by ikegami (Pope) on Jan 21, 2012 at 21:00 UTC

    how do I make the records output through Data::Dumper come out as "Particípio Passado" instead of "Partic\x{c3}\x{ad}pio Passado"?

    The string isn't "Particípio Passado"; it's the UTF-8 encoding of "Particípio Passado". If you had "Particípio Passado", Data::Dumper would print one of the following (depending on a couple of factors):

    $VAR1 = "Particípio Passado";
    $VAR1 = "Partic\x{ed}pio Passado";
    $VAR1 = "Partic\355pio Passado";

    (The first would only show up correctly if you properly encode your output. We'll get back to that.)
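To make the distinction between the string and its UTF-8 encoding concrete, here is a minimal sketch using the core Encode module. The variable names are illustrative, and the byte string is typed out by hand to stand in for what the script received; it is not taken from the OP's file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes: "\xc3\xad" is the two-byte encoding of U+00ED (i with acute).
my $bytes = "Partic\xc3\xadpio Passado";

# Decoding turns the bytes into a string of characters.
my $text = decode('UTF-8', $bytes);

printf "%d bytes, %d characters\n", length $bytes, length $text;

# The seventh character (index 6) is now a single code point.
printf "code point at index 6: U+%04X\n", ord substr($text, 6, 1);

# Encoding the character string again gives back the original bytes.
print "round trip ok\n" if encode('UTF-8', $text) eq $bytes;
```

Data::Dumper is showing the first of these two strings, which is why the two bytes appear as two separate escapes.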

    If you're using JSON::XS, this problem would arise if you didn't call ->utf8.

    my $json_parser = JSON::XS->new->utf8;
    my $data = $json_parser->decode($json_string);
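The effect of ->utf8 can be seen by decoding the same bytes both ways. This is only a sketch: it uses the core JSON::PP module as a stand-in for JSON::XS, on the assumption that the two share the ->utf8 and ->decode interface, and the JSON record is hand-typed rather than taken from the OP's file.

```perl
use strict;
use warnings;
use JSON::PP;   # core module, used here as a stand-in for JSON::XS

# One JSON record as raw UTF-8 bytes, as read from a handle with no
# encoding layer. "\xc3\xad" is the UTF-8 encoding of U+00ED.
my $bytes = qq/{"Partic\xc3\xadpio Passado":["abasinado"]}/;

# Without ->utf8 the parser treats every input byte as one character.
my ($raw_key) = keys %{ JSON::PP->new->decode($bytes) };

# With ->utf8 the parser decodes the UTF-8 input into characters first.
my ($text_key) = keys %{ JSON::PP->new->utf8->decode($bytes) };

printf "without ->utf8: key has %d characters\n", length $raw_key;
printf "with    ->utf8: key has %d characters\n", length $text_key;
```

As far as I know, the JSON module's from_json behaves like the first call (it expects already-decoded text) and decode_json like the second (it expects UTF-8 bytes), which would explain the OP's result.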

    If I do a normal print of the keys, they come right

    Given that the string is wrong, that indicates there are other problems (i.e., two wrongs made a right).

    First, I bet your Perl source file is encoded using UTF-8, but you didn't tell Perl that using

    use utf8;
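A small sketch of what that pragma changes, assuming the script itself is saved as UTF-8. The variable names are made up for the example, and the "JSON key" is hand-typed to mimic the un-decoded key from the question.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.
my $as_bytes = do { no utf8;  'Particípio Passado' };  # literal kept as raw bytes
my $as_chars = do { use utf8; 'Particípio Passado' };  # literal decoded to characters

printf "no utf8:  length %d\n", length $as_bytes;
printf "use utf8: length %d\n", length $as_chars;

# An un-decoded key, like the one from_json produced in the question.
my $json_key = "Partic\xc3\xadpio Passado";

print "byte literal matches the un-decoded key\n" if $as_bytes eq $json_key;
print "char literal matches the un-decoded key\n" if $as_chars eq $json_key;
```

This is the "two wrongs" effect: without use utf8 the literal in the hash lookup is as un-decoded as the key, so the lookup succeeds.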

    Secondly, I bet you have a UTF-8 terminal, but you didn't tell Perl that using one of

    use open ':std', 'locale';
    use open ':std', ':encoding(UTF-8)';
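What such a layer does on output can be checked without a terminal by writing through the same kind of layer into an in-memory buffer. This is a sketch only: the buffer and variable names are illustrative, and it applies the layer to one handle rather than to STDOUT as the pragma does.

```perl
use strict;
use warnings;

my $text = "Partic\x{ed}pio";   # a character string; \x{ed} is i with acute

# Write through an :encoding(UTF-8) layer into an in-memory buffer so the
# bytes that would reach the terminal can be inspected.
open my $fh, '>:encoding(UTF-8)', \my $buffer or die "open failed: $!";
print {$fh} $text;
close $fh or die "close failed: $!";

printf "%d characters in, %d bytes out\n", length $text, length $buffer;
printf "bytes: %s\n", join ' ', map { sprintf '%02x', ord } split //, $buffer;
```

With the layer in place, a properly decoded string is encoded exactly once on the way out, so the raw-bytes workaround is no longer needed.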

      I have a feeling that Dumper shouldn’t actually produce ambiguous output, so escaped something is what we should expect from it.

      So unless I did something horribly wrong, this is the input (sprinkled with a Unicode character outside Latin-1 range for the sake of example; converted to a UTF-8 encoded byte stream for JSON), the input parsed as JSON, printed to STDOUT on a UTF-8 terminal with your lines added.

      use Data::Dumper;
      use Encode qw(encode);
      use JSON;
      use utf8;
      use open ":std", ":encoding(UTF-8)";
      #use open ":std", ":locale"; ## totally didn't do anything for me

      my $j = qq/{ "Particípio passad&#337;": 1 }/;
      my $jp = JSON->new->utf8;
      my $d = $jp->decode(encode("UTF-8", $j));

      print "$j\n";
      print Dumper($d);
      print Dumper($j);
      print Dumper(encode("UTF-8", $j));

      The output:
      { "Particípio passadő": 1 }
      $VAR1 = {
                "Partic\x{ed}pio passad\x{151}" => 1
              };
      $VAR1 = "{ \"Partic\x{ed}pio passad\x{151}\": 1 }";
      $VAR1 = '{ "Particípio passadÅ": 1 }';
      

      According to the manual, evaling the Dumper output should give us back the original data, so the escaped wide characters in the string seem right to me. Peculiar is how, when given a UTF-8 byte stream, it will not escape things and dump something awkward instead (last line). With $Data::Dumper::Useqq set it produces a better-looking string:

      $VAR1 = "{ \"Partic\303\255pio passad\305\221\": 1 }";
      

        Oh, I do highly disapprove of the site mangling my Unicode character ő inside a code block.

        I have a feeling that Dumper shouldn’t actually produce ambiguous output, so escaped something is what we should expect from it.

        That's why the following should be the default:

        local $Data::Dumper::Useqq = 1;
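A minimal sketch of the difference that setting makes for a byte string. The input is hand-typed UTF-8 bytes for illustration, and only the escaped form is printed so the output stays plain ASCII.

```perl
use strict;
use warnings;
use Data::Dumper;

# UTF-8 bytes for "Particípio", typed out by hand for illustration.
my $bytes = "Partic\xc3\xadpio";

# Default settings: single-quoted output, high bytes written as they are.
my $plain = Dumper($bytes);

# With Useqq: double-quoted output, non-ASCII bytes written as escapes.
my $quoted = do {
    local $Data::Dumper::Useqq = 1;
    Dumper($bytes);
};

print $quoted;
```

The escaped form survives any terminal or output layer unchanged, which is the argument for preferring it when debugging encoding problems.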
