Re^2: JSON, Data::Dumper and accented chars in utf-8

I have a feeling that Dumper shouldn’t actually produce ambiguous output, so escaped something is what we should expect from it.

So unless I did something horribly wrong, this is the input (sprinkled with a Unicode character outside Latin-1 range for the sake of example; converted to a UTF-8 encoded byte stream for JSON), the input parsed as JSON, printed to STDOUT on a UTF-8 terminal with your lines added.

use Data::Dumper;
use Encode qw(encode);
use JSON;
use utf8;
use open ":std", ":encoding(UTF-8)";
#use open ":std", ":locale";  ## totally didn't do anything for me

my $j = qq/{ "Particípio passad&#337;": 1 }/;
my $jp = JSON->new->utf8;
my $d = $jp->decode(encode("UTF-8", $j));

print "$j\n";
print Dumper($d);
print Dumper($j);
print Dumper(encode("UTF-8", $j));

The output:

{ "Particípio passadő": 1 }
$VAR1 = {
          "Partic\x{ed}pio passad\x{151}" => 1
        };
$VAR1 = "{ \"Partic\x{ed}pio passad\x{151}\": 1 }";
$VAR1 = '{ "ParticÃpio passadÅ": 1 }';

According to the manual, evaling the Dumper output should give us back the original data, so the escaped wide characters in the string seem right to me. What is peculiar is that, when given a UTF-8 byte stream, it does not escape anything and dumps something awkward instead (last line). With $Data::Dumper::Useqq set it produces a better-looking string:

$VAR1 = "{ \"Partic\303\255pio passad\305\221\": 1 }";
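
To make the difference easier to reproduce, here is a minimal sketch that dumps the same text once as a character string and once as UTF-8 bytes, both with Useqq set. The variable names are mine, and the string is written with escapes so the source stays pure ASCII.

```perl
use strict;
use warnings;
use Data::Dumper;
use Encode qw(encode);

# Ask Dumper for double-quoted output with escapes.
$Data::Dumper::Useqq = 1;

# Written with escapes so the source file itself stays pure ASCII.
my $chars = "Partic\x{ed}pio passad\x{151}";  # character string (code points)
my $bytes = encode("UTF-8", $chars);          # the same text as UTF-8 bytes

my $dump_chars = Dumper($chars);  # code points appear as \x{...} escapes
my $dump_bytes = Dumper($bytes);  # bytes appear as octal escapes

print $dump_chars;
print $dump_bytes;
```

Both dumps are plain ASCII, so they can be printed to any handle without worrying about its encoding layer.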


Replies are listed 'Best First'.
Re^3: JSON, Data::Dumper and accented chars in utf-8 by ikegami (Patriarch) on Jan 22, 2012 at 06:21 UTC
"I have a feeling that Dumper shouldn’t actually produce ambiguous output, so escaped something is what we should expect from it."
That's why the following should be the default: `local $Data::Dumper::Useqq = 1;`
Re^3: JSON, Data::Dumper and accented chars in utf-8 [OFF/Gripe] by Ralesk (Pilgrim) on Jan 21, 2012 at 22:11 UTC
Oh, I do highly disapprove of the site mangling my Unicode character ő inside a code block.
Re^4: JSON, Data::Dumper and accented chars in utf-8 [OFF/Gripe] by silentius (Monk) on Jan 21, 2012 at 22:42 UTC
Thank you both for your replies, although they did not solve my problem. I kept searching and solved it simply like this:

use Encode;
use Encode::Escape;
use Data::Dumper;
use JSON;
...
while ($line = <IN>) {
    $strut = from_json($line);
    print decode('unicode-escape', Dumper($strut)) . "\n";
}

This now gives me the output I need, which is the accented chars displayed as they are, since the output is to be redirected to a text file and read by humans in a regular text editor. Thank you all once again.
Re^5: JSON, Data::Dumper and accented chars in utf-8 by Ralesk (Pilgrim) on Jan 21, 2012 at 23:04 UTC
And why isn’t the JSON data enough? JSON has a `pretty` option, which would pretty-print the data for you, not requiring you to (ab)use Dumper.

Keep in mind that JSON data generated by `JSON` with the `utf8` option on (which is what `encode_json` does) is a byte stream and not a (Unicode) string. To print it in a terminal you should decode it into a string and tell Perl your terminal is UTF-8, as suggested above. Sending byte streams to terminals is a bad thing, definitely don’t do that.

To print it into a file, you might be best off with a file that you opened as binary (`open my $FH, '>:raw', "myfilename.txt"`). Any byte thrown at a raw file will appear there as intended. Or, of course, you can decode it into a string and print that into the file. Perl will know that it’s a string, and will either pick the byte stream format (aka encoding) for the file itself or use the one you specify, e.g. `'>:encoding(UTF-8)'`. Letting Perl do it for you is not necessarily a good idea :)

Yes, fully aware it looks extremely complicated :)
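
A minimal sketch of those two file-writing routes, in case it helps. It assumes the core module `JSON::PP` as a stand-in for `JSON` (the `utf8`, `pretty`, `encode` and `decode` methods used here are shared by both), and the temp files and the `slurp_raw` helper are made up for the example.

```perl
use strict;
use warnings;
use JSON::PP;                 # core module; assumed stand-in for JSON
use File::Temp qw(tempfile);

# ASCII-only JSON input; the \u escapes are decoded by the JSON parser.
my $json_in = '{ "Partic\u00edpio passad\u0151": 1 }';
my $data    = JSON::PP->new->utf8->decode($json_in);

# Hypothetical output files, created as temp files for this sketch.
my (undef, $raw_path)   = tempfile(UNLINK => 1);
my (undef, $layer_path) = tempfile(UNLINK => 1);

# Route 1: the encoder produces UTF-8 bytes; the file handle is raw.
my $bytes = JSON::PP->new->utf8->pretty->encode($data);
open my $raw_fh, '>:raw', $raw_path or die "Cannot open $raw_path: $!";
print {$raw_fh} $bytes;
close $raw_fh or die "Cannot close $raw_path: $!";

# Route 2: the encoder produces a character string; the layer encodes it.
my $chars = JSON::PP->new->pretty->encode($data);
open my $enc_fh, '>:encoding(UTF-8)', $layer_path
    or die "Cannot open $layer_path: $!";
print {$enc_fh} $chars;
close $enc_fh or die "Cannot close $layer_path: $!";

# Read a file back as raw bytes so the two results can be compared.
sub slurp_raw {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                 # slurp mode: read the whole file at once
    my $content = <$in>;
    close $in;
    return $content;
}

print "Both routes wrote identical bytes\n"
    if slurp_raw($raw_path) eq slurp_raw($layer_path);
```

Either way the text is encoded exactly once. The thing to avoid is mixing the routes: bytes printed through an encoding layer get encoded twice, and a character string printed to a raw handle triggers "Wide character" warnings.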
Re^4: JSON, Data::Dumper and accented chars in utf-8 [OFF/Gripe] by ikegami (Patriarch) on Jan 22, 2012 at 06:24 UTC
It's not the site that did that; it's your browser. "ő" doesn't exist in Windows-1252, so your browser decided to send "`&#337;`" instead. PerlMonks is displaying "`<code>&#337;</code>`" as "`&#337;`" as it should.
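
The substitution can be imitated in Perl, as a rough illustration only (this is Encode's fallback mechanism, not the browser's actual code): encoding to cp1252 with `Encode::FB_HTMLCREF` turns any character missing from the target encoding into a decimal character reference.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00ED exists in cp1252, U+0151 does not.
my $text = "Partic\x{ed}pio passad\x{151}";

# FB_HTMLCREF replaces unencodable characters with &#NNN; references
# and leaves the source string untouched.
my $submitted = encode("cp1252", $text, Encode::FB_HTMLCREF());

# Pull out the reference that replaced the unencodable character.
my ($ref) = $submitted =~ /(&#\d+;)/;
print "U+0151 was sent as $ref\n";
```

The accented i survives as the single byte 0xED, while U+0151 comes out as `&#337;` (0x151 is 337 in decimal).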
Re^5: JSON, Data::Dumper and accented chars in utf-8 [OFF/Gripe] by Ralesk (Pilgrim) on Jan 22, 2012 at 16:23 UTC
Ah, cp1252, how retro! Still, in the end, the process mangles perfectly viable characters that are by no means special to node syntax, and that’s just bad.

