PerlMonks  

Re^4: JSON::XS Cyrillic unicode not saving properly

by cormanaz (Deacon)
on Mar 21, 2021 at 23:36 UTC ( [id://11130059]=note )


in reply to Re^3: JSON::XS Cyrillic unicode not saving properly
in thread JSON::XS Cyrillic unicode not saving properly

Aha! I have verified that it's not the db transaction by running the code in a visual debugger. After the query, @items contains tweets in Cyrillic. I printed those out to a flat file, and it opens as Cyrillic. The problem is in JSON::XS.

The docs say you have to use the OO interface and enable utf8 encoding. I tried doing this by changing the print statement to print OUT JSON::XS->new->utf8->encode($sample); but that still produces a JSON file with ASCII characters. The docs on the OO interface are a little confusing. Does anyone know the right way to do this?

Replies are listed 'Best First'.
Re^5: JSON::XS Cyrillic unicode not saving properly
by Anonymous Monk on Mar 22, 2021 at 04:40 UTC

    Hi,

    Here you go

    perlunitut: Unicode in Perl#I/O flow (the actual 5 minute tutorial)

    Now ask your program (or mine): who is doing the byte encoding (which statement, which function/method)?

        #!/usr/bin/perl --
        use strict;
        use warnings;
        use Data::Dump qw/ dd /;
        use Path::Tiny qw/ path /;
        use JSON::XS();
        use JSON::PP();

        my @humps = "\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}";

        dd( JSON::XS->new->pretty(1)->encode( \@humps ) );
        dd( JSON::PP->new->pretty(1)->encode( \@humps ) );
        dd( JSON::XS->new->utf8(1)->pretty(1)->encode( \@humps ) );
        dd( JSON::PP->new->utf8(1)->pretty(1)->encode( \@humps ) );
        dd( JSON::XS->new->ascii(1)->pretty(1)->encode( \@humps ) );
        dd( JSON::PP->new->ascii(1)->pretty(1)->encode( \@humps ) );
        print "#" x 6, "\n";
        path( 'deleteme.txt')->spew_raw( JSON::XS->new->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_raw );
        path( 'deleteme.txt')->spew_raw( JSON::PP->new->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_raw );
        print "#" x 6, "\n";
        path( 'deleteme.txt')->spew_utf8( JSON::XS->new->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_raw );
        dd( path( 'deleteme.txt')->slurp_utf8 );
        path( 'deleteme.txt')->spew_utf8( JSON::PP->new->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_raw );
        dd( path( 'deleteme.txt')->slurp_utf8 );
        print "#" x 6, "\n";
        path( 'deleteme.txt')->spew_utf8( JSON::XS->new->utf8(1)->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_utf8 );
        path( 'deleteme.txt')->spew_utf8( JSON::PP->new->utf8(1)->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_utf8 );
        print "#" x 6, "\n";
        path( 'deleteme.txt')->spew_utf8( JSON::XS->new->ascii(1)->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_utf8 );
        path( 'deleteme.txt')->spew_utf8( JSON::PP->new->ascii(1)->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_utf8 );
        print "#" x 6, "\n";
        path( 'deleteme.txt')->spew_raw( JSON::XS->new->utf8(1)->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_utf8 );
        path( 'deleteme.txt')->spew_raw( JSON::PP->new->utf8(1)->pretty(1)->encode( \@humps ) );
        dd( path( 'deleteme.txt')->slurp_utf8 );
        # path( 'deleteme.txt')->remove;
        __END__
        "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
        "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
        "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
        "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
        "[\n \"\\ufeff\\ud83d\\udc2a one hump two humps \\ud83d\\udc2b\"\n]\n"
        "[\n \"\\ufeff\\ud83d\\udc2a one hump two humps \\ud83d\\udc2b\"\n]\n"
        ######
        Wide character in print at C:/perl/site/lib/Path/Tiny.pm line 1848.
        "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
        Wide character in print at C:/perl/site/lib/Path/Tiny.pm line 1848.
        "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
        ######
        "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
        "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
        "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
        "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
        ######
        "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
        "[\n \"\xEF\xBB\xBF\xF0\x9F\x90\xAA one hump two humps \xF0\x9F\x90\xAB\"\n]\n"
        ######
        "[\n \"\\ufeff\\ud83d\\udc2a one hump two humps \\ud83d\\udc2b\"\n]\n"
        "[\n \"\\ufeff\\ud83d\\udc2a one hump two humps \\ud83d\\udc2b\"\n]\n"
        ######
        "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
        "[\n \"\x{FEFF}\x{1F42A} one hump two humps \x{1F42B}\"\n]\n"
        "🐪 one hump two humps 🐫"
        "🐪 one hump two humps 🐫"
      I have no idea what I am supposed to make of that example.
Re^5: JSON::XS Cyrillic unicode not saving properly
by cormanaz (Deacon) on Mar 22, 2021 at 15:33 UTC
    After some more doc-diving, I discovered the problem: if you use encode_json (which is equivalent to $json_text = JSON::XS->new->utf8->encode($perl_scalar)) AND set the file encoding to UTF-8, then the text gets double-encoded. When I do this
        open(OUT,">twitter-non-en.json") or die "Can't open output: $!";
        #binmode OUT, ':encoding(UTF-8)';
        print OUT encode_json($sample);
        close OUT;
    The JSON contains the expected Cyrillic text. Thanks for all the input.
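
    For anyone landing here later, the double-encoding described above can be reproduced without touching the disk. The following is only a minimal sketch under a few assumptions: it uses the core JSON::PP module as a stand-in for JSON::XS (the utf8 and encode methods behave the same way for this purpose), the sample data is a made-up Cyrillic string rather than the real tweets, and bytes_written is a hypothetical helper that writes to an in-memory filehandle and returns the bytes that would have ended up in the file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP ();    # core module, used here as a stand-in for JSON::XS

# Hypothetical sample data: one six-letter Cyrillic word, written with escapes
# so that this source file itself stays plain ASCII.
my $sample = [ "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}" ];

# Hypothetical helper: print $text to an in-memory file through the given
# PerlIO layer and return the raw bytes that would land on disk.
sub bytes_written {
    my ( $layer, $text ) = @_;
    my $buffer = '';
    open( my $fh, ">$layer", \$buffer ) or die "Can't open in-memory file: $!";
    print {$fh} $text;
    close $fh or die "Can't close in-memory file: $!";
    return $buffer;
}

# With ->utf8 the encoder returns UTF-8 encoded bytes (what encode_json does).
my $json_bytes = JSON::PP->new->utf8->encode($sample);

# Without ->utf8 the encoder returns a string of Perl characters.
my $json_chars = JSON::PP->new->encode($sample);

# Encoded exactly once, by the JSON encoder; the handle passes bytes through.
my $ok_raw = bytes_written( ':raw', $json_bytes );

# Encoded exactly once, by the :encoding(UTF-8) layer on the handle.
my $ok_layer = bytes_written( ':encoding(UTF-8)', $json_chars );

# Encoded twice: already-encoded bytes are pushed through an encoding layer.
my $doubled = bytes_written( ':encoding(UTF-8)', $json_bytes );

print "utf8 encoder + raw handle:      ", length($ok_raw),   " bytes\n";
print "plain encoder + encoding layer: ", length($ok_layer), " bytes\n";
print "utf8 encoder + encoding layer:  ", length($doubled),  " bytes\n";
print "the two single-encoded results are ",
    ( $ok_raw eq $ok_layer ? "identical" : "different" ), "\n";
```

    The point of the sketch is that exactly one party should do the byte encoding: either the JSON encoder (->utf8 or encode_json, written to a raw handle) or the filehandle layer (an encoder without ->utf8, written to an :encoding(UTF-8) handle). Doing both is what inflates the output and mangles the Cyrillic.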
