UTF8 with YAML or JSON

by SBECK (Friar)
on Jun 29, 2012 at 16:08 UTC (#979143, perlquestion)
SBECK has asked for the wisdom of the Perl Monks concerning the following question:

In one of my modules (Date::Manip) I store a bunch of UTF8 data in a YAML file which I then load into a perl data structure. The basic form looks like this:

#!/usr/bin/perl
use strict;
use warnings;
use YAML::Syck;

my @in = <DATA>;
my $in = join("",@in);
my $dat = Load($in);
1;
__DATA__
---
x : &#259;

Note: the &#259; was entered in the question as the UTF8 character ă, but inside the code block it's displayed as above. There's probably some markup I could use to get it to display properly, but I didn't want to spend too much time getting sidetracked from the problem, so just pretend that &#259; and ă are the same.

YAML::Syck has one property that I haven't found in any of the other YAML (or JSON) modules... it doesn't do any handling of UTF8 (converting to perl encoding). What you put in is what you get out, so if you run the above script in the debugger and dump the value of $dat, you get:

DB<1> p Dumper $dat
$VAR1 = {
          'x' => '&#259;'
        };

Unfortunately, YAML::Syck is perhaps the least supported of the YAML modules and I'd like to switch to one of the more recent modules. If I change the above script to use YAML or YAML::XS (my preferred module), and then run it in the debugger, I get:

DB<1> p Dumper $dat
$VAR1 = {
          'x' => "\x{103}"
        };

That is, the string comes back as a decoded Perl character (\x{103}) rather than as UTF8 encoded bytes. I'm completely open to converting the YAML to JSON, but the JSON and JSON::XS modules do the same thing. I've tried the following script with similar results:

#!/usr/bin/perl
use strict;
use warnings;
use JSON::XS;

my @in = <DATA>;
my $in = join("",@in);
my $dat  = JSON::XS->new->decode($in);
my $dat2 = JSON::XS->new->utf8(0)->decode($in);
my $dat3 = JSON::XS->new->utf8(1)->decode($in);
1;
__DATA__
{ "x" : "&#259;" }
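
For comparison, here is a minimal sketch of what the utf8 option changes. It is an illustration only: it uses the core JSON::PP module (which shares the utf8 method with JSON::XS) and assumes the input is a plain byte string holding the two UTF-8 bytes of ă. Results may differ if the input has already been decoded, for example by an I/O layer.

```perl
use strict;
use warnings;
use JSON::PP;

# Assumption: the JSON text is raw UTF-8 bytes (no decoding layer applied).
my $json = qq({ "x" : "\xc4\x83" });

# utf8(1): the input is treated as UTF-8 bytes and decoded to characters.
my $decoded = JSON::PP->new->utf8(1)->decode($json);

# utf8(0): the input is treated as characters already, so each byte
# passes through as its own character.
my $as_is = JSON::PP->new->utf8(0)->decode($json);

print length($decoded->{x}), "\n";   # 1
print length($as_is->{x}), "\n";     # 2
```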

Obviously, once the data structure is produced, I could recurse through it and change the perl encodings back to UTF8, but rather than do that, I'll probably just stick with YAML::Syck.
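
The recursion mentioned above could look roughly like this. It is only a sketch: to_utf8_bytes is a hypothetical helper name, not something provided by any YAML or JSON module, and it assumes the structure contains only hashes, arrays, and plain scalars.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: return a copy of a nested structure in which every
# hash key and string leaf is encoded from Perl characters to UTF-8 bytes.
sub to_utf8_bytes {
    my ($node) = @_;
    my $type = ref $node;
    if ($type eq 'HASH') {
        return +{ map { ( encode_utf8($_) => to_utf8_bytes($node->{$_}) ) } keys %$node };
    }
    if ($type eq 'ARRAY') {
        return [ map { to_utf8_bytes($_) } @$node ];
    }
    # Leave undef and any other reference type (code refs, objects) untouched.
    return $node if $type || !defined $node;
    return encode_utf8($node);
}

my $chars = { x => "\x{103}", list => [ "\x{103}b" ] };
my $bytes = to_utf8_bytes($chars);
printf "%d character -> %d bytes\n", length($chars->{x}), length($bytes->{x});
```

The helper copies rather than modifies in place, so the decoded structure stays available for any code that wants characters.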

Any suggestions, or do I just stick to YAML::Syck?

Re: UTF8 with YAML or JSON
by zentara (Archbishop) on Jun 30, 2012 at 09:13 UTC
    You might try experimenting with the utf8::all module, and see what effect that has on your code. Its perldoc seems to address your problem.

    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku ................... flash japh
Re: UTF8 with YAML or JSON
by tobyink (Abbot) on Jun 30, 2012 at 06:48 UTC

    Instead of looking at Dumper($dat) you might find print length($dat->{x})."\n" enlightening.
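
A small sketch of what that length check reveals, using the core Encode module. The byte string here stands in for what YAML::Syck hands back and the decoded string for what YAML::XS hands back; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $bytes = "\xc4\x83";            # the two UTF-8 bytes of ă, left undecoded
my $chars = decode_utf8($bytes);   # the single character U+0103

print length($bytes), "\n";   # 2
print length($chars), "\n";   # 1
```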

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: UTF8 with YAML or JSON
by zwon (Abbot) on Jun 29, 2012 at 17:29 UTC
    Generally, if your code contains UTF-8 characters you should add the use utf8 pragma (or encoding); otherwise Perl sees it not as the character ă, but as the two bytes \x{c4}\x{83}.

      The 'use utf8' pragma is actually included in my module (though I omitted it from the simple code I posted here). However, I've played with it quite a bit and never got the results I wanted either.

      In the code posted here, the variable $in DOES contain UTF8 characters, but when you pass it to Load (which is obviously outside the scope of anything in this script) it gets converted. For that reason, though technically correct, I don't think that adding the pragma here will have any impact. If I'm wrong though, I'm certainly open to correction. I'm definitely not a UTF8 expert.

        never got the results I wanted either

        So do you actually want \x{c4}\x{83}, as YAML::Syck returns to you, or do you want \x{103}?

Re: UTF8 with YAML or JSON
by Your Mother (Chancellor) on Jul 02, 2012 at 13:47 UTC

    I'm mostly just repeating clues already given–

    my @in  = <DATA>;
    my $in  = join("",@in);
    my $dat = Load($in);
    print $dat->{x}, $/;
    print length($dat->{x}), $/;
    x: ă
    moo@cow~>perl -MYAML::Syck pm-979143
    moo@cow~>perl -MYAML::XS pm-979143
    Wide character in print at pm-979143 line 7, <DATA> line 1.
    moo@cow~>perl -CO -MYAML::XS pm-979143

    You can see that only the YAML::XS version is doing UTF-8. YAML::Syck's documentation listed it as deprecated until somewhat recently when it picked up a new maintainer. And JSON(::XS) is also a fine, maybe better, choice. Neither lets you off the hook for knowing what bytes v chars are in play.
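
As a sketch of the in-script counterpart to the -CO switch: putting an encoding layer on the output handle tells Perl to turn characters into UTF-8 bytes when printing, which is what silences the "Wide character" warning. The in-memory handle below is only there to make the written bytes visible; the variable names are illustrative.

```perl
use strict;
use warnings;

my $chars = "\x{103}";   # one character, as a decoding loader would return it

# Roughly what -CO does from the command line: encode on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print $chars, "\n";      # no "Wide character" warning

# The same layer on an in-memory handle shows the bytes that get written.
open my $fh, '>:encoding(UTF-8)', \my $buf or die "open failed: $!";
print {$fh} $chars;
close $fh;
printf "%vX\n", $buf;    # C4.83
```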

Re: UTF8 with YAML or JSON
by linuxkid (Sexton) on Jun 29, 2012 at 17:45 UTC

    When using a YAML module, always use YAML qw/LoadFile DumpFile/; that will allow you to get the structure without writing a loop to read the lines into a scalar.


      In this case, the sample code I posted is exactly like the code in the module... the YAML is stored in the __DATA__ section, so LoadFile/DumpFile aren't the routines I want (and Load/Dump are automatically exported, so there's no need to do that).

      That way, I don't have to play games determining where data files live, how they should be installed, etc.
