http://www.perlmonks.org?node_id=1112723

pcouderc has asked for the wisdom of the Perl Monks concerning the following question:

When I use:
#!/usr/bin/perl
#
use strict;
use XML::Twig;
#my $twig=XML::Twig->new(pretty_print => 'indented', keep_encoding => 1);
my $twig=XML::Twig->new(pretty_print => 'indented');
$twig->parse( '<?xml version="1.0" encoding="UTF-8"?><myxml/>');
my $root= $twig->root;
$root->set_att( 'fille' => 'clémence');
open(PF, "> out.xml") or die "can't open file $!\n";
$twig->print(\*PF);
I get:
cat out.xml
<?xml version="1.0" encoding="UTF-8"?>
<myxml fille="clÃ©mence"/>
So my question: who has corrupted Clémence? What have they done to her? I know this subject has already come up here, but without a clear answer.

Replies are listed 'Best First'.
Re: Lost in encoding in Twig
by hippo (Chancellor) on Jan 09, 2015 at 12:17 UTC

    Ah, unicode. Lucky you!

    If you have not done so already, I recommend reading perlunitut as an introduction to the subject - lots covered there. If you are in a rush to solve this particular problem, try use utf8 (because of your use of literals in the source code) and binmode (for your file-based output).
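A minimal sketch of that advice, assuming the script file itself is saved as UTF-8. It uses core Perl only (no XML::Twig), and an in-memory file stands in for out.xml so the result can be inspected directly:

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# With "use utf8" the literal is decoded into 8 characters.
my $name = 'clémence';

# In-memory file used only to keep the sketch self-contained.
my $buffer = '';
open(my $out, '>', \$buffer) or die "can't open buffer: $!\n";
binmode($out, ':encoding(UTF-8)');    # encode characters to UTF-8 on output
print {$out} $name;
close($out);

# Hex dump of the bytes that were actually written.
my $hex = join ' ', map { sprintf('%02x', ord($_)) } split(//, $buffer);
print "$hex\n";    # 63 6c c3 a9 6d 65 6e 63 65
```

The é comes out as the single two-byte sequence c3 a9, which is what a correct UTF-8 file should contain.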

      The recommendations by hippo are the usual ones, and I would actually code the script that way. But the fact is that when I run your script as originally posted and cat the output, I get '<myxml fille="clémence"/>'. Have you looked at out.xml with 'hexdump -C'? If you see 'clémence' encoded as '63 6c c3 a9 6d 65 6e 63 65', your Perl output is correct. But if your command shell is set to interpret output as ISO Latin-1, you would see 'clÃ©mence' even if your script is doing the right thing.

        Thank you both.
        I do not think I need use utf8, as my perl is v5.18 (Debian jessie).

        Hexdump on my code.pl is correct: c3 a9.
        Hexdump on my out.xml is incorrect and corresponds to what cat shows: c3 83 c2 a9 (Ã©).
        Sorry, I do not know how to check whether my "command shell is set to interpret output as ISO Latin". But my locale is: LANG=en_US.UTF-8.
        I can use keep_encoding to make it work (in this case), but I would like to understand...
Re: Lost in encoding in Twig
by Krambambuli (Curate) on Jan 09, 2015 at 16:46 UTC
    Try this:
    #!/usr/bin/perl
    use strict;
    use XML::Twig;
    use Encode qw( decode );
    my $twig=XML::Twig->new(pretty_print => 'indented');
    $twig->parse( '<?xml version="1.0" encoding="UTF-8"?><myxml/>');
    my $root= $twig->root;
    $root->set_att( 'fille' => decode('UTF-8', 'clémence') );
    open(PF, ">", "out.xml") or die "can't open file $!\n";
    $twig->print(\*PF);

    Krambambuli
    ---
      Thank you all for your help.
      My conclusions are:
      - UTF-8 is not as universal as I had hoped, probably for two reasons: the weight of the past and all the code already written. And probably also because it is mostly a problem for non-English speakers.
      - there is no UTF-8 default at the global Perl level (and it could be dangerous)
      - an excellent default at scope level, and then at script level, is:
      use utf8::all;
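utf8::all is a CPAN module, not part of core Perl. As a rough core-only approximation (an assumption about its most relevant effects here, not a full equivalent), use utf8 can be combined with the open pragma so that open() calls in the same lexical scope get a UTF-8 layer by default. The temporary file below is only there to make the sketch self-contained:

```perl
use strict;
use warnings;
use utf8;                              # assumption: this file is saved as UTF-8
use open qw(:std :encoding(UTF-8));    # default layer for open() and STDIN/STDOUT/STDERR
use File::Temp qw(tempfile);

# Temporary file standing in for out.xml.
my ($tmp, $path) = tempfile(UNLINK => 1);
close($tmp);

# No layer given here: the default from "use open" applies.
open(my $out, '>', $path) or die "can't open $path: $!\n";
print {$out} 'clémence';
close($out);

# Read the file back as raw bytes to see what was really written.
open(my $raw, '<:raw', $path) or die "can't open $path: $!\n";
my $bytes = do { local $/; <$raw> };
close($raw);

printf "%d bytes written\n", length($bytes);
```

Eight characters go in and nine bytes come out, because é is encoded once as c3 a9.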
Re: Lost in encoding in Twig
by Anonymous Monk on Jan 09, 2015 at 14:22 UTC
    So my question : who has corrupted Clémence ?
    As far as I can tell, XML::Twig::print did. It seems to me (from your post) the problem goes away with keep_encoding => 1?
    sub print {
        ...
        if($perl_version > 5.006 && ! $t->{twig_keep_encoding}) {
            if( grep /useperlio=define/, `$^X -V`) {
                binmode( $fh || \*STDOUT, ":utf8" );
            }
        }
        ...
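    Since you didn't use utf8; in your code, Clémence is not a decoded character string for Perl's purposes: the literal is just raw bytes, and each byte is treated as one Latin-1 character. Printing it to a :utf8 filehandle encodes those bytes a second time, which wrecks Clémence.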

    Also I don't understand what utf-8 has to do with Perl 5.018 in particular
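The double encoding described above can be reproduced with the core Encode module alone. This is a sketch, not XML::Twig's actual code path: encode() on the undecoded bytes stands in for printing them to a :utf8 filehandle, and the helper hex_of is a hypothetical name introduced for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Without "use utf8", Perl sees the literal as 9 raw bytes (é is c3 a9).
my $bytes = "cl\xc3\xa9mence";

# Encoding the undecoded bytes treats each byte as a Latin-1 character
# and encodes it again: c3 becomes c3 83, a9 becomes c2 a9.
my $wrecked = encode('UTF-8', $bytes);

# Decoding first yields an 8-character string, which then encodes cleanly.
my $intact = encode('UTF-8', decode('UTF-8', $bytes));

# Hypothetical helper: space-separated hex dump of a byte string.
sub hex_of {
    my ($str) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $str);
}

print hex_of($wrecked), "\n";    # 63 6c c3 83 c2 a9 6d 65 6e 63 65
print hex_of($intact),  "\n";    # 63 6c c3 a9 6d 65 6e 63 65
```

The first line matches the c3 83 c2 a9 seen in the hexdump of out.xml. The second shows why decoding the value before handing it to set_att, or using use utf8, avoids the problem.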

      Thank you,
      I thought I had seen somewhere that since something like 5.16, use utf8 was the default.
        I thought I had seen somewhere that since something like 5.16, use utf8 was the default.
        Unfortunately, no.

        From your other post:

        Do you mean that for each operation in Perl, and particularly for I/O, I MUST specify utf8?
        Yes.
        Is there any encoding other than utf8 in use today?
        Yes, but utf-8 surely would be a more useful default.
        Is it not possible to specify that everything is utf8 (barring exceptions) at the computer level? Or at least at the Perl level, or at least at the script level?
        Perhaps utf8::all will be suitable for you.