PerlMonks  

Why am I having so much trouble and pain with UTF-8 in perl?

by Lightknight (Sexton)
on Oct 01, 2010 at 14:47 UTC (#862973)
Lightknight has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to get my set of programs to work properly with UTF-8, and I'm running into all kinds of issues. Isn't UTF-8 the default everywhere now? In Debian and Ubuntu at least it has been for a while, but not in perl, it seems. How can I get perl to DWIM: use UTF-8 everywhere (unless I explicitly ask otherwise, e.g. open O, ">:raw", $file or die)?

Here are my frustrating experiences and what I've had to do. I'm hoping I'm missing some important use utf8completely; or something that will make my perl and UTF-8 life easier.

First I discovered that even though my source is written in UTF-8, and that is the default on my system, I still need use utf8; in every file.

If I want files read and written properly, I also need use open qw{:encoding(utf8) :std}; or the shorter use open ":locale"; in every .pl or .cgi file. But not before a use Foo; if Foo.pm has a non-UTF-8 character anywhere in the file (even in a comment), or the use triggers a warning: utf8 "\xA9" does not map to Unicode. So I now have to be careful to put use open ... last, or at least after use-ing such libs (most of which I admittedly wrote myself). It is confusing that a lib which doesn't declare use utf8; still needs to be valid UTF-8 nonetheless, don't you think?

So far so good.

Now Data::Dumper. perl -e 'use utf8; use open qw(:locale); use Data::Dumper; print Dumper("ü")' prints out:

$VAR1 = "\x{fc}";

Ok, so this apparently is not considered a bug in Data::Dumper. But it doesn't DWIM (just give me the ü!!!). I'm writing my code in UTF-8, so 'ü' is much handier than "\x{fc}", even if they are equivalent behind the scenes, especially now that I'm trying to use Data::Dumper to debug my UTF-8 issues. But AHA! $Data::Dumper::Useperl=1 to the rescue: perl -e 'use utf8; use open qw(:locale); use Data::Dumper; $Data::Dumper::Useperl= 1; print Dumper("ü")' prints out:

$VAR1 = 'ü';

Fantastic, even though it is much slower. Now I can trust the output of Data::Dumper (even though a little doubt remains: Why does Useperl=1 produce different output? Will this change in future versions of perl? Oh, never mind...)! Oh, wait; now I have problems with:

  • Log::Log4perl (I needed a line like log4perl.appender.name.utf8=1 in my config file - didn't default to accepting utf-8)
  • Term::ReadLine(::Gnu) I still haven't figured out why the utf8::decode($str) is necessary here. I thought I told perl that all input and output was to be in UTF-8 with the use open... line.
    #!/usr/bin/perl -w
    use strict;
    use utf8;
    use Data::Dumper;
    use Encode;
    $Data::Dumper::Useperl=1;
    use Term::ReadLine;
    use open (':locale');
    my $term = Term::ReadLine->new('xmlapish');
    my $str = $term->readline('> ');
    
    # Need one of these or File size below becomes 4. 4!!! The 4 bytes are WRONG in
    # any way you choose to look at it!
    #
    # use Encode; $str = decode_utf8($str);
    utf8::decode($str);
    
    chomp $str;
    print Dumper($str);
    
    open O, ">", "file" or die;
    print O $str;
    close O;
    printf "File size: %d\n", [stat('file')]->[7];
    
    Googling for "Term::ReadLine utf8" or "GNU ReadLine utf8" both give nothing interesting.

And the ('Data::Dumper', 'Log::Log4perl', 'Term::ReadLine') list was just from the hour of testing I did after I tried to switch from ISO-8859-1 to UTF-8. There are probably many more UTF-8 issues waiting to be found. I really find it a frustrating experience to use perl and UTF-8 together, I must say! Am I doing something fundamentally wrong?

Re: Why am I having so much trouble and pain with UTF-8 in perl?
by moritz (Cardinal) on Oct 01, 2010 at 15:14 UTC
    But it doesn't DWIM (just give me the ü!!!)

    Then just print it directly, without the Data::Dumper step.

    Data::Dumper is supposed to return perl code that evaluates to the original. If Data::Dumper emitted ü, it would only work if use utf8 was in scope.
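
A minimal sketch of that round-trip property, not taken from the thread: the helper name roundtrip is hypothetical, the string is upgraded to mimic a literal written under use utf8, and $Data::Dumper::Useqq is set as an assumption so that the escaped form is produced by both the XS and the pure-Perl dumper.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper (not from the thread): dump a value, then eval the dump.
sub roundtrip {
    my ($value) = @_;
    utf8::upgrade($value);            # character string, as a literal under "use utf8" would be
    local $Data::Dumper::Terse = 1;   # emit only the value, without "$VAR1 = "
    local $Data::Dumper::Useqq = 1;   # assumption: force escaped, double-quoted output
    my $code = Dumper($value);
    my $copy = eval $code;            # note: no "use utf8" is in scope here
    die $@ if $@;
    return ($code, $copy);
}

# U+00FC is written as an escape so this file itself stays pure ASCII.
my ($code, $copy) = roundtrip("\x{fc}");
print "dump:       $code";
print "ASCII only: ", ($code =~ /[^\x00-\x7f]/ ? "no" : "yes"), "\n";
print "round trip: ", ($copy eq "\x{fc}" ? "ok" : "FAILED"), "\n";
```

Because the dump contains only ASCII escapes, it evaluates to the same string whether or not the reading code declares use utf8.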

    Term::ReadLine(::Gnu) I still haven't figured out why the utf8::decode($str) is necessary here. I thought I told perl that all input and output was to be in UTF-8 with the use open... line.

    Term::ReadLine reads from the terminal, not from any of the standard streams. And you never told Perl about the terminal.... It's probably also the difference between plain read/readline and sysread (which doesn't use any IO layers at all).
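
The "File size: 4" symptom is consistent with double encoding. The following is a rough sketch, assuming the line typed at the prompt was a single ü and that the output handle carries a UTF-8 encoding layer; bytes_written is a hypothetical helper that stands in for that layer by calling Encode::encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many bytes a UTF-8 output layer would write for $str.
sub bytes_written {
    my ($str) = @_;
    return length encode('UTF-8', $str);
}

# The two bytes a UTF-8 terminal sends for U+00FC, as returned undecoded.
my $raw = "\xC3\xBC";

# Undecoded, perl sees two Latin-1 characters and encodes each one again.
my $undecoded = bytes_written($raw);

# Decoded, perl sees one character, which encodes back to the original two bytes.
my $text = $raw;
utf8::decode($text);
my $decoded = bytes_written($text);

printf "undecoded: %d bytes, decoded: %d bytes\n", $undecoded, $decoded;
# prints "undecoded: 4 bytes, decoded: 2 bytes"
```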

    Am I doing something fundamentally wrong?

    No. It's just that Perl 5 doesn't properly expose the difference between binary data and text to the user, so debugging an encoding mess is always a hassle.

    FWIW I found Devel::Peek useful for debugging non-ASCII strings
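
One possible way to use it, as a sketch only: Devel::Peek's Dump writes the internal representation of a scalar to STDERR, including the UTF8 flag and the raw bytes. The describe helper is hypothetical and merely summarizes the same information in one line.

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

# Hypothetical helper: one-line summary of how perl currently sees a string.
sub describe {
    my ($str) = @_;
    my @points = map { sprintf('U+%04X', ord($_)) } split //, $str;
    return sprintf 'length=%d flag=%s codepoints=%s',
        length($str),
        (utf8::is_utf8($str) ? 'on' : 'off'),
        join(' ', @points);
}

my $bytes = "\xC3\xBC";   # undecoded UTF-8 bytes for U+00FC
my $chars = $bytes;
utf8::decode($chars);     # the same data, decoded into one character

print describe($bytes), "\n";
print describe($chars), "\n";

# Full internal view, written to STDERR.
Dump($bytes);
Dump($chars);
```

In the Dump output, the decoded string should show UTF8 among its FLAGS, while the undecoded one should not.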

    Perl 6 - links to (nearly) everything that is Perl 6.
