PerlMonks  

"ISO-8859-1 0x80-0xFF" and chr()

by remiah (Hermit)
on Mar 23, 2012 at 10:50 UTC

remiah has asked for the wisdom of the Perl Monks concerning the following question:

Until now, my routine has been to decode input bytes into characters, and then encode the characters back into bytes when printing. It looked like this.

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode decode/;
my ($byte,$decoded);
#get bytes
$byte=`perl -MEncode -e "print encode('UTF-8',chr(hex('00E9')))"`;
#decode bytes to char
$decoded=decode('UTF-8', $byte);
#encode char to byte for print
print encode('UTF-8', $decoded)
This prints "é". Yesterday I stumbled over a problem with the ISO-8859-1 0x80-0xFF range. The code below prints "�" (the replacement character), which confused me.
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode decode/;
my ($byte,$decoded);
#$byte=chr(hex('0041'));
$byte=chr(hex('00E9'));
#decode bytes to char
$decoded=decode('UTF-8', $byte);
#encode char to byte for print
print encode('UTF-8', $decoded);
There were two things I didn't understand, and they confused me.
1. chr() returns a character, not bytes. (silly me)
2. Characters in the "ISO-8859-1 0x80-0xFF" range need some care.
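
The two points above can be checked directly. The following is a minimal sketch, assuming a stock Perl with the core Encode module; the variable names are only illustrative. It shows that chr() yields a single character, that encoding it produces two UTF-8 bytes, and that decoding the lone 0xE9 as UTF-8 falls back to the replacement character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# chr() returns a one-character string holding code point U+00E9.
my $char = chr(hex('00E9'));

# Encoding that character as UTF-8 yields the two bytes C3 A9.
my $utf8_bytes = encode('UTF-8', $char);

# A lone 0xE9 is not well-formed UTF-8, so decode() with its default
# error handling substitutes U+FFFD (REPLACEMENT CHARACTER).
my $bad = decode('UTF-8', chr(hex('00E9')));

printf "length of chr() result: %d\n", length $char;
printf "UTF-8 bytes (hex): %s\n", unpack('H*', $utf8_bytes);
printf "first code point after bad decode: U+%04X\n", ord $bad;
```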

I have to "upgrade" the result of chr() when the character is in the "ISO-8859-1 0x80-0xFF" range. So, if I want to go back to my ordinary routine of decode and encode, I have to do it like this.
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode decode/;
my ($chr,$decoded);
$chr=chr(hex('00E9')); #it is already not bytes but character.
#use utf8::upgrade from "native encoding" to "UTF-8 encoding"(perl internal)
utf8::upgrade($chr);
#now you can encode char to byte for print
print encode('UTF-8', $chr);
This prints "é". I have read http://en.wikibooks.org/wiki/Perl_Programming/Unicode_UTF-8, and that page told me "dbd drivers must be clever than me". That may be true, if I am clever enough to ask them properly to do the decoding...

I shell out to perl to get the bytes of "é", but I guess there must be a more elegant way to get bytes. Tomorrow I should look into pack(). Good night.
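
For the question of getting the bytes without spawning a second perl, here is one possible sketch. It assumes only the core Encode module and the built-in pack(); the variable names are made up for illustration. All three approaches should produce the same two-byte string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three in-process ways to obtain the UTF-8 bytes of U+00E9.
my $from_encode  = encode('UTF-8', chr(0xE9));  # encode a character
my $from_escapes = "\xC3\xA9";                  # write the bytes literally
my $from_pack    = pack('C*', 0xC3, 0xA9);      # build them with pack()

# The usual decode/encode round trip then works on any of them.
my $decoded = decode('UTF-8', $from_pack);
print encode('UTF-8', $decoded), "\n";
```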

Replies are listed 'Best First'.
Re: "ISO-8859-1 0x80-0xFF" and chr()
by moritz (Cardinal) on Mar 23, 2012 at 12:45 UTC
    1. chr() returns a character, not bytes. (silly me)

    While "bytes" and "characters" are a useful mental image, it's not always correct. The operation defines the context. For example, uc interprets a string as text no matter what, whereas print interprets a string as bytes (if it can).

    The real problem is that the byte 0xe9 cannot be decoded as UTF-8, because it isn't UTF-8. Either do nothing with it (which works on sufficiently modern perls), or decode it as Latin-1, because Latin-1 (aka ISO-8859-1) maps each byte exactly to the same codepoint number.
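
Both options can be sketched as follows, assuming the core Encode module; the variable names are illustrative only. Decoding the byte as Latin-1 and encoding the untouched string should end up with the same UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $byte = "\xE9";   # a single Latin-1 byte, not valid UTF-8 on its own

# Option 1: decode as Latin-1, so byte 0xE9 becomes code point U+00E9.
my $via_latin1 = decode('ISO-8859-1', $byte);
my $out        = encode('UTF-8', $via_latin1);

# Option 2: do nothing and encode the string as it is.
my $direct = encode('UTF-8', $byte);

printf "via Latin-1: %s\n", unpack('H*', $out);
printf "direct:      %s\n", unpack('H*', $direct);
```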

    Note that instead of calling encode() on each output string, you can also set an IO layer which does it automatically:

    binmode STDOUT, ':encoding(UTF-8)';

    Or on the command line, you can set that up with the -C option:

    $ perl -CS -wE 'say chr hex "E9"'
    é
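
To see what the IO layer does without depending on a terminal, the sketch below prints to an in-memory filehandle instead of STDOUT (that substitution is an assumption made purely for inspection; the handle and buffer names are hypothetical). The layer should turn the single character into two UTF-8 bytes on output, with no explicit encode() call.

```perl
use strict;
use warnings;

# Write to an in-memory buffer so the emitted bytes can be inspected.
my $buffer = '';
open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";

# Same layer as in the reply, applied to this handle rather than STDOUT.
binmode $out, ':encoding(UTF-8)';

print {$out} chr(0xE9);   # print a character; the layer does the encoding
close $out;

printf "bytes written (hex): %s\n", unpack('H*', $buffer);
```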

      Thanks for reply, moritz.

      I was careless about "utf8" versus "UTF-8" before I read that document. moritz seems to be a careful person. And the -CS option is very useful.

Re: "ISO-8859-1 0x80-0xFF" and chr()
by choroba (Cardinal) on Mar 23, 2012 at 12:28 UTC
    Where does your problematic string come from? If it comes from the code,
    use utf8;
    If the encoding is different, you can replace utf8 with encoding('iso-8859-2') etc.

    If the string comes from a filehandle,

    open my $FH, '<:utf8', ...
    If the encoding is different, you can replace utf8 with encoding(iso-8859-2) etc.

    If the string comes from DBI, your driver might support encoding (for example, Postgres's connect supports the pg_enable_utf8 attribute).

    And so on.
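
As a small illustration of the filehandle case, the sketch below reads through a decoding layer. It uses an in-memory scalar as a stand-in for a real file (an assumption so the example is self-contained), and the stricter ':encoding(UTF-8)' spelling in place of ':utf8'.

```perl
use strict;
use warnings;

# Stand-in for file contents: the UTF-8 bytes of "é" plus a newline.
my $bytes = "\xC3\xA9\n";

# The layer decodes on read, so the program sees characters, not bytes.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "cannot open: $!";
my $line = <$fh>;
close $fh;
chomp $line;

printf "%d character(s), first is U+%04X\n", length $line, ord $line;
```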

      Thanks for reply.

      The problematic string came from chr(). I didn't know I could paste 'é' at PerlMonks, so I tried to create it with chr(hex()), and I stumbled.

      The OP of the thread "Bug in Template?" said he decodes with the database driver and prints the result in Template like this.

      my $t =Template->new();
      $t->process("his.tmpl", {lines=>\@vars}, "output.html" ) or die $t->error();
      Template wants encoded bytes, not decoded characters. This prints "#�#".
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode qw(decode encode);
      use Template;
      my($a,$decoded);
      #input bytes to $a
      $a=`perl -CS -e "use utf8;print 'é'"`;
      #decode it to character
      $decoded=decode('UTF-8', $a);
      #this will print replacement character to test_out1.html
      my $t=Template->new();
      $t->process("test.tmpl",{a=>$decoded},"test_out1.html");
      And below is the template for that.
      <html>
      <head>
      <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
      </head>
      <body>
      #[% a %]#
      </body>
      </html>
      Encoding the decoded string back to bytes before passing it to Template works.
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Encode qw(decode encode);
      use Template;
      my($a,$decoded,$encoded);
      #input bytes to $a
      $a=`perl -CS -e "use utf8;print 'é'"`;
      #decode it to character
      $decoded=decode('UTF-8', $a);
      $encoded=encode('UTF-8', $decoded);
      #this is good
      my $t=Template->new();
      $t->process("test.tmpl",{a=>$encoded},"test_out2.html");
      There seems to be huge confusion around Template Toolkit's encoding problems here in Japan. My conclusion so far: "pass encoded bytes to Template, not decoded characters".

        Hmm
        $ perldoc template |grep -i utf8 -C2
        Alternately, the "binmode" argument can specify a particular IO layer
        such as ":utf8".

            $tt->process($infile, $vars, $outfile, binmode => ':utf8')
                || die $tt->error(), "\n";
