PerlMonks  

"ISO-8859-1 0x80-0xFF" and chr()

by remiah (Hermit)
on Mar 23, 2012 at 10:50 UTC (#961193)
remiah has asked for the wisdom of the Perl Monks concerning the following question:

Until now I have lived a simple life: decode the input bytes into characters, then encode the characters back into bytes when printing. My life looked like this.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw/encode decode/;

    my ($byte,$decoded);

    #get bytes
    $byte=`perl -MEncode -e "print encode('UTF-8',chr(hex('00E9')))"`;

    #decode bytes to char
    $decoded=decode('UTF-8', $byte);

    #encode char to byte for print
    print encode('UTF-8', $decoded);

This prints "é". Yesterday I stumbled over the ISO-8859-1 0x80-0xFF problem. The code below prints "�" (the replacement character), which confused me.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw/encode decode/;

    my ($byte,$decoded);

    #$byte=chr(hex('0041'));
    $byte=chr(hex('00E9'));

    #decode bytes to char
    $decoded=decode('UTF-8', $byte);

    #encode char to byte for print
    print encode('UTF-8', $decoded);

There were two things that I didn't understand and that confused me.
1. chr() returns characters, not bytes. (silly me)
2. "ISO-8859-1 0x80-0xFF" characters need some extra care.

I have to "upgrade" the result of chr() when the character is in the "ISO-8859-1 0x80-0xFF" range. So, if I want to go back to my ordinary life of decode and encode, I have to do it like this.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw/encode decode/;

    my ($chr,$decoded);

    $chr=chr(hex('00E9')); #it is already not bytes but character.

    #use utf8::upgrade from "native encoding" to "UTF-8 encoding"(perl internal)
    utf8::upgrade($chr);

    #now you can encode char to byte for print
    print encode('UTF-8', $chr);

This prints "é". I have read http://en.wikibooks.org/wiki/Perl_Programming/Unicode_UTF-8 and that page told me "DBD drivers must be cleverer than me". That may be true, if I am clever enough to ask them properly for decoding...
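
For comparison, here is a minimal sketch of what utf8::upgrade changes and what it leaves alone. The variable names $plain and $upgraded are only illustrative. It assumes a stock perl with the core Encode module, and it suggests that the upgrade affects the internal storage of the string rather than the bytes that encode() produces.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $plain    = chr(hex('00E9'));   # one character, U+00E9
my $upgraded = $plain;             # illustrative copy to upgrade
utf8::upgrade($upgraded);          # changes the internal storage only

# The internal UTF8 flag differs between the two copies...
printf "plain flagged:    %d\n", utf8::is_utf8($plain)    ? 1 : 0;
printf "upgraded flagged: %d\n", utf8::is_utf8($upgraded) ? 1 : 0;

# ...but they compare equal as strings and encode to the same bytes.
printf "same string: %s\n", ($plain eq $upgraded) ? 'yes' : 'no';
printf "same bytes:  %s\n",
    (encode('UTF-8', $plain) eq encode('UTF-8', $upgraded)) ? 'yes' : 'no';
```

If both "same" lines say yes on your perl, the replacement character in the earlier script most likely came from the decode('UTF-8', ...) step, not from a missing upgrade.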

I launch a separate perl process just to get the bytes of "é", but I guess there must be a more elegant way to get bytes. Tomorrow I should look into pack(). Good night.
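
A minimal sketch of two ways to get those bytes without starting a second perl, assuming only the core Encode module. The variable names are illustrative, and the pack() variant simply writes out the two UTF-8 bytes of U+00E9 by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Option 1: let Encode produce the bytes from the character.
my $from_encode = encode('UTF-8', chr(0xE9));

# Option 2: spell out the two UTF-8 bytes of U+00E9 with pack().
my $from_pack = pack('C*', 0xC3, 0xA9);

# Both are the same two-byte string, and both decode back to U+00E9.
printf "length in bytes: %d\n", length($from_encode);
printf "identical: %s\n", ($from_encode eq $from_pack) ? 'yes' : 'no';
printf "decoded code point: U+%04X\n", ord(decode('UTF-8', $from_pack));
```

Either string can stand in for the $byte that the backtick call produced in the first script.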

Re: "ISO-8859-1 0x80-0xFF" and chr()
by moritz (Cardinal) on Mar 23, 2012 at 12:45 UTC
    1. chr() returns characters, not bytes. (silly me)

    While "bytes" and "characters" are a useful mental model, the model is not always correct. The operation defines the context. For example, uc interprets a string as text no matter what, whereas print interprets a string as bytes (if it can).

    The real problem is that the byte 0xe9 cannot be decoded as UTF-8, because it isn't UTF-8. Either do nothing with it (which works on sufficiently modern perls), or decode it as Latin-1, because Latin-1 (aka ISO-8859-1) maps each byte exactly to the same codepoint number.
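
A minimal sketch of the difference described above, assuming only the core Encode module. The variable names are illustrative. It decodes the same single byte once as UTF-8 and once as Latin-1 and prints the resulting code points.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

my $byte = "\xE9";    # a single byte, as in the failing script

# Decoded as UTF-8, the lone byte is malformed and is replaced by U+FFFD.
my $as_utf8 = decode('UTF-8', $byte);

# Decoded as Latin-1, byte 0xE9 maps straight to code point U+00E9.
my $as_latin1 = decode('ISO-8859-1', $byte);

printf "UTF-8 decode:   U+%04X\n", ord($as_utf8);
printf "Latin-1 decode: U+%04X\n", ord($as_latin1);

# Encoding the Latin-1 result gives the two-byte UTF-8 form.
my $out = encode('UTF-8', $as_latin1);
printf "UTF-8 bytes (hex): %s\n", unpack('H*', $out);
```

The last line shows the hex of the bytes that would reach the terminal, which avoids depending on how the terminal itself is configured.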

    Note that instead of calling encode() on each output string, you can also set an IO layer which does it automatically:

    binmode STDOUT, ':encoding(UTF-8)';

    Or on the command line, you can set that up with the -C option:

    $ perl -CS -wE 'say chr hex "E9"'
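
A minimal sketch of the IO-layer approach. To keep the result inspectable it writes to an in-memory filehandle instead of STDOUT; that substitution and the names $buffer and $out are assumptions made for the illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $buffer = '';    # in-memory "file" that collects the raw output bytes

# The layer encodes every string printed to this handle.
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} chr(hex('E9'));    # print the character, no encode() call
close $out or die "close failed: $!";

# Show the bytes the layer actually wrote.
printf "bytes written (hex): %s\n", unpack('H*', $buffer);
```

With binmode STDOUT, ':encoding(UTF-8)' the same encoding step happens on standard output, so the explicit encode() call before print can be dropped.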

      Thanks for the reply, moritz.

      I was careless about the difference between "utf8" and "UTF-8" before I read that document. moritz seems to be a careful person. And the -CS option is very useful.

Re: "ISO-8859-1 0x80-0xFF" and chr()
by choroba (Canon) on Mar 23, 2012 at 12:28 UTC
    Where does your problematic string come from? If it comes from the code,
    use utf8;
    If the encoding is different, you can replace utf8 with encoding('iso-8859-2') etc.

    If the string comes from a filehandle,

    open my $FH, '<:utf8', ...
    If the encoding is different, you can replace utf8 with encoding(iso-8859-2) etc.
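
A minimal sketch of reading through a decoding layer, using an in-memory filehandle in place of a real file. The data and variable names are made up for the illustration. It uses ':encoding(UTF-8)', which as far as I know validates the input, where ':utf8' is the more lenient spelling.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bytes = "\xC3\xA9\n";    # UTF-8 bytes for U+00E9 plus a newline

# Reading through the layer decodes the bytes into characters.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "characters read: %d\n", length($line);
printf "code point: U+%04X\n", ord($line);
```

The string arrives already decoded, so no decode() call is needed on the data read from the handle.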

    If the string comes from DBI, your driver might support an encoding option (for example, Postgres's connect supports the pg_enable_utf8 attribute).

    And so on.

      Thanks for the reply.

      The problematic string came from chr(). I didn't know I could paste 'é' at PerlMonks, so I tried to create it with chr(hex()), and I stumbled.

      The OP of the thread "Bug in Template?" said he decodes with the database driver and prints through Template like this.


          my $t =Template->new();
          $t->process("his.tmpl", {lines=>\@vars}, "output.html" ) or die $t->error();

      Template wants encoded bytes, not decoded characters. This prints "#�#".

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Encode qw(decode encode);
          use Template;

          my($a,$decoded);

          #input bytes to $a
          $a=`perl -CS -e "use utf8;print 'é'"`;

          #decode it to character
          $decoded=decode('UTF-8', $a);

          #this will print replacement character to test_out1.html
          my $t=Template->new();
          $t->process("test.tmpl",{a=>$decoded},"test_out1.html");

      And below is the template for that.

          <html>
          <head>
          <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
          </head>
          <body>
          #[% a %]#
          </body>
          </html>

      Encoding the decoded string back to bytes works.

          #!/usr/bin/perl
          use strict;
          use warnings;
          use Encode qw(decode encode);
          use Template;

          my($a,$decoded,$encoded);

          #input bytes to $a
          $a=`perl -CS -e "use utf8;print 'é'"`;

          #decode it to character
          $decoded=decode('UTF-8', $a);
          $encoded=encode('UTF-8', $decoded);

          #this is good
          my $t=Template->new();
          $t->process("test.tmpl",{a=>$encoded},"test_out2.html");

      There seems to be huge confusion around the Template Toolkit's encoding problem here in Japan. My conclusion so far: "pass encoded bytes to Template, not decoded characters".

        Hmm

            $ perldoc template |grep -i utf8 -C2
            Alternately, the "binmode" argument can specify a particular IO layer
            such as ":utf8".

                $tt->process($infile, $vars, $outfile, binmode => ':utf8')
                    || die $tt->error(), "\n";
