
Example of perluniintro

by remiah (Hermit)
on Aug 18, 2012 at 03:07 UTC
remiah has asked for the wisdom of the Perl Monks concerning the following question:

I read perluniintro and tried the examples under "How Do I Convert Binary Data Into a Particular Encoding, Or Vice Versa?" with a multibyte character. I think the author is confusing "native string" with "bytes". The examples make sense for characters with code points below 128, but when I tried a letter like 'HIRAGANA LETTER A', they stopped making sense. In short, the examples seem to me to forget the step of encoding to bytes.

"A" is 0x41 as a byte and 0x41 as a code point.
"HIRAGANA LETTER A" is 0xe3, 0x81, 0x82 as UTF-8 bytes and 0x3042 as a code point.

    #hex dump of A
    #00000000  41        |A|
    #00000001

    #hex dump of HIRAGANA LETTER A
    #00000000  e3 81 82  |...|
    #00000003
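
As a minimal sketch of the distinction above, the following shows one way to get from the code point to the bytes in the hex dump. It assumes UTF-8 is the encoding in question; the variable names are invented for this illustration and do not come from perluniintro.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character: HIRAGANA LETTER A, written by its code point.
my $char = "\x{3042}";

# Encoding turns the character string into a byte string (UTF-8 assumed here).
my $bytes = encode('UTF-8', $char);

# The code point of the character, and the individual bytes of its encoding.
my $code_point_hex = sprintf '%X', ord $char;
my $byte_hex       = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;

# Decoding the bytes gives back the original one-character string.
my $round_trip = decode('UTF-8', $bytes);

printf "code point: U+%s\n", $code_point_hex;
printf "characters: %d, bytes: %d\n", length $char, length $bytes;
print  "bytes: $byte_hex\n";
```

Here unpack 'C*' is applied to the already encoded byte string, so every value it returns fits in one byte.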

Here are two code examples.

    # Example 1: the "native string" may not be a native string
    # Code: $native_string = pack('W*', unpack('U*', $unicode_string));
    use strict;
    use warnings;
    use Encode qw(encode);
    use Devel::Peek;
    use 5.012;

    my ($code_point, $unicode_string, $native_string, $native_string2);

    $code_point     = 0x41;    # "A"
    $unicode_string = pack('U*', $code_point);
    $native_string  = pack('W*', unpack('U*', $unicode_string));
    Dump $unicode_string;
    Dump $native_string;       # ==> here it is not UTF-8 flagged

    $code_point      = 0x3042; # HIRAGANA LETTER A
    $unicode_string  = pack('U*', $code_point);
    $native_string   = pack('W*', unpack('U*', $unicode_string));
    $native_string2  = Encode::encode('utf8', $unicode_string);
    Dump $unicode_string;
    Dump $native_string;       # ==> this is UTF8 flagged; maybe transparently upgraded because code point > 255
    Dump $native_string2;

For HIRAGANA LETTER A, Devel::Peek shows that $native_string is UTF8 flagged and $native_string2 is not.
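
The same observation can be checked without reading a Devel::Peek dump. This is only a sketch: $via_pack_w and $via_encode are names made up here for the two strings from Example 1, and utf8::is_utf8 reports Perl's internal flag, which is an implementation detail, not a statement about what the string means.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $unicode_string = pack 'U*', 0x3042;    # HIRAGANA LETTER A

# The perluniintro-style conversion: still one character, code point 0x3042.
my $via_pack_w = pack 'W*', unpack 'U*', $unicode_string;

# Explicit encoding: three separate bytes.
my $via_encode = encode('UTF-8', $unicode_string);

for my $pair ([ 'pack W*' => $via_pack_w ], [ 'encode' => $via_encode ]) {
    my ($label, $string) = @$pair;
    printf "%-8s length=%d utf8_flag=%s\n",
        $label, length $string, utf8::is_utf8($string) ? 'on' : 'off';
}
```

Comparing length is usually the more useful check: a length of 1 means the string still holds one character, while a length of 3 means it holds the encoded bytes.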

    # Example 2: these are not bytes, but an array of code points
    # Code: @bytes = unpack("C*", $unicode_string);
    use strict;
    use warnings;
    use Encode qw(encode);
    use 5.012;

    my ($code_point, $unicode_string, @bytes);

    $code_point     = 0x41;    # A
    $unicode_string = pack('U*', $code_point);
    @bytes          = unpack("C*", $unicode_string);
    print join('|', @bytes), "\n";

    $code_point     = 0x3042;  # HIRAGANA LETTER A
    $unicode_string = pack('U*', $code_point);
    @bytes          = unpack("C*", $unicode_string);
    print join('|', @bytes), "\n";    # ==> these are not bytes, but an array of code points

    $code_point     = 0x3042;  # HIRAGANA LETTER A
    $unicode_string = pack('U*', $code_point);
    @bytes          = map { sprintf("%X", $_) } unpack("C*", Encode::encode('utf8', $unicode_string));
    print join('|', @bytes), "\n";

So I would like to hear from the monks: suggestions, comments, or a "read this document" pointer, anything is welcome. I am now reading perlunicode.


Replies are listed 'Best First'.
Re: Example of perluniintro
by Anonymous Monk on Aug 18, 2012 at 03:39 UTC
      Thank you for the reply.

      I am looking for confirmation: either the author of perluniintro forgot to encode characters to bytes, or I am missing something. What do you think?

        Whether the author of perluniintro forgot to encode characters to bytes, or I am missing something. What do you think?

        I don't think the author forgot anything, but I'm not sure what you think the author forgot.

        Consider these three lines of output. Do you see something wrong with them?

            #!/usr/bin/perl --
            use strict;
            use warnings;
            use Data::Dump;

            my $code_point = 0x3042;    # HIRAGANA LETTER A aka 12354
            my $unicode_string = pack('U*', $code_point);
            dd 12354 => pack('U*', 12354);
            dd "UNSIGNED CHARS(W*) ", pack "W*", unpack "U*", $unicode_string.$unicode_string;
            dd "UNSIGNED OCTETS(C*) ", unpack "C*", $unicode_string.$unicode_string;
            __END__
            (12354, "\x{3042}")
            ("UNSIGNED CHARS(W*) ", "\x{3042}\x{3042}")
            ("UNSIGNED OCTETS(C*) ", 12354, 12354)
