PerlMonks  

Re^5: Example of perluniintro

by Anonymous Monk
on Aug 18, 2012 at 07:22 UTC ( [id://988167]=note )


in reply to Re^4: Example of perluniintro
in thread Example of perluniintro

So, why could "C" values become greater than 255? This seems strange...

It's all strange to me, I'm not joking.

From http://perldoc.perl.org/5.14.1/functions/pack.html

Pack and unpack can operate in two modes: character mode (C0 mode) where the packed string is processed per character, and UTF-8 mode (U0 mode) where the packed string is processed in its UTF-8-encoded Unicode form on a byte-by-byte basis. Character mode is the default unless the format string starts with U. You can always switch mode mid-format with an explicit C0 or U0 in the format. This mode remains in effect until the next mode change, or until the end of the () group it (directly) applies to.

Using C0 to get Unicode characters while using U0 to get non-Unicode bytes is not necessarily obvious. Probably only the first of these is what you want:

...

Those examples also illustrate that you should not try to use pack/unpack as a substitute for the Encode module.
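For comparison, here is a minimal sketch of the Encode route that the documentation recommends. The helper name utf8_octets is made up for illustration; encode() and the 'UTF-8' encoding name come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 octets of a character string
# as a list of integers.
sub utf8_octets {
    my ($string) = @_;
    # encode() returns a byte string, so a plain "C*" sees one octet per value.
    return unpack 'C*', encode('UTF-8', $string);
}

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A
print join('|', map { sprintf '%X', $_ } utf8_octets($hiragana_a)), "\n";    # E3|81|82
```

Because the encoding is named explicitly, this does not depend on which pack/unpack mode is in effect.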

So, trying that, I get:

use Data::Dump qw(dd);
$unicode_string = pack('U*', 0x3042);  # assumed: HIRAGANA LETTER A, matching the 12354 in the output
dd "UNSIGNED OCTETS(C*) ", unpack "C0C*", $unicode_string.$unicode_string;
dd "UNSIGNED OCTETS(C*) ", unpack "U0C*", $unicode_string.$unicode_string;
__END__
("UNSIGNED OCTETS(C*) ", 12354, 12354)
("UNSIGNED OCTETS(C*) ", 227, 129, 130, 227, 129, 130)

So, yes, I think I agree, it's a mistake, in that it should probably say: You can find the bytes that make up a UTF-8 sequence with:

@bytes = unpack("U0C*", $Unicode_string);

And this seems to confirm that

$code_point = 0x3042;  # HIRAGANA LETTER A
$unicode_string = pack('U*', $code_point);
@bytes = map { sprintf("%X", $_) } unpack("U0C*", $unicode_string);
print join('|', @bytes), "\n";
__END__
E3|81|82
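As a cross-check, the U0C* result can be compared against the Encode module for a few more code points. This is only a sketch: the helper name same_octets and the choice of test code points are assumptions, not something from perluniintro.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if unpack "U0C*" and Encode agree on the
# UTF-8 octets of the single character with the given code point.
sub same_octets {
    my ($code_point) = @_;
    my $string     = pack 'U*', $code_point;
    my @via_unpack = unpack 'U0C*', $string;
    my @via_encode = unpack 'C*', encode('UTF-8', $string);
    return "@via_unpack" eq "@via_encode";
}

# One ASCII, one Latin-1, one BMP, and one non-BMP code point.
for my $code_point (0x41, 0xE9, 0x3042, 0x1F600) {
    printf "U+%04X: %s\n", $code_point,
        same_octets($code_point) ? 'same' : 'different';
}
```

If the two ever disagreed for some input, the Encode result would be the one to trust, per the pack documentation quoted above.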

Update: another part of perluniintro says:

One way of peeking inside the internal encoding of Unicode characters is to use unpack("C*", ... to get the bytes of whatever the string encoding happens to be, or unpack("U0..", ...) to get the bytes of the UTF-8 encoding:

So yeah, whatever perl's actual internal format is (and we shouldn't care about it), it is not utf8, and if you want the UTF-8 bytes you need U0C*; otherwise (it looks like) you get IV bytes.
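The character view and the byte view can also be told apart without pack/unpack at all. A small sketch, assuming the same HIRAGANA LETTER A string; the helper name char_and_octet_counts is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return (number of characters, number of UTF-8 octets).
sub char_and_octet_counts {
    my ($string) = @_;
    my $octets = encode('UTF-8', $string);    # byte string holding the UTF-8 form
    return (length $string, length $octets);
}

my $string = "\x{3042}\x{3042}";    # two HIRAGANA LETTER A characters
my ($chars, $octets) = char_and_octet_counts($string);
printf "characters: %d\n", $chars;              # 2
printf "octets:     %d\n", $octets;             # 6
printf "first char: U+%04X\n", ord $string;     # U+3042
```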

Re^6: Example of perluniintro
by remiah (Hermit) on Aug 18, 2012 at 08:09 UTC
    Thanks! Thank you for your reply!

    This code and your explanation are what I was looking for.

    @bytes = unpack("U0C*", $Unicode_string);
    In perluniintro, the C0 and U0 prefixes were mentioned several times, but I didn't understand them without an explanation like yours.

    It seems you saved me from confusion and from piles of printed pod papers.

    I would like to keep reading the Unicode and pack documentation.
    Again, thanks for your patience.
