Re^2: UTF8/Unicode Confusion

by jk2addict (Chaplain)
on Mar 20, 2005 at 17:24 UTC


in reply to Re: UTF8/Unicode Confusion
in thread UTF8/Unicode Confusion

Assuming I did the right thing...this is without any 'use utf8' or 'utf8::upgrade' magic.

-------------- 5.6.1 --------------
SV = PV(0x14045dc) at 0x1409e8c
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x142d9fc "\302\245"\0
  CUR = 2
  LEN = 3
-------------- 5.8.4 --------------
SV = PV(0x44c3d64) at 0x10590f4
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x450ab24 "\245"\0
  CUR = 1
  LEN = 2

This is after the utf8::upgrade call:

----------------------- 5.8.4 w/utf8::upgrade -----------------------
SV = PV(0x44f91dc) at 0x104d644
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x4518aa4 "\302\245"\0 [UTF8 "\x{a5}"]
  CUR = 2
  LEN = 3
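
The difference between the 5.8.4 dumps can also be checked without Devel::Peek. The following is a minimal sketch, assuming Perl 5.8.1 or later and using chr(0xa5) as a stand-in for whatever function produced the value above (the variable names are illustrative only). It suggests that utf8::upgrade flips the UTF8 flag but leaves the string's length and value alone:

```perl
use strict;
use warnings;

# chr(0xa5) is a stand-in for the function under discussion; on 5.8+ it
# returns a one-character string with the UTF8 flag off.
my $yen  = chr(0xa5);
my $copy = $yen;

my $flag_before = utf8::is_utf8($copy) ? 1 : 0;
utf8::upgrade($copy);    # changes the internal storage, not the value
my $flag_after  = utf8::is_utf8($copy) ? 1 : 0;

printf "flag before: %d, after: %d\n", $flag_before, $flag_after;
printf "length: %d, ord: 0x%x, equal to original: %s\n",
    length($copy), ord($copy), ($copy eq $yen ? "yes" : "no");
```

If the sketch behaves as expected, the flag goes from 0 to 1 while the length stays 1 and the copy still compares equal to the original.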

Replies are listed 'Best First'.
Re^3: UTF8/Unicode Confusion
by dave_the_m (Monsignor) on Mar 20, 2005 at 23:30 UTC
    Well, the Dump outputs show that the function is correctly returning the unicode character 0xa5; it's just that the internal encoding happens not to be utf8. Using utf8::upgrade gets round whatever problem you're having because it converts the internal representation.

    The problem must lie in how you're using the returned value. If for example you're just printing it to STDOUT, and if whatever's listening on STDOUT expects utf8 encoding (eg the terminal), then you need to let Perl know that any output on that file handle should be utf8 encoded, eg

    $ perl -e 'print chr 0xa5'|od -x
    0000000 00a5
    $ perl -e 'binmode(STDOUT, ":utf8"); print chr 0xa5'|od -x
    0000000 a5c2
    $

    See perluniintro (in 5.8.x) for more information.
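
The same comparison can be sketched inside a single script. This is only an illustration, assuming Perl 5.8 or later with in-memory filehandles; the variable names are made up for the example. It writes one character through a raw handle and through a handle with the :utf8 layer, then prints the bytes each one received:

```perl
use strict;
use warnings;

my $char = chr(0xa5);

# Raw handle: the character goes out as the single byte 0xa5.
open(my $raw, '>', \my $raw_bytes) or die "open: $!";
binmode($raw);
print {$raw} $char;
close($raw);

# Handle with the :utf8 layer: the character is encoded on output.
open(my $enc, '>', \my $utf8_bytes) or die "open: $!";
binmode($enc, ':utf8');
print {$enc} $char;
close($enc);

printf "raw:   %s\n", unpack('H*', $raw_bytes);
printf ":utf8: %s\n", unpack('H*', $utf8_bytes);
```

Note that od -x prints 16-bit words in host byte order, so the "a5c2" shown above corresponds to the byte sequence c2 a5 on a little-endian machine.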

    Dave.

      Well, the Dump outputs show that the function is correctly returning the unicode character 0xa5; it's just that the internal encoding happens not to be utf8

      It does? What am I missing about the second dump, the one from 5.8.4?

      -------------- 5.8.4 --------------
      SV = PV(0x44c3d64) at 0x10590f4
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK)
        PV = 0x450ab24 "\245"\0
        CUR = 1
        LEN = 2

      That looks like perl is tossing away half of the bytes long before I return it to any output. I don't think it's a problem with how the output is interpreted, just the fact that the output is half as wide as it should be (5.8.4 tossed away the missing \302).

        Perl is not tossing away half the bytes; perl will store characters either as one byte per character (making the character 0x00A5 be represented as "\245" aka "\xa5"), or in utf8 form, with 1-13 bytes per character (with 0x00A5 represented in two bytes, "\302\245"). Which kind of storage is used is indicated by the UTF8 flag, which you will see is on after the utf8::upgrade and off prior to it.
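
A rough sketch of the two storage forms, assuming Perl 5.8 or later: length() counts characters, while bytes::length() from the core bytes module counts the bytes actually stored. The variable names are illustrative only.

```perl
use strict;
use warnings;
require bytes;    # load bytes::length without enabling byte semantics

my $native   = chr(0xa5);    # one byte per character, UTF8 flag off
my $upgraded = $native;
utf8::upgrade($upgraded);    # same character, utf8 storage, UTF8 flag on

printf "native:   %d char, %d byte(s) stored\n",
    length($native), bytes::length($native);
printf "upgraded: %d char, %d byte(s) stored\n",
    length($upgraded), bytes::length($upgraded);

# Converting back restores the one-byte form without changing the value.
utf8::downgrade($upgraded);
printf "after downgrade: %d byte(s) stored, flag %s\n",
    bytes::length($upgraded), (utf8::is_utf8($upgraded) ? 'on' : 'off');
```

Both strings should report one character throughout; only the stored byte count and the flag are expected to differ.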

        If you have an output filehandle that you want to receive only the utf8 encoding, use binmode as suggested above or perl's -C switch (see perlrun).
