PerlMonks  

Convert hex to UTF-8

by frevo (Initiate)
on Sep 27, 2008 at 01:12 UTC (#713973=perlquestion)
frevo has asked for the wisdom of the Perl Monks concerning the following question:

I need to scan source code in a 7-bit ASCII file and convert hex encodings of Unicode code points into UTF-8 characters. The output should be a correctly encoded UTF-8 string if it contains any code points > 127. The following snippet does not work. Strings 1 and 4 are OK. String 2 is not converted correctly, yet String 3 is! Code points in the range 128..255 are not converted correctly unless there is a code point > 255 in the same string.

Try it out. The "Dump" statement shows that String 2 is not UTF-8 and the hex characters have not been encoded as UTF-8. You may want to view STDOUT in a UTF-8-aware viewer. PerlMonks' filtering makes it look funny if I include it here.

I know Perl handles 128..255 a little differently, but there must be some workaround.

I have tried variations on Encode::decode and utf8::upgrade to no avail. Any suggestions how to convert the 128..255 characters?

use Devel::Peek;

my @strings = (
    'Panic Button',                             # String 1
    'Bot\U00F3n de P\U00E1nico',                # String 2
    'Bot\U00F3n de P\U00E1nico\U200B',          # String 3
    '\U041a\U043d\U043e\U043f\U043a\U0430'
    . ' \U043f\U0430\U043d\U0438\U043a\U0438',  # String 4
);

for my $string (@strings){
    print STDERR qq(\n$string\n);
    $string =~ s~
        \\U
        (
        [0-9a-fA-F]{4,4}
        )
    ~
        chr(hex "0x$1");
    ~gex;
    print "$string\n";
    Dump $string;
}

Re: Convert hex to UTF-8
by JavaFan (Canon) on Sep 27, 2008 at 02:11 UTC
    utf8::upgrade ought to do the trick. Care to share some code where utf8::upgrade doesn't upgrade the string to UTF-8 format?
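For anyone following along, here is a minimal sketch of what utf8::upgrade does and does not change. The unescape helper is a hypothetical name introduced only for this illustration; it is not from the original post. The sketch assumes the escapes are always exactly four hex digits, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): replace each \UXXXX
# escape with the character that has that code point.
sub unescape {
    my ($text) = @_;
    $text =~ s/\\U([0-9a-fA-F]{4})/chr(hex($1))/ge;
    return $text;
}

my $string = unescape('Bot\U00F3n de P\U00E1nico');

# utf8::is_utf8 reports the internal storage flag, not the content.
my $before = utf8::is_utf8($string) ? 1 : 0;

my $copy = $string;
utf8::upgrade($copy);    # changes the internal storage only
my $after = utf8::is_utf8($copy) ? 1 : 0;

printf "length=%d flag before=%d after=%d same=%s\n",
    length($string), $before, $after, ($string eq $copy ? 'yes' : 'no');
```

The two strings compare equal and have the same length in characters; only the internal flag differs. That is why upgrading alone may not change what ends up on STDOUT.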
Re: Convert hex to UTF-8
by massa (Hermit) on Sep 27, 2008 at 02:17 UTC
    $string = decode_utf8 encode_utf8 $string;
    (right before the print "$string\n";) worked for me. (don't forget to use Encode;)

    Update:
    utf8::upgrade $string;
    also worked for me (Dumped before and after the upgrade):
    Panic Button
    SV = PV(0x8154b00) at 0x8153c28
      REFCNT = 2
      FLAGS = (POK,pPOK)
      PV = 0x816ffd8 "Panic Button"\0
      CUR = 12
      LEN = 16
    SV = PV(0x8154b00) at 0x8153c28
      REFCNT = 2
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x816ffd8 "Panic Button"\0 [UTF8 "Panic Button"]
      CUR = 12
      LEN = 16

    Bot\U00F3n de P\U00E1nico
    SV = PVMG(0x81a8af0) at 0x8153d48
      REFCNT = 2
      FLAGS = (SMG,POK,pPOK)
      IV = 0
      NV = 0
      PV = 0x820ad90 "Bot\363n de P\341nico"\0
      CUR = 15
      LEN = 16
      MAGIC = 0x820fb98
        MG_VIRTUAL = &PL_vtbl_mglob
        MG_TYPE = PERL_MAGIC_regex_global(g)
        MG_LEN = -1
    SV = PVMG(0x81a8af0) at 0x8153d48
      REFCNT = 2
      FLAGS = (SMG,POK,pPOK,UTF8)
      IV = 0
      NV = 0
      PV = 0x81c5ee8 "Bot\303\263n de P\303\241nico"\0 [UTF8 "Bot\x{f3}n de P\x{e1}nico"]
      CUR = 17
      LEN = 18
      MAGIC = 0x81b1d50
        MG_VIRTUAL = &PL_vtbl_utf8
        MG_TYPE = PERL_MAGIC_utf8(w)
        MG_LEN = 15
      MAGIC = 0x820fb98
        MG_VIRTUAL = &PL_vtbl_mglob
        MG_TYPE = PERL_MAGIC_regex_global(g)
        MG_LEN = -1
    []s, HTH, Massa (κς,πμ,πλ)
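A small sketch of the encode/decode round trip mentioned in this reply, using only encode_utf8 and decode_utf8 from Encode. The variable names are made up for the illustration. It shows the character count versus the octet count for String 2, which lines up with the CUR values (15 and 17) in the Dump output.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $chars  = "Bot\x{f3}n de P\x{e1}nico";   # 15 characters, all below 256
my $octets = encode_utf8($chars);            # UTF-8 encoded bytes
my $round  = decode_utf8($octets);           # decoded back to characters

printf "chars=%d octets=%d round=%d equal=%s\n",
    length($chars), length($octets), length($round),
    ($round eq $chars ? 'yes' : 'no');
# prints: chars=15 octets=17 round=15 equal=yes
```

The round trip leaves the characters unchanged. What differs afterwards is the internal storage, which is the same effect utf8::upgrade has.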
Re: Convert hex to UTF-8
by ikegami (Pope) on Sep 27, 2008 at 04:05 UTC

    The output should be a correctly encoded UTF-8

    That's got nothing to do with how the string is stored internally.

    You haven't shown where you encode the output, so I presumed you didn't. That's a bug. Don't you see the warnings? Once I add any of the following, it correctly outputs UTF-8.

    use open ':std', ':utf8';
    use open ':std', ':locale'; # If you use a UTF-8 locale
    binmode STDOUT, ':utf8'; binmode STDERR, ':utf8';
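To illustrate the point about encoding on output, here is a hedged sketch that prints the same text through two different layers into an in-memory filehandle and compares the bytes. The bytes_written helper and the variable names are assumptions made up for this example. It uses an explicit ':encoding(UTF-8)' layer rather than ':utf8'.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle opened with the
# given layer, and return the raw bytes that were written.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open failed: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return $buffer;
}

sub hex_dump { return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $_[0] }

my $native   = "Bot\x{f3}n";     # stored as single bytes internally
my $upgraded = $native;
utf8::upgrade($upgraded);        # same characters, UTF-8 internal storage

# No encoding layer: characters below 256 go out as single bytes.
my $raw_native   = bytes_written(':raw', $native);
my $raw_upgraded = bytes_written(':raw', $upgraded);

# With an encoding layer: the output is UTF-8 in both cases.
my $enc_native   = bytes_written(':encoding(UTF-8)', $native);
my $enc_upgraded = bytes_written(':encoding(UTF-8)', $upgraded);

print "raw:     ", hex_dump($raw_native), "\n";   # 42 6f 74 f3 6e
print "encoded: ", hex_dump($enc_native), "\n";   # 42 6f 74 c3 b3 6e
```

In this sketch the output depends on the layer, not on whether the string was upgraded first. That would explain why String 2 comes out wrong without a layer.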
Re: Convert hex to UTF-8
by Anonymous Monk on Feb 10, 2014 at 10:15 UTC
    \u0438
