http://www.perlmonks.org?node_id=216504

JayBonci has asked for the wisdom of the Perl Monks concerning the following question:

Good day everyone. I've run into a problem with Unicode characters with perl5.6.1.

I'm trying to escape arbitrary non-ASCII characters that people give me through various web forms. These are in turn passed back to the browser inside various XML constructs, and inside those constructs I need UTF-8 compliant output.

For instance, I'd like to make õ become &#245;
I figure the best way to do that would be to pack the string. Take this code snippet for example:
#!/usr/bin/perl -w
use strict;

my $foo = "abcdefghijklmnopqrstuvwxyz[]ö÷";
my $outstr = "";

foreach my $char (split("", $foo))
{
    if ((my $num = unpack("U", $char)) < 125)
    {
        $outstr .= $char;
    }
    else
    {
        $outstr .= "\&\#$num\;";
    }
}

print $outstr;
Under perl5.6.x, I receive:
jaybonci@willowisp:~/perl$ ./pack.pl
Malformed UTF-8 character (1 byte, need 4) in unpack at ./pack.pl line 9.
Malformed UTF-8 character (1 byte, need 4) in unpack at ./pack.pl line 9.
However, under perl 5.8, I receive the proper output:
abcdefghijklmnopqrstuvwxyz[]&#246;&#247;


perldelta mentions changes and improved support for Unicode in perl5.8.0, but I don't see how the "U" template of pack would work at all under 5.6.1.
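One plausible reading of the warning, assuming the string literal holds the single bytes 0xF6 and 0xF7 (ö and ÷ in ISO-8859-1): if unpack treats the one-byte string as UTF-8, then either byte looks like the first byte of a four-byte sequence, which would match "1 byte, need 4". The sketch below only illustrates the UTF-8 lead-byte rule; `utf8_expected_length` is a hypothetical helper, not anything from perl itself.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: how many bytes a UTF-8 sequence should have,
# judged from its first byte alone (0 means "not a valid lead byte").
sub utf8_expected_length {
    my ($lead) = @_;
    return 1 if $lead < 0x80;    # 0xxxxxxx: plain ASCII
    return 0 if $lead < 0xC0;    # 10xxxxxx: continuation byte, never a lead
    return 2 if $lead < 0xE0;    # 110xxxxx
    return 3 if $lead < 0xF0;    # 1110xxxx
    return 4 if $lead < 0xF8;    # 11110xxx
    return 0;                    # 0xF8 and above are not used by UTF-8
}

# 'a', then the two high bytes from the test string
foreach my $byte (0x61, 0xF6, 0xF7) {
    printf "0x%02X -> %d byte(s)\n", $byte, utf8_expected_length($byte);
}
```

Run as is, this reports 1 byte for 0x61 and 4 bytes for both 0xF6 and 0xF7.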

Griping aside, my first instinct would be to pack the characters out to 32 bits apiece (I think that is what the warning is getting at). It also occurs to me that the code above sort of works if you use the "C" template instead, but only for limited use. With the euro symbol (€ or &#8364;), neither template produces the right value on either platform.
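A minimal sketch of the "C" variant, assuming the input is single-byte ISO-8859-1 data; `escape_latin1` is a hypothetical name used only for illustration. Every byte above 127 becomes a decimal character reference and everything else is copied through.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: escape each byte above 127 as a decimal numeric
# character reference. Assumes one byte per character (ISO-8859-1).
sub escape_latin1 {
    my ($bytes) = @_;
    my $out = "";
    foreach my $byte (unpack("C*", $bytes)) {
        $out .= $byte < 128 ? chr($byte) : "&#$byte;";
    }
    return $out;
}

print escape_latin1("abc[]\xF6\xF7"), "\n";   # prints abc[]&#246;&#247;
```

This is why the approach is limited: the euro sign is U+20AC, which does not fit in one byte. As Windows-1252 input it arrives as the byte 0x80 and would come out as &#128;, and as UTF-8 input it arrives as the three bytes E2 82 AC and would come out as three separate references.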

So my questions are:
  1. Is this a perl version / build setting problem?
  2. The pack perldoc claims that the "U" template is independent of any use utf8; setting. Is there a version-independent pack/buffer/unpack/repack scheme that, given a character, can tell whether it is upper ASCII or not?
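One version-independent scheme for the second question can be sketched without the "U" template at all, assuming the form data arrives as well-formed UTF-8 bytes: read the bytes with "C*" and rebuild each code point by hand. `escape_utf8_bytes` is a hypothetical helper and does no validation, so stray or truncated sequences produce wrong numbers rather than errors.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: decode well-formed UTF-8 bytes by hand and escape
# every code point above 127. Uses only "C*" and bit arithmetic.
sub escape_utf8_bytes {
    my ($bytes) = @_;
    my @b   = unpack("C*", $bytes);
    my $out = "";
    while (@b) {
        my $lead = shift @b;
        if ($lead < 0x80) {              # plain ASCII, copy through
            $out .= chr($lead);
            next;
        }
        # The lead byte's high bits give the number of continuation bytes
        # and how many of its own low bits belong to the code point.
        my ($extra, $cp) = $lead >= 0xF0 ? (3, $lead & 0x07)
                         : $lead >= 0xE0 ? (2, $lead & 0x0F)
                         :                 (1, $lead & 0x1F);
        # Each continuation byte contributes its low six bits.
        for (1 .. $extra) {
            my $next = shift @b;
            $next = 0 unless defined $next;   # truncated input: pad with 0
            $cp = ($cp << 6) | ($next & 0x3F);
        }
        $out .= "&#$cp;";
    }
    return $out;
}

# 'a', then ö (C3 B6), then the euro sign (E2 82 AC)
print escape_utf8_bytes("a\xC3\xB6\xE2\x82\xAC"), "\n";   # prints a&#246;&#8364;
```

If the input is single-byte Latin-1 rather than UTF-8, this decoder is the wrong tool, so the encoding of the incoming form data has to be known first.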


Thanks a bunch for any help. I'm beating my head against this one.

    --jb