PerlMonks  

Problems with packing upper ASCII - differences across perl versions

by JayBonci (Curate)
on Nov 29, 2002 at 11:48 UTC (#216504)
JayBonci has asked for the wisdom of the Perl Monks concerning the following question:

Good day everyone. I've run into a problem handling Unicode characters under perl 5.6.1.

I'm trying to escape arbitrary characters, including upper ASCII, that people give me through various web forms. These are in turn passed back to the browser inside various XML constructs, and inside those constructs everything needs to be valid UTF-8.

For instance, I'd like to make:
õ become &#245;
I figure the best way to do that would be to pack the string. Take this code snippet for example:
    #!/usr/bin/perl -w

    use strict;

    # The last two characters are a literal ö and ÷ (bytes 0xF6 and 0xF7 in the file).
    my $foo = "abcdefghijklmnopqrstuvwxyz[]ö÷";
    my $outstr = "";
    foreach my $char (split("", $foo)) {
        if ((my $num = unpack("U", $char)) < 125) {
            $outstr .= $char;
        } else {
            $outstr .= "\&\#$num\;";
        }
    }

    print $outstr;
Under perl5.6.x, I receive:
    jaybonci@willowisp:~/perl$ ./pack.pl
    Malformed UTF-8 character (1 byte, need 4) in unpack at ./pack.pl line 9.
    Malformed UTF-8 character (1 byte, need 4) in unpack at ./pack.pl line 9.
However, under perl 5.8, I receive the proper output:
abcdefghijklmnopqrstuvwxyz[]&#246;&#247;


perldelta mentions changes to and improved support for Unicode in perl 5.8.0, but I don't see how the "U" template of pack would work at all under 5.6.1.

Griping aside, my first instinct would be to pad each character out to 32 bits (I think that is what the warning is getting at). It also occurs to me that the code above sort of works if you unpack with "C", but only for limited use. With the euro symbol (€ or &#8364;), neither template seems to lay the bits out the right way under either perl version.
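For reference, the number in the entity has to be the Unicode code point rather than a byte value, which is why a byte-oriented unpack cannot produce 8364 for the euro sign. Here is a minimal sketch, assuming Perl 5.8 or later (where the Encode module is in core) and assuming the form data arrives as either Windows-1252 or UTF-8 bytes. The helper name entity_for is made up for illustration. Latin-1 proper has no euro sign at all.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(decode);   # core module from Perl 5.8 on (assumption: not 5.6)

# Hypothetical helper: decode raw bytes using the named encoding,
# then return a numeric entity for the first decoded character.
sub entity_for {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    return "&#" . ord($chars) . ";";
}

# The euro sign is a single byte (0x80) in Windows-1252 ...
print entity_for("cp1252", "\x80"), "\n";
# ... and three bytes (0xE2 0x82 0xAC) in UTF-8.
print entity_for("UTF-8", "\xE2\x82\xAC"), "\n";
```

Both calls print &#8364;, because the bytes are decoded to the same character before ord() is applied.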

So my questions are:
  1. Is this a perl version / build setting problem?
  2. The pack perldoc claims that the "U" template is independent of any use utf8; setting. Is there a version-independent pack/buffer/unpack/repack scheme that, given a character, could tell you whether it is upper ASCII or not?


Thanks a bunch for any help. I'm beating my head against this one.

    --jb

Re: Problems with packing upper ASCII - differences across perl versions
by Monky Python (Scribe) on Nov 29, 2002 at 12:26 UTC
    Hi,
    it seems to work for me under perl 5.6 if I add
    use utf8;
    to the code. Try
    perldoc utf8

    MP

      Doesn't work for me.
      Malformed UTF-8 character (unexpected non-continuation byte 0xf7 after start byte 0xf6) at ./pack.pl line 6.
      Malformed UTF-8 character (1 byte, need 4) at ./pack.pl line 6.
      Unrecognized character \xB7 at ./pack.pl line 6.
      The perl 5.6.1 pack perldoc page explicitly says it shouldn't matter. This is the stock Debian perl 5.6.1.

          --jb
Re: Problems with packing upper ASCII - differences across perl versions
by John M. Dlugosz (Monsignor) on Nov 29, 2002 at 17:40 UTC
    From the error in your reply, it looks like the input is not UTF-8 at all, but an 8-bit character set. I think you are trying to escape Latin-1. We covered this in a thread a couple of days ago.

    Depending on how Perl treats the string, split is going to see either bytes or characters, and that's what's different between 5.6 and 5.8.
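To illustrate the bytes-versus-characters distinction, here is a small sketch, assuming Perl 5.8 or later for the Encode module. The helper count_pieces and the sample data are invented for the example and are not the original poster's input, which appears to be Latin-1.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(decode);   # core module from Perl 5.8 on (assumption: not 5.6)

# Hypothetical helper: how many pieces does split("") produce?
sub count_pieces {
    my ($string) = @_;
    my @pieces = split("", $string);
    return scalar @pieces;
}

my $bytes = "\xC3\xB6\xC3\xB7";        # UTF-8 encoding of two characters, as raw bytes
my $chars = decode("UTF-8", $bytes);   # the same text decoded into characters

print count_pieces($bytes), "\n";      # one piece per byte
print count_pieces($chars), "\n";      # one piece per character
```

The same text yields four pieces as a byte string and two as a character string, so any per-piece unpack sees different input in the two cases.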

    Try using foreach my $char (unpack ("C*", $foo)) (I think that's the right unpack syntax) to force byte separation of the input, regardless of how the string is tagged (byte or utf8). Then use the if statement you have now as the body of the loop.
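The suggestion above could be sketched roughly as follows, assuming the input really is a plain Latin-1 byte string. With unpack("C*", ...) the loop variable holds each byte's number rather than the character, so the comparison can be made directly and the inner unpack("U", ...) is no longer needed. The helper name escape_latin1 and the cutoff of 128 are choices made for this example, not part of the original code.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: treat the argument as a string of Latin-1 bytes and
# replace every byte above 127 with a numeric character reference.
sub escape_latin1 {
    my ($in) = @_;
    my $out = "";
    # unpack "C*" returns one unsigned 8-bit number per byte of a byte string
    foreach my $num (unpack("C*", $in)) {
        $out .= $num < 128 ? chr($num) : "&#$num;";
    }
    return $out;
}

# \xF6 and \xF7 are the Latin-1 bytes for the two non-ASCII characters in $foo
print escape_latin1("abcdefghijklmnopqrstuvwxyz[]\xF6\xF7"), "\n";
```

This works only because Latin-1 byte values coincide with Unicode code points 0 to 255. It does not handle the euro sign or a string that Perl has tagged as utf8, where the behaviour of "C" differs between perl versions.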
