Good day everyone. I've run into a problem with Unicode characters with perl5.6.1.

I'm trying to escape arbitrary characters (upper ASCII included) that people give me through various web forms. These are in turn passed back to the browser inside various XML constructs, and inside those constructs I need UTF-8 compliant output.

For instance, I'd like to make:
õ become &#245;
I figure the best way to do that would be to pack the string. Take this code snippet for example:
#!/usr/bin/perl -w

use strict;

my $foo = "abcdefghijklmnopqrstuvwxyz[]ö÷";
my $outstr = "";
foreach my $char (split("", $foo))
{
    if ((my $num = unpack("U", $char)) < 125)
    {
        $outstr .= $char;
    }
    else
    {
        $outstr .= "\&\#$num\;";
    }
}
print $outstr;
Under perl5.6.x, I receive:
jaybonci@willowisp:~/perl$ ./pack.pl
Malformed UTF-8 character (1 byte, need 4) in unpack at ./pack.pl line 9.
Malformed UTF-8 character (1 byte, need 4) in unpack at ./pack.pl line 9.
However, under perl 5.8, I receive the proper output:
abcdefghijklmnopqrstuvwxyz[]&#246;&#247;


Checking perldelta, I see that it mentions changes and improved support for Unicode in perl 5.8.0, but I don't see how the "U" template of pack would even work under 5.6.1.

Griping aside, my first idea for solving this would be to pack the characters out to 32 bits apiece (I think that is what the warning is getting at). It also occurs to me that the code above sort of works if you unpack with "C", but only for limited use. With the euro symbol (€ or &#8364;), neither template seems to pad the bits out the right way on either perl version.
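A minimal sketch of the "C" variant mentioned above, assuming the input is a plain byte string in a single-byte encoding such as Latin-1. The sub name escape_high_bytes is made up for illustration. It also shows why the approach is limited: "C" only ever yields values from 0 to 255, so a character like the euro (8364) can never come out of it as a single number.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every byte >= 128 as a decimal numeric
# character reference. Assumes one byte per character (e.g. Latin-1).
sub escape_high_bytes {
    my ($str) = @_;
    my $out = '';
    foreach my $byte (unpack('C*', $str)) {
        # "C" gives an unsigned 8-bit value, so $byte is always 0..255.
        $out .= $byte < 128 ? chr($byte) : "&#$byte;";
    }
    return $out;
}

print escape_high_bytes("abc[]\xF6\xF7"), "\n";   # prints abc[]&#246;&#247;
```

Because the template is byte-oriented, this does not depend on how a given perl interprets "U".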

So my questions are:
  1. Is this a perl version / build setting problem?
  2. The pack perldoc claims that the "U" template is independent of any use utf8; stuff. Is there a version-independent pack/buffer/unpack/repack scheme that, given a character, could tell you whether it was upper ASCII or not?
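
To make the second question concrete, here is a rough sketch of one possible version-independent scheme. It assumes the form data arrives as raw UTF-8 bytes (an assumption; the post does not say which encoding the browser sends) and decodes the sequences by hand with the byte-oriented "C" template instead of "U". The name utf8_bytes_to_ncr is hypothetical, and the decoder does no validation beyond checking continuation bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a string of UTF-8 *bytes* into ASCII plus
# decimal numeric character references, without using the "U" template.
sub utf8_bytes_to_ncr {
    my ($bytes) = @_;
    my @b   = unpack('C*', $bytes);
    my $out = '';
    while (@b) {
        my $lead = shift @b;
        if ($lead < 0x80) {            # plain ASCII passes through
            $out .= chr($lead);
            next;
        }
        # The lead byte says how many continuation bytes follow and
        # carries the top bits of the code point.
        my ($extra, $cp) =
              $lead >= 0xF0 ? (3, $lead & 0x07)
            : $lead >= 0xE0 ? (2, $lead & 0x0F)
            : $lead >= 0xC0 ? (1, $lead & 0x1F)
            :                 (0, $lead);   # stray continuation byte: use its raw value
        for (1 .. $extra) {
            # Stop early if the sequence is truncated or malformed.
            last unless @b && ($b[0] & 0xC0) == 0x80;
            $cp = ($cp << 6) | (shift(@b) & 0x3F);
        }
        $out .= "&#$cp;";
    }
    return $out;
}

# "a", then o-umlaut (C3 B6), then the euro sign (E2 82 AC)
print utf8_bytes_to_ncr("a\xC3\xB6\xE2\x82\xAC"), "\n";   # prints a&#246;&#8364;
```

If the input is really Latin-1 rather than UTF-8, this would misread it, so the encoding assumption has to be settled first.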


Thanks a bunch for any help. I'm beating my head against this one.

    --jb

In reply to Problems with packing upper ASCII - differences across perl versions by JayBonci
