Good day everyone. I've run into a problem with Unicode characters with perl5.6.1.

I'm trying to escape arbitrary ASCII characters that people give me through various web forms. These are in turn passed back to the browser inside various XML constructs, and inside those constructs I need UTF-8 compliant content.

For instance, I'd like to make:
õ become &#245;
I figure the best way to do that would be to pack the string. Take this code snippet for example:
#!/usr/bin/perl -w
use strict;

my $foo = "abcdefghijklmnopqrstuvwxyz[]&#246;&#247;";
my $outstr = "";

foreach my $char (split("", $foo))
{
    if ((my $num = unpack("U", $char)) < 125)
    {
        $outstr .= $char;
    }
    else
    {
        $outstr .= "\&\#$num\;";
    }
}

print $outstr;
Under perl5.6.x, I receive:
jaybonci@willowisp:~/perl$ ./
Malformed UTF-8 character (1 byte, need 4) in unpack at ./ line 9.
Malformed UTF-8 character (1 byte, need 4) in unpack at ./ line 9.
However, under perl 5.8, I receive the proper output:

perldelta mentions changes and improved Unicode support in perl 5.8.0, but I don't see how the "U" template of pack would work at all under 5.6.1.

Griping aside, my first idea for solving this is to pack the characters out to 32 bits apiece (I think that is what the warning is getting at). It also occurs to me that the code above sort of works if you unpack with "C", but only for limited use. With the euro symbol (€ or &#8364;), neither template seems to pad the bits out correctly on either platform.
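To illustrate why "C" only goes so far with the euro sign, here is a minimal sketch. It assumes perl 5.8 or later (where the Encode module ships with the core) and assumes the form data arrives as UTF-8 encoded bytes; neither has been checked against 5.6.1. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the browser sent the euro sign as UTF-8, i.e. three octets.
my $bytes = "\xE2\x82\xAC";

# "C" unpacks one unsigned 8-bit value per octet, so a multi-byte
# character comes back as several small numbers, not one code point.
my @octets = unpack("C*", $bytes);
print "octets: @octets\n";                 # octets: 226 130 172

# Decoding the octets first yields a one-character string whose
# ord() is the Unicode code point.
my $chars = decode("UTF-8", $bytes);
print "length: ", length($chars), "\n";    # length: 1
print "code point: ", ord($chars), "\n";   # code point: 8364
```

In other words, the number wanted for the &#8364; reference only appears once the three octets are treated as one character.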

So my questions are:
  1. Is this a perl version / build setting problem?
  2. The pack perldoc claims that the "U" template is independent of any use utf8; stuff. Is there a version-independent pack/buffer/unpack/repack scheme that, given a character, could tell you whether it was upper ASCII or not?

Thanks a bunch for any help. I'm beating my head against this one.
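For reference, one possible shape for question 2, offered as a sketch rather than a verified answer: once the input is a decoded character string, ord() returns the code point directly, so pack/unpack is not needed to decide whether a character lies outside ASCII. This assumes perl 5.8 or later and has not been checked against 5.6.1; escape_non_ascii is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every character above 0x7F with a
# decimal numeric character reference such as &#246;.
# Assumes $text is a character string (already decoded), not raw UTF-8 bytes.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $foo = "abc[]\x{F6}\x{F7}\x{20AC}";   # o-umlaut, division sign, euro sign
print escape_non_ascii($foo), "\n";      # abc[]&#246;&#247;&#8364;
```

The test against 0x7F (127) rather than 125 keeps the characters } and ~ unescaped as well.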


In reply to Problems with packing upper ASCII - differences across perl versions by JayBonci
