PerlMonks  

Unicode and You

by belg4mit (Prior)
on Aug 18, 2002 at 09:40 UTC (#190967, perlmeditation)

... One code to rule them all

Ladies and gentlemen of the Monastery of Perl: perl 5.8. If I could offer you only one tip for the future, 5.8 would be it.

I haven't run perl 5.8 through too many production rigors yet, but as some of you may be aware, I have been doing quite a bit of Unicode development. And for this, 5.8 wins hands down. If you plan on working with Unicode (or probably any exotic encoding), upgrade. Upgrade upgrade upgrade. Feed your sysadmin horse tranquilizers if you have to, but upgrade; you'll thank yourself later.

In my recent foray into Unicode I've stumbled across two subtle bugs in 5.6 that'd drive you batty if you didn't know they were there. First, something that is not strictly a bug, although it doesn't agree with the documentation: it seems the utf8 pragma is sometimes required for UTF-8 strings. With utf8 on in 5.6.0 and 5.6.1, the regexpen s/^\s{1,0}// and s/^\s{0,0}// (and potentially others; those just happen to be the ones I ran into, and yes, of course they are silly regexps, but they were generated on the fly) will consume all leading whitespace. How's that for lovely?

The other bug is a fair bit more subtle. When doing something like print join("", map(chr, 0x17d, 0x17e)) in 5.6.0, an extra pair of bytes is printed before the 4 bytes the code creates. The solution appears to be to not do that. Instead, start the join with a null, or apparently any other ASCII character, or even an empty string, as in print join("", "", map(chr, 0x17d, 0x17e)) :-P
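For anyone who wants to check what their own perl produces, here is a minimal sketch of the second case. It assumes perl 5.8 or later (utf8::encode is built in there), and the helper names build_string and utf8_hex are made up for this illustration. It builds the string with the leading empty string described above and dumps the resulting UTF-8 bytes as hex, so a stray extra pair of bytes would be easy to spot.

```perl
use strict;
use warnings;

# Build a string from code points, starting the join with an empty
# string, which is the 5.6.0 workaround described in the post.
sub build_string {
    my @code_points = @_;
    return join("", "", map { chr($_) } @code_points);
}

# Return the UTF-8 encoding of a string as space-separated hex bytes.
# Assumption: perl 5.8+, where utf8::encode is always available.
sub utf8_hex {
    my ($string) = @_;
    my $bytes = $string;       # work on a copy; utf8::encode is in-place
    utf8::encode($bytes);      # characters -> UTF-8 octets
    return join(" ", map { sprintf("%02x", ord($_)) } split(//, $bytes));
}

# U+017D and U+017E take two bytes each in UTF-8, four bytes in total.
print utf8_hex(build_string(0x17d, 0x17e)), "\n";    # c5 bd c5 be
```

Anything other than exactly four bytes in that dump would point at the kind of problem described for 5.6.0.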

This is not to say the trip will be easy, though it may help if you never bothered to try it in 5.6. Unicode is not an easy thing to get your mind around. Good luck.

--
perl -pew "s/\b;([mnst])/'$1/g"

Re: Unicode and You
by Courage (Parson) on Aug 18, 2002 at 11:25 UTC
    Could you please shed a bit more light on the first bug that you mentioned? It seems like very weird behaviour.

    Why does perl behave that way? Is it worth reporting via perlbug?

    Courage, the Cowardly Dog
    PS. While I agree with you about the benefits of upgrading to 5.8.0 because of better Unicode support, there are some incompatibilities that make migration harder. (One example is that sockets became text mode by default on Win32.)

Re: Unicode and You
by crenz (Priest) on Aug 19, 2002 at 13:23 UTC

    Yes, I second your comments. Besides working correctly, perl 5.8 also adds a couple of nifty features. For example, it lets you conveniently set the input and output character sets for a filehandle and will take care of all the necessary encoding for you. And it adds more alphabet/character classes for regexps.
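    As a rough illustration of the two features mentioned here, the sketch below assumes perl 5.8 or later. The helper names round_trip and count_han are invented for the example. It uses an :encoding(UTF-8) layer on open so that the filehandle does the encoding and decoding, and the \p{Han} property as one example of a Unicode character class in a regexp.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write characters to a file as UTF-8 and read them back, letting the
# I/O layers do the encoding and decoding (assumes perl 5.8+).
sub round_trip {
    my ($path, $text) = @_;

    open(my $out, ">:encoding(UTF-8)", $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";

    open(my $in, "<:encoding(UTF-8)", $path) or die "Cannot read $path: $!";
    my $read = do { local $/; <$in> };    # slurp the whole file
    close($in);
    return $read;
}

# Count the characters that belong to the Han script.
sub count_han {
    my ($text) = @_;
    my $count = () = $text =~ /\p{Han}/g;
    return $count;
}

my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close($tmp_fh);

my $sample = "\x{4e2d}\x{6587} ok";       # two Han characters, a space, "ok"
my $copy   = round_trip($tmp_path, $sample);

print "round trip ", ($copy eq $sample ? "matches" : "differs"), "\n";
print "bytes on disk: ", (-s $tmp_path), "\n";
print "Han characters: ", count_han($copy), "\n";
```

    The same layer can be put on an already open handle with binmode, for example binmode(STDOUT, ":encoding(UTF-8)").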

    One thing I disliked about 5.6.1 was that it was impossible to tell it that I wanted my input and output as UTF-8. In some situations, it kept on treating my UTF-8-encoded input as raw 8-bit characters and tried to encode them as UTF-8 *again* when printing them to STDOUT... I could solve my problems, but it took me a while to work around this.
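    The double-encoding effect described here can be reproduced on purpose. The sketch below is not the 5.6.1 behaviour itself; it assumes perl 5.8+ with the Encode module, and the sub names are invented for the illustration. It shows what happens to the bytes when UTF-8 input is mistaken for 8-bit characters and encoded again, next to the decode-once, encode-once approach.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Format a byte string as space-separated hex, for inspection.
sub hex_bytes {
    my ($bytes) = @_;
    return join(" ", map { sprintf("%02x", ord($_)) } split(//, $bytes));
}

# The mistake: the UTF-8 input is never decoded, so each byte is taken
# for a character and gets encoded a second time on output.
sub encode_without_decoding {
    my ($input_bytes) = @_;
    return encode("UTF-8", $input_bytes);
}

# The fix: decode once on input, work with characters, encode once on output.
sub decode_then_encode {
    my ($input_bytes) = @_;
    my $characters = decode("UTF-8", $input_bytes);
    return encode("UTF-8", $characters);
}

my $input = "\xc5\xbd";    # U+017D as two UTF-8 bytes

print hex_bytes(encode_without_decoding($input)), "\n";    # c3 85 c2 bd
print hex_bytes(decode_then_encode($input)), "\n";         # c5 bd
```

    The first line of output is twice as long as the input, which is the usual sign that something was encoded twice.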

    perl 5.6.0 was worse. I wrote a simple module to convert traditional Chinese characters to simplified ones (yes, I know there are two on CPAN already, but I had a good reason to do so), using a conversion table in a hash. For whatever reason, 5.6.0 would produce malformed characters, but only in some cases -- on 5.6.1 it works fine.

    Now the only problem I'm facing is... writing my scripts so they will work well (or fail gracefully) with 5.8.0, 5.6.1, 5.6.0 etc...

      >Now the only problem I'm facing is... writing my scripts so they will work well (or fail gracefully) with 5.8.0, 5.6.1, 5.6.0 etc...

      That's actually what I'm working on, only I'm keeping the span open for /5\.00\d/, although I only have to handle input, not output. My solution has been to handle the raw bytes and do the Unicode conversions myself; it seems to work.
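      To make the raw-bytes approach concrete, here is a minimal sketch of decoding UTF-8 by hand. It is not the code referred to above, and the sub name utf8_bytes_to_code_points is invented for the illustration. It sticks to unpack and bit arithmetic, so it does not rely on perl's own Unicode support. It dies on truncated or malformed sequences, but it does not reject overlong forms or surrogates.

```perl
use strict;

# Decode a string of raw UTF-8 bytes into a list of code points by hand.
sub utf8_bytes_to_code_points {
    my ($bytes) = @_;
    my @octets = unpack("C*", $bytes);
    my @code_points;

    while (@octets) {
        my $lead = shift @octets;
        my ($value, $extra);

        # The lead byte gives the first bits and the number of
        # continuation bytes that follow.
        if    ($lead < 0x80)           { ($value, $extra) = ($lead,        0) }
        elsif (($lead & 0xE0) == 0xC0) { ($value, $extra) = ($lead & 0x1F, 1) }
        elsif (($lead & 0xF0) == 0xE0) { ($value, $extra) = ($lead & 0x0F, 2) }
        elsif (($lead & 0xF8) == 0xF0) { ($value, $extra) = ($lead & 0x07, 3) }
        else { die sprintf("Unexpected lead byte 0x%02X\n", $lead) }

        # Each continuation byte contributes its low six bits.
        for (1 .. $extra) {
            my $next = shift @octets;
            die "Truncated sequence\n" unless defined $next;
            die sprintf("Bad continuation byte 0x%02X\n", $next)
                unless ($next & 0xC0) == 0x80;
            $value = ($value << 6) | ($next & 0x3F);
        }
        push @code_points, $value;
    }
    return @code_points;
}

# The four bytes from the original post decode to U+017D and U+017E.
print join(" ", map { sprintf("U+%04X", $_) }
    utf8_bytes_to_code_points("\xc5\xbd\xc5\xbe")), "\n";
```

      The trade-off is the one raised in this thread: it works without any help from the interpreter, at the cost of reimplementing what later perls already provide.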

      --
      perl -pew "s/\b;([mnst])/'$1/g"

        Well, in my case it would mean I would have to do my own UTF-8 conversion -- which would mean reimplementing a lot of code that's already there in later Perl versions... sounds a bit silly. But I guess it depends on your application.

Node Type: perlmeditation [id://190967]
Approved by blakem
Front-paged by rob_au