http://www.perlmonks.org?node_id=298508

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is the captivating title of an article by Joel Spolsky.

The catch line was

When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So here are a few meditation themes for the weekend.

  1. How much do you know about Unicode? If you don't, does this lack of knowledge affect you or your work?
  2. Is Perl in much better shape than PHP for Unicode support?

I admit that Unicode has not bothered me that much, and I have survived so far without writing "use utf8" in any of my scripts.

Am I at risk?


Re: Programmers, script languages, and Unicode
by allolex (Curate) on Oct 11, 2003 at 15:27 UTC

    The following is from the perluniintro (perldoc) included with 5.8.0:

    Perl's Unicode Support

    Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode natively. Perl 5.8.0, however, is the first recommended release for serious Unicode work. The maintenance release 5.6.1 fixed many of the problems of the initial Unicode implementation, but for example regular expressions still do not work with Unicode in 5.6.1.

    Starting from Perl 5.8.0, the use of "use utf8" is no longer necessary. In earlier releases the "utf8" pragma was used to declare that operations in the current block or file would be Unicode-aware. This model was found to be wrong, or at least clumsy: the "Unicodeness" is now carried with the data, instead of being attached to the operations.

    Only one case remains where an explicit "use utf8" is needed: if your Perl script itself is encoded in UTF-8, you can use UTF-8 in your identifier names, and in string and regular expression literals, by saying "use utf8". This is not the default because scripts with legacy 8-bit data in them would break. See utf8.
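A minimal sketch of what "the Unicodeness is carried with the data" can look like in practice, assuming Perl 5.8 or later with the core Encode module. The variable names are illustrative and not taken from perluniintro. The same two bytes count as two units before decoding and as one character afterwards, with no pragma involved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two raw bytes: the UTF-8 encoding of U+00E9 (e with acute accent).
my $bytes = "\xC3\xA9";

# Decoding turns the bytes into a character string. From here on the
# string itself records that it holds characters, not bytes.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;    # counts bytes
my $char_len = length $chars;    # counts characters

# prints "bytes: 2, characters: 1, code point: U+00E9"
printf "bytes: %d, characters: %d, code point: U+%04X\n",
    $byte_len, $char_len, ord($chars);
```

Note that no "use utf8" appears: the source file here is plain ASCII, so the one remaining use of the pragma does not apply.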

    The 5.8.1 maintenance release made a few changes (http://search.cpan.org/src/JHI/perl-5.8.1/pod/perldelta.pod):

    UTF-8 On Filehandles No Longer Activated By Locale

    In Perl 5.8.0 all filehandles, including the standard filehandles, were implicitly set to be in Unicode UTF-8 if the locale settings indicated the use of UTF-8. This feature caused too many problems, so the feature was turned off and redesigned: see "Core Enhancements"

    UTF-8 no longer default under UTF-8 locales

    In Perl 5.8.0 many Unicode features were introduced. One of them was found to be of more nuisance than benefit: the automagic (and silent) "UTF-8-ification" of filehandles, including the standard filehandles, if the user's locale settings indicated use of UTF-8.

    For example, if you had en_US.UTF-8 as your locale, your STDIN and STDOUT were automatically "UTF-8", in other words an implicit binmode(..., ":utf8") was made. This meant that trying to print, say, chr(0xff), ended up printing the bytes 0xc3 0xbf. Hardly what you had in mind unless you were aware of this feature of Perl 5.8.0. The problem is that the vast majority of people weren't: for example in RedHat releases 8 and 9 the default locale setting is UTF-8, so all RedHat users got UTF-8 filehandles, whether they wanted it or not. The pain was intensified by the Unicode implementation of Perl 5.8.0 (still) having nasty bugs, especially related to the use of s/// and tr///. (Bugs that have been fixed in 5.8.1)

    Therefore a decision was made to backtrack the feature and change it from implicit silent default to explicit conscious option. The new Perl command line option -C and its counterpart environment variable PERL_UNICODE can now be used to control how Perl and Unicode interact at interfaces like I/O and for example the command line arguments. See perlrun/-C and perlrun/PERL_UNICODE for more information.

    You can also now use safe signals with POSIX::SigAction. See POSIX/POSIX::SigAction.
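The chr(0xff) example from perldelta can be reproduced without touching the locale. This is only a sketch, assuming Perl 5.8 or later. It writes to in-memory filehandles (the buffer names are made up for illustration) so the bytes that each layer produces can be inspected.

```perl
use strict;
use warnings;

my $char = chr(0xff);

# Write the same character to two in-memory filehandles:
# one raw, one with an explicit :utf8 layer.
my ($raw_out, $utf8_out) = ('', '');

open my $raw_fh, '>:raw', \$raw_out or die "open: $!";
print {$raw_fh} $char;
close $raw_fh;

open my $utf8_fh, '>:utf8', \$utf8_out or die "open: $!";
print {$utf8_fh} $char;
close $utf8_fh;

# Render each output buffer as space-separated hex bytes.
my $raw_hex  = join ' ', map { sprintf '%02x', ord($_) } split //, $raw_out;
my $utf8_hex = join ' ', map { sprintf '%02x', ord($_) } split //, $utf8_out;

print "raw:   $raw_hex\n";     # prints "raw:   ff"
print ":utf8: $utf8_hex\n";    # prints ":utf8: c3 bf"
```

In 5.8.0 the second behaviour was applied silently under a UTF-8 locale. From 5.8.1 on it has to be requested, either per handle as above or globally through -C / PERL_UNICODE.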

Re: Programmers, script languages, and Unicode (ignorance is bliss)
by tye (Sage) on Oct 12, 2003 at 05:20 UTC

    To date, the worst problems I've had with character encoding in the post-Unicode era (when everyone takes ASCII for granted -- I've had nastier problems than these prior to this era) have been because some part of the system knew about character encodings, not because parts of the system were blissfully unaware of Unicode encodings.

    In my experience, most current forms of information interchange don't have clean enough support for getting the character encoding sent through. Most file systems don't track what encoding the text in the file is recorded in. If you read data from a file, pipe, socket, etc. the built-in library calls are not going to give you any help determining what character encoding you are dealing with.

    So, if you want to deal with non-ASCII characters, you usually end up in one of two situations. Either you are immersed in a particular encoding environment and just mostly assume that one encoding (or pick between it and ASCII) and things mostly work well. Or you have to put effort into tracking encoding on one end and effort into conveying encoding to the other end. In both of these situations, it is the pieces in between that are aware of encodings that are likely to cause you problems.

    If I have a piece in the middle that blissfully ignores character encoding and just shuffles the bytes between the part on its left and the part on its right, then I can happily, successfully, correctly pass characters through it in whatever encoding I desire.

    But if I have a part in the middle that wants/expects to know the encoding of the characters, then it is likely to croak when I send characters that aren't encoded as it expected or to "helpfully" translate from one encoding to another as it passes the data between its neighbors in the system.

    For such a middle-layer piece, I then need to communicate to it what encoding is supposed to be used on each side. So instead of picking an encoding and making sure the far end knows which one I picked and being done with the problem, I've got to identify all of the parts in the middle who are encoding-aware and figure out how each of them wants to be informed of encodings (or even if they let me tell them what encoding to use -- they are likely to quite simply insist on UTF-8 and anyone who wants otherwise can go jump in a lake) and then figure out how to get all of these different encodings and notices of encodings to match up so that what comes out the far end is sane.
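The difference between the two kinds of middle layer can be sketched in a few lines, assuming the core Encode module. Both subs below are hypothetical stand-ins for real components. The "helpful" one wrongly assumes its input is Latin-1 and converts it to UTF-8 on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The sender's payload: the word "cafe" with an accented e,
# already encoded as UTF-8 bytes.
my $payload = "caf\xC3\xA9";

# An encoding-unaware layer: it just hands the bytes along.
sub pass_through {
    my ($bytes) = @_;
    return $bytes;
}

# A hypothetical "helpful" layer that assumes Latin-1 input
# and re-encodes it as UTF-8.
sub helpful_layer {
    my ($bytes) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

my $untouched = pass_through($payload);
my $mangled   = helpful_layer($payload);

printf "pass-through: %s\n", unpack('H*', $untouched);  # 636166c3a9
printf "helpful:      %s\n", unpack('H*', $mangled);    # 636166c383c2a9
```

The pass-through output is byte-identical to the input. The "helpful" output is double-encoded, and the receiving end sees two junk characters where the accented letter should be.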

    Someone needs to define the "encoded character stream" to replace the "byte stream" (that Unix managed to make universally supported) so that every I/O layer can choose to either automatically know about encodings or remain blissfully unaware of them. Until then, adding awareness of Unicode to layers will likely cause more problems in many situations.

    Nope, I don't have the solution. And I understand the problem of finding a place that doesn't support Unicode when you need it to. I'm just noting the trend and making a prediction that things are going to get much worse as they get better.

                    - tye
Re: Programmers, script languages, and Unicode
by William G. Davis (Friar) on Oct 11, 2003 at 15:21 UTC
    Am I at risk?

    Of course not.

    Joel's article went on and on making it seem as though there was nothing but utter chaos before Unicode, then finally, towards the end, he mentions 8 bit ISO Latin-1.

    Latin-1 is a truly monumental achievement. This character set contains all of the letters from the Latin alphabet and then some. Using this one, 8 bit character set we can represent the vast majority of the languages of North America, South America, and Europe.

    If someone in China or the United Arab Emirates tries to use your application full of English dialogs and the dialogs all show up as question marks or whatever, how exactly does Unicode make your program any more usable? OK, now they'll show up as English. Fine. But can the user even read English?

      I'll agree that Joel's article made a bigger deal out of Unicode than was needed or is true, but calling Latin-1 a "truly monumental achievement" falls, I think, into the same category. Latin-1 is a good choice if someone in North America, South America, or Western Europe wants a program to be used only in those regions, but if a person from Asia (which includes many more countries than China), the United Arab Emirates, Israel, Eastern Europe, etc. needs to use that program, it won't work. The basic point is that if you need a language other than English, Unicode is the easiest option. BTW, at least in Japan, a great number of people speak, read, and write passable English.

      Now for my review: this article places far too much value on Unicode, given that Unicode will not translate English to Japanese or Japanese to English; it just displays text. Also, he seems to think, or at least say, that Unicode should be used in every application on the possibility that it might be used in a non-English-speaking country; see the previous sentence for why this isn't possible. That said, it does give good information on the use of Unicode, and I would recommend that people who are writing international software read it.

      BTW, I started writing letters to a penpal in Japan in sixth grade and have continued that relationship in digital form using email, but in a business sense I have never needed or used Unicode. I might need it for a Hindi (India) project, though that has been put off for a long time.

      Updated: I may have to use Unicode at work after all

      "Pain is weakness leaving the body, I find myself in pain everyday" -me

      Monumental indeed. Its usability ends at the former Iron Curtain. Where are the accented characters for Czech, Polish, Slovak, Hungarian, Slovenian and Croatian? (All of these do use the Latin alphabet.)

      Jenda
      Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
         -- Rick Osborne

      Edit by castaway: Closed small tag in signature

      Using this one 8 bit character set we can represent the vast majority of the languages of North America, South America, and Europe.

      Perhaps, but consider that the ISO-8859-1 character set does not even contain the symbol for the currency used by most EU countries, the Euro.
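Both gaps mentioned in these replies (the Euro sign and the Central European accented letters) can be checked with the core Encode module. This is a rough sketch: the helper name representable() is made up for illustration, and the Czech letter is just one sample character.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro    = "\x{20AC}";   # EURO SIGN
my $c_caron = "\x{010D}";   # LATIN SMALL LETTER C WITH CARON (Czech)

# Hypothetical helper: true if $char can be encoded in $charset.
# FB_CROAK makes encode() die on an unmappable character, and
# LEAVE_SRC keeps it from modifying its input.
sub representable {
    my ($charset, $char) = @_;
    my $ok = eval {
        encode($charset, $char, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

for my $charset ('ISO-8859-1', 'ISO-8859-2', 'ISO-8859-15') {
    printf "%-12s euro: %s  c-caron: %s\n",
        $charset,
        representable($charset, $euro)    ? 'yes' : 'no',
        representable($charset, $c_caron) ? 'yes' : 'no';
}
```

Latin-1 should report "no" for both characters. ISO-8859-15 adds the Euro sign and ISO-8859-2 covers the Czech letter, but neither 8-bit set covers both, which is the gap Unicode closes.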

Re: Programmers, script languages, and Unicode
by Courage (Parson) on Oct 11, 2003 at 17:39 UTC
    As for my relationship with Unicode, I am very aware of the need for Unicode support. It is critical in our department at work (I work for a translation department).
    When I forget about this, I receive user messages about incorrectly displayed characters.

    This is the main reason I have revived and now support the Tcl::Tk module: to have a robust Unicode-enabled GUI.

    Courage, the Cowardly Dog

Re: Programmers, script languages, and Unicode
by Jenda (Abbot) on Oct 12, 2003 at 15:16 UTC

    You never know when and by what you are going to get caught. Recently we had a big problem with one client. We are getting some data from another company in XML. It's all English text so we did not pay any attention to the encoding (UTF-8). (Well, the import was in ASP+VBScript so there was not much we could do.)

    Everything was fine for most clients, except one, because they were using MS Word to prepare the data that they then pasted into the other company's system, and MS Word's "Autoformat As You Write" screwed us up. It converted double and single quotes to "smart quotes" and nice apostrophes. These were 8-bit chars and were encoded as 3 bytes in UTF-8.
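The one-byte-to-three-bytes jump is easy to see with the core Encode module. The sketch below assumes the 8-bit side was Windows-1252, which the post does not actually say. The hash and its labels are purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Word's "smart" punctuation as single bytes in Windows-1252
# (an assumption: the post does not name the 8-bit code page).
my %cp1252 = (
    'left double quote'  => "\x93",
    'right double quote' => "\x94",
    'apostrophe'         => "\x92",
);

my %utf8_len;
for my $name (sort keys %cp1252) {
    my $char  = decode('cp1252', $cp1252{$name});   # byte -> character
    my $bytes = encode('UTF-8', $char);             # character -> UTF-8
    $utf8_len{$name} = length $bytes;
    printf "%-18s U+%04X -> %s (%d bytes)\n",
        $name, ord($char), unpack('H*', $bytes), length $bytes;
}
```

Each of these characters lives in the U+2000 block, so its UTF-8 form takes three bytes. Code that treats the XML as "plain English, one byte per character" will trip on them.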

    Long story short ... I did not find any reasonable way to fix the issue in VBScript so I ended up reimplementing the import in Perl ;-)

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

    Edit by castaway: Closed small tag in signature

Re: Programmers, script languages, and Unicode
by dbush (Deacon) on Oct 12, 2003 at 13:20 UTC

    Overall I liked the article and wasn't put off by the slightly confrontational style, but this comment threw me a bit:

    You probably think I'm going to talk about very old character sets like EBCDIC here. Well, I won't. EBCDIC is not relevant to your life. We don't have to go that far back in time.

    I was once chatting with an IBM pre-sales consultant who gave me the fact-oid that something like 80% of the data in the world live on mainframes in EBCDIC (VSAM?) files i.e. the vast majority of data is not in relational databases.

    I must say that I am still surprised by this and not sure I believe it, but the assertion that EBCDIC is dead and of no importance is, I think, flawed.

    Regards,
    Dom.

Re: Programmers, script languages, and Unicode
by Abigail-II (Bishop) on Oct 12, 2003 at 01:23 UTC
    I know Unicode exists, I know the basics of how it works, and I've even once written a "Hello, world" type of program that emitted Unicode.

    But that's it. I've no practical usage for Unicode. I have somewhere the ability to bring up an xterm that's able to display some of the Unicode fonts (but far from all the defined code points) and I don't even remember whether it's on my PC or my laptop (or was it my previous laptop?).

    Perl would still be as useful as it is for me (whether work related or private stuff) if it didn't have Unicode support. I won't be interested in Unicode until my xterm with its "6x13" font is able to display Unicode, and I've a convenient way of inputting Unicode - right now, I can't even remember how to input accented letters when using my editor.

    As for the second question, I don't do PHP.

    Abigail

Re: Programmers, script languages, and Unicode
by IlyaM (Parson) on Oct 14, 2003 at 09:37 UTC
    Recently I tried to do a project with Unicode support in Perl. Short summary: Perl 5.8.x is a must-have; it kinda works, but many things (mainly XS modules) don't play nicely with Unicode, so you often end up looking for workarounds. More about this in my journal on use.perl.org.

    P.S. To those who commented that they never needed Unicode: you are just lucky to be native English speakers, and lucky to have customers of the same origin who never need anything other than ASCII, or at most Latin-1, in their apps. I guess most of you never realize how bad the state of things is and how many apps are broken when you do need something outside the ASCII or Latin-1 domain. The world would be a much better place if software developers were not ignorant of Unicode.

    --
    Ilya Martynov, ilya@iponweb.net
    CTO IPonWEB (UK) Ltd
    Quality Perl Programming and Unix Support UK managed @ offshore prices - http://www.iponweb.net
    Personal website - http://martynov.org

Re: Programmers, script languages, and Unicode
by dragonchild (Archbishop) on Oct 12, 2003 at 03:33 UTC
    Somewhat off the subject, but I saw this and had to comment. This is a quote from PHP's string page.

    Note: It is no problem for a string to become very large. There is no practical bound to the size of strings imposed by PHP, so there is no reason at all to worry about long strings.

    "Don't worry about big strings. PHP will make sure you can use them." Uh-huh. Seems like File::Stream should be encoded in PHP, seeing as it was worried about infinite streams ... :-)

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: Programmers, script languages, and Unicode
by jonadab (Parson) on Oct 12, 2003 at 11:47 UTC

    Living in a small city in the middle of Ohio... I have yet to discover any use whatsoever for Unicode. I think you only need it if you want to support languages besides English, or some other esoteric thing nobody around here ever has any reason to do ;-)


    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: Programmers, script languages, and Unicode
by cbraga (Pilgrim) on Oct 11, 2003 at 14:18 UTC
    Joe Spolsky is an arrogant idiot who did write a couple good articles, such as "Good software takes ten years to write" and is now only coasting on the success of those few articles. For a long, long time he's been writing nothing but opinionated rants that have no value. I don't even bother reading his blog anymore.
      Well, i found the article to be quite informative. Joe might possibly be arrogant, but i appreciate his "rants" none-the-less. This one was less rant and more info, so you really should read it instead of ranting yourself. :P

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
      Joe Spolsky is an

      Well if you can't even get his name right...

      arrogant

      I see this all the time. Someone writes an article stating their view on something and they're instantly labeled "arrogant" for doing so. Do you really think this helps to create informative debate?

      idiot

      Perhaps. I also disagree with Joel on many things. Consider this though: what have you written and made available lately? People don't read Joel's articles because they're perfect, they read them because they're accessible, entertaining, and they learn something from them. So are you saying they shouldn't read them? That Joel should stop writing them? Where's a better alternative?

      For a long, long time he's been writing nothing but opinionated rants that have no value

      Perhaps you should read the article. This one provides direct information that can be applied to improve your code and raises awareness of an important but rarely considered issue.

      I don't even bother reading his blog anymore.

      So why are you commenting here? If you're going to comment on the quality of something, at least know what it is.

      Nice to see that the trolls are registering accounts though. I'm sure that will make many people happy.

      Joe Spolsky

      s/Joe/Joel/;

      Well, whatever your views are on Spolsky (and I must admit that I think he's a little overrated), I thought this particular article was actually quite interesting.

      -- vek --