Re^9: any use of 'use locale'? (source encoding) by afoken (Parson)
on Nov 23, 2009 at 16:19 UTC
You answered like i suppress unicode to everyone and to everywhere. That's not my goal.
No, and I did not understand you like that. I just wanted to explain why Unicode support still sucks so much. It's not just a Perl problem; some big problems with Unicode are outside our control. I would be happy if I could use utf8_for_everything;, but that cannot work today. We could end up implementing use utf8_where_possible; plus the same manual fiddling to turn on Unicode in subsystems that do not (yet) understand use utf8_where_possible; or that need workarounds.
I'd like to have a sandbox in Perl, where unicode were treated naturally.
Unicode is largely treated naturally in Perl (at least since 5.8.1). You put binary data, a string with a legacy encoding, or a string with a Unicode encoding into a scalar, and everything works as expected. All the magic happens behind the scenes. Things become ugly as soon as you start interfacing with the outside world, e.g. STDIN, STDOUT, STDERR, %ENV, @ARGV, and external libraries (XS code). DBI and the DBD::* modules are currently gaining more and more Unicode support, simply by reviewing and changing every place in the code where Perl scalars become C strings and vice versa, so that the code respects or sets the internal Unicode flag of the Perl scalar. Sometimes this is done by passing a Perl scalar further down into the code instead of passing a C string (this happened with some internal DBI APIs). Sometimes it is done by converting Perl's idea of Unicode to and from what the operating system or a library expects (this happens in DBD::ODBC).
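A minimal sketch of that boundary handling, using the core Encode module. The input bytes are a hypothetical stand-in for what might arrive via @ARGV or %ENV, and the assumption that the outside world speaks UTF-8 is mine, not something Perl can know by itself:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the bytes of the Cyrillic word "Привет" in UTF-8,
# as they might arrive from @ARGV or %ENV on a UTF-8 system (assumption).
my $bytes = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Decode at the boundary: bytes come in, characters live inside the program.
my $chars = decode('UTF-8', $bytes);

# Encode at the boundary: characters inside, bytes go out.
my $out = encode('UTF-8', $chars);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length($bytes), length($chars), length($out);
# prints "bytes in: 12, characters: 6, bytes out: 12"
```

Everything between the decode and the encode can treat the scalar as a plain string of characters; only the two boundary calls need to know about the encoding.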
Trying to make it clearer: I am not very familiar with Perl's history, but let me assume that at some stage there was no strict pragma. OK? Then someone thought it might be a good idea and found a way to implement it. Did that break any earlier code? I don't think so. But it made the use strict pragma widely available.
This is part of your problem understanding the problems of Unicode support. Perl has a long history and culture of NOT breaking old code. use strict is an example of this. The inventors of strict could have turned strict on by default and forced people to update their legacy code, either by adding no strict or by cleaning up the old junk. This would perhaps have reduced the amount of bad Perl code a lot, and would have forced newbies to write cleaner code. But many people would have gotten very angry, because millions of lines of code would have stopped working from one day to the next, just because the latest f*** Perl update started bean counting instead of getting the job done.
The same thing happened with Unicode support, and you will find some good explanations in the Perl documentation of why Unicode support is largely OFF by default. Turning it on by default would have broken even more code that assumes a character is a byte.
So here is my point. As far as I can see, module authors have no way to find out whether the module's caller uses utf8 or not. Am I correct? And would it break any earlier code if they had such a possibility? That would be a single step, IMHO :)
Wrong problem. The module caller may use a mix of Unicode, legacy encoding and binary data at any time. For any function or method in a module or class, it is completely irrelevant if "the caller uses utf8" or not.
Modules (or better: their authors) must no longer assume that scalars contain bytes; they contain arbitrarily large characters. length returns the number of characters in a scalar. If the internal Unicode flag on a scalar is turned off, the module may safely assume that the scalar contains bytes, either binary data or a legacy encoding. When it is on, the module must correctly handle large characters. When interfacing with the outside world (O/S, network, database, ...), it may be necessary to convert the large characters to a different encoding (and back, of course). Whenever scalars are returned, they may either have the Unicode flag set and may contain large characters, or they have the flag cleared and must not contain large characters, not even as a UTF byte stream. (Except, of course, when the purpose of the function is to generate UTF byte streams.)
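The difference between large characters and a UTF byte stream can be made visible with length and utf8::is_utf8. This is only an illustrative sketch; the helper name describe is made up for this example:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: report the length and the state of the internal
# Unicode flag for a scalar.
sub describe {
    my ($label, $string) = @_;
    my $flag = utf8::is_utf8($string) ? 'on' : 'off';
    return sprintf '%-10s length %d, flag %s', $label, length($string), $flag;
}

my $characters = "\x{263A}";                    # one large character (U+263A)
my $bytes      = encode('UTF-8', $characters);  # the same text as a UTF-8 byte stream

print describe('characters', $characters), "\n";  # length 1, flag on
print describe('bytes',      $bytes),      "\n";  # length 3, flag off
```

A function that returns $bytes where its caller expects text is exactly the "UTF byte stream" case described above: the flag is off, but the scalar does not hold one character per element.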
Many modules do not need changes, because they did not assume byte==character from the beginning, and so Perl automatically does the right thing. Some modules tried to handle Unicode all by themselves even before Perl had Unicode support; Template::Toolkit seems to be such a module. They mostly work, and as long as you don't mix them with really Unicode-capable modules, nothing goes wrong. Their only problem is that their scalars contain UTF byte streams instead of large characters. This can only be solved by either dropping support for legacy Perl versions (i.e. use 5.008_001) or by having the module code behave differently for old and new Perl versions.
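The second option, behaving differently on old and new Perls, could look roughly like the sketch below. The helper name to_characters and the assumption that the module's internal byte streams are UTF-8 are hypothetical; a real module would have more places to adapt:

```perl
use strict;
use warnings;

# $] holds the version of the running perl; 5.8.1 is 5.008001.
my $HAVE_UNICODE = $] >= 5.008001;

# Hypothetical helper: turn a UTF-8 byte stream into large characters on
# new Perls, and pass it through unchanged on old ones.
sub to_characters {
    my ($octets) = @_;
    return $octets unless $HAVE_UNICODE;
    require Encode;                      # loaded only when it is usable
    return Encode::decode('UTF-8', $octets);
}

my $text = to_characters("\xE2\x98\xBA");   # the UTF-8 bytes for U+263A
print length($text), "\n";                  # 1 on Perl 5.8.1 or later
```

The first option is simply a use 5.008_001; line at the top of the module, after which the branch on $HAVE_UNICODE is no longer needed.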
I have not investigated deeply how Unicode-proof Linux is by now, but at the system level I haven't had any complaints for years (Debian and Kubuntu). If you could give me some hints on how to determine Unicode use, I'd like to test it.
OK, some simple problems. "Unicode string" here means a string containing characters outside the ASCII and ISO range, e.g. the smiling face, Cyrillic letters, or the like. See http://cpansearch.perl.org/src/MJEVANS/DBD-ODBC-1.23/t/40UnicodeRoundTrip.t for examples. "Legacy string" means a string in any non-Unicode encoding, like ASCII, ISO-8859-x, the various Asian encodings, and so on.
Yes, these are stupid little tests, except for the last one. They are very similar to what the UnicodeRoundTrip test linked above does. You would not believe how often that simple test broke. And before I added the test, I had even more problems with data that was modified somewhere between Perl and the database engines, causing other tests to break very mysteriously or even to fail silently.
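For illustration only, a round-trip test of this kind can be sketched with Test::More. This is not the DBD::ODBC test itself: a temporary file opened with an :encoding(UTF-8) layer stands in for the database, and the sub name round_trip is made up for this example:

```perl
use strict;
use warnings;
use Test::More tests => 3;
use File::Temp qw(tempfile);

# Hypothetical stand-in for "the outside world": write a string to a
# temporary file as UTF-8, read it back, and return what came out.
sub round_trip {
    my ($string) = @_;

    my ($out, $filename) = tempfile(UNLINK => 1);
    binmode $out, ':encoding(UTF-8)';
    print {$out} $string;
    close $out or die "close failed: $!";

    open my $in, '<:encoding(UTF-8)', $filename or die "open failed: $!";
    local $/;                 # slurp the whole file
    my $back = <$in>;
    close $in;
    return $back;
}

# A legacy string (ASCII), a legacy string (ISO-8859-1 umlauts), and a
# Unicode string (smiling face plus a Cyrillic letter).
is(round_trip("hello"),            "hello",            'ASCII survives');
is(round_trip("\xE4\xF6\xFC"),     "\xE4\xF6\xFC",     'ISO-8859-1 survives');
is(round_trip("\x{263A}\x{0416}"), "\x{263A}\x{0416}", 'Unicode survives');
```

If either layer is left off, the third comparison is the one that typically fails, usually accompanied by a "Wide character in print" warning.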
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)