Re^7: any use of 'use locale'? (source encoding)

by afoken (Parson)
on Nov 21, 2009 at 22:09 UTC (#808627)


in reply to Re^6: any use of 'use locale'? (source encoding)
in thread any use of 'use locale'?

Yes, Unicode support is not as good as it could be. We are in a transition phase from ASCII, the various ISO-8859-n encodings, and several multibyte encodings to Unicode. ASCII is about 35 years older than Unicode, the ISOs are still about 10 years older. The biggest problem of Unicode is that a char is no longer the same as a byte, which breaks at least 35 years of code. (At my current job, nobody knows Unicode. They still talk about ASCII, and will continue to do so for at least the next decade. So, introducing Unicode breaks 40 to 50 years of code.) And to make things worse, all Unicode encodings except for UTF-8 typically contain lots of NUL bytes, breaking even more code that expects NUL bytes only at the end of strings.
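To make the byte-versus-character point concrete, here is a minimal sketch using the core Encode module. The helper name lengths and the sample string are made up for illustration; it only counts characters, encoded bytes, and NUL bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: compare the character count of a string with the
# size of its UTF-8 and UTF-16LE encodings, and count NUL bytes in UTF-16LE.
sub lengths {
    my ($text) = @_;
    my $utf8  = encode('UTF-8',    $text);
    my $utf16 = encode('UTF-16LE', $text);
    return {
        chars       => length($text),
        utf8_bytes  => length($utf8),
        utf16_bytes => length($utf16),
        utf16_nuls  => ($utf16 =~ tr/\0//),   # tr/// without replacement just counts
    };
}

# "naive" with i-diaeresis, a space, and a smiling face: 7 characters
my $report = lengths("na\x{EF}ve \x{263A}");
printf "%-12s %d\n", $_, $report->{$_} for sort keys %$report;
```

For this sample the character count is 7, the UTF-8 form is 10 bytes, and the UTF-16LE form is 14 bytes, 6 of which are NUL. Code that uses length() to size a buffer or that stops at the first NUL byte would get all of these wrong.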

A CPU (and all of the other hardware) has no problems with Unicode. It's not a hardware problem at all. So, the problem must start at the operating system:

Nearly all of our current and legacy file systems assume that a char and a byte are the same, and often they also assume that a NUL byte marks the end of a filename. So, we need to change the filesystems. Very often, UTF-8 can be used instead of ASCII, leaving only some problems of byte lengths vs. character lengths and of all those old byte-based characters above 0x7F. In fact, we need to know what encoding is used for each filename, or at least for each instance of each filesystem. The operating system needs to take care of the different encodings, and offer a Unicode-based API for the filesystems. Windows has ASCII and Wide APIs for this purpose, but as far as I understand, Wide means UCS-2, which is only a subset of UTF-16 and does not cover the entire Unicode set. The ASCII API has no support for Unicode. I'm not quite sure whether Linux has an 8-bit-transparent API that is able to pass UTF-8 or has a real UTF-8-based API.
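From the Perl side, file names are plain byte strings on the way in and on the way out. The following is a rough sketch, assuming a Linux-like system whose file system stores name bytes unchanged and a convention that names are UTF-8; the helper name roundtrip_filename is made up.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Hypothetical helper: create a file whose name is given as characters,
# then read the directory back and return (raw bytes, decoded characters).
sub roundtrip_filename {
    my ($name) = @_;
    my $dir   = tempdir(CLEANUP => 1);
    my $bytes = encode('UTF-8', $name);        # assumption: names are UTF-8 on disk

    open my $fh, '>', "$dir/$bytes" or die "cannot create file: $!";
    close $fh or die "cannot close file: $!";

    opendir my $dh, $dir or die "cannot open directory: $!";
    my ($raw) = grep { $_ ne '.' && $_ ne '..' } readdir $dh;   # readdir returns bytes
    closedir $dh;

    return ($raw, decode('UTF-8', $raw));
}

my $wanted = "\x{263A}\x{444}.txt";            # smiling face, Cyrillic ef, ".txt"
my ($raw, $decoded) = roundtrip_filename($wanted);
printf "characters: %d, bytes on disk: %d, round trip %s\n",
    length($wanted), length($raw), $decoded eq $wanted ? 'ok' : 'BROKEN';
```

Nothing in this exchange tells the operating system which encoding the name is in; the encode and decode calls are purely a convention between programs.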

So, now that we can have filenames and especially directory names in Unicode, $ENV{'PATH'} must be able to contain Unicode characters, and some other environment variables, too. So, we need a Unicode environment, preferably with support for Unicode keys. As far as I understand, Windows offers a UCS-2 environment to "Unicode" programs and an ASCII environment for non-Unicode programs. Linux provides an 8-bit-clean environment and lets each program decide about the encoding of the environment.

As for the environment, the command line arguments must be able to contain Unicode. The same game here, Linux passes a NUL-terminated array of bytes and lets the program decide about the encoding, Windows offers two APIs depending on how the program was compiled and linked.
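"Lets the program decide" looks roughly like this in Perl. This is only a sketch: the helper name decode_os_string is made up, and both the UTF-8 assumption and the ISO-8859-1 fallback are guesses that the program has to make, because the operating system offers no hint.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn a byte string received from the OS into characters.
# Assumption: the bytes are UTF-8; if they are not valid UTF-8, guess ISO-8859-1.
sub decode_os_string {
    my ($bytes) = @_;
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $chars ? $chars : decode('ISO-8859-1', $bytes);
}

# Decoded copies of the command line and the environment.
my @args = map { decode_os_string($_) } @ARGV;
my %env  = map { $_ => decode_os_string($ENV{$_}) } keys %ENV;

printf "%d argument(s), PATH has %d character(s)\n",
    scalar(@args), length($env{PATH} // '');
```

The fallback can silently pick the wrong legacy encoding, which is exactly the problem: without an encoding attached to the environment or the argument list, every program has to guess for itself.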

All of those really basic things about running a program are not yet complete. I simply do not know any operating system that treats each and every string passed to its APIs as Unicode.

A completely different problem is text files of all kinds, starting with what we call "plain text", scripts, source code, logs and so on. For each text file we read or write, we need to know its encoding. Current operating systems cannot give us the slightest hint about the encoding. HTML and XML have a default encoding and may contain hints about a different encoding. So, I/O in text mode is a huge and unsolved problem.
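In Perl, the encoding has to be stated by the program for every file handle. A small sketch with PerlIO encoding layers follows; the helper names write_text and read_text are made up, and the temporary file stands in for any text file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helpers: write and read a whole text file with an explicit encoding.
sub write_text {
    my ($path, $encoding, $text) = @_;
    open my $out, ">:encoding($encoding)", $path or die "cannot write $path: $!";
    print {$out} $text;
    close $out or die "cannot close $path: $!";
    return;
}

sub read_text {
    my ($path, $encoding) = @_;
    open my $in, "<:encoding($encoding)", $path or die "cannot read $path: $!";
    local $/;                      # slurp mode: read the whole file at once
    my $text = <$in>;
    close $in;
    return $text;
}

my (undef, $path) = tempfile(UNLINK => 1);
write_text($path, 'UTF-8', "\x{263A} smile\n");

my $right = read_text($path, 'UTF-8');        # the encoding we happen to know
my $wrong = read_text($path, 'ISO-8859-1');   # a wrong guess, accepted silently
printf "right guess: %d characters, wrong guess: %d characters\n",
    length($right), length($wrong);
```

The wrong guess raises no error; it just yields 10 characters of garbage instead of the 8 that were written, because nothing stored with the file says which reading is correct.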

Networking: IP, TCP and UDP are all about stuffing bytes into tubes and collecting those that fall out of other tubes. ;-) No problem so far. The problems arise at higher levels, where the protocols start working with text strings. Think about the unfortunate punycode used in DNS. Think about e-mail accounts. E-mail and HTTP have at least a Charset header, solving the problem of the content. But headers are still ASCII. E-mail addresses are passed in the header. Think about FTP. I don't know how FTP would or should handle Unicode filenames.

If we could throw away all old and existing systems and simply start a new set of operating systems, file systems and network protocols, everything would be easy and simple: Store a charset (and a content-type) with each and every file, and use some Unicode encoding instead of ASCII.

Some newer languages took advantage of not having legacy sources. Perl is older than Unicode, and has a big legacy of old code that has to be supported. Perl 5 is about as old as Unicode, but Unicode was simply not relevant when Perl 5 was released.

Sure, it would have been nice to have Perl 5.000 with full Unicode support, but what operating system would have been able to run it?

What operating system can currently provide perl with a complete Unicode environment (%ENV, @ARGV, STDIN, STDOUT, STDERR, open, opendir, mkdir, rmdir, unlink, ...)?

All Unicode problems are still transition problems. Your hypothetical "everything-is-Unicode" flag could be implemented some day, when all Perl module authors (or at least those of the major modules) have changed their code to fully support Unicode, and when Perl can use a Unicode API on all major operating systems.

Look at DBI and the various DBDs. The first DBI version having a little bit of Unicode support is 1.38 dated 2003-Aug-21. DBD::Oracle got some Unicode support in 1.13 dated 2003-Mar-14, but to get real Unicode support, you needed at least Oracle 9, released in 2001. DBD::Pg got Unicode support with version 1.22, dated 2003-Mar-26. DBD::ODBC had no Unicode support at all until I started messing with its code and the Windows API and published a patch 2006-Mar-03. After some discussions on dbi-users, Martin J. Evans cleaned up after me and released DBD::ODBC 1.14 dated 2007-Jul-17 with minimal Unicode support. DBD::mysql got the first parts of its Unicode support in 3.0004_1, dated 2006-May-17.

And now, file APIs. Perl on Windows uses the ASCII APIs for file I/O, probably because using the Unicode APIs would break lots of code, especially when it comes to command line arguments and the environment. And perhaps because until recently, Perl supported Windows 9x, which lacks several parts of the Unicode APIs. On other systems, there aren't even APIs where programs can talk in Unicode with the operating system.

So, what can be done?

  • Try to find Unicode APIs. ODBC was easy, because it already existed and was (kind of) documented.
  • Try to get Unicode APIs implemented. Again, ODBC was easy because it was already done. Operating systems will be hard, because you need to change everything: kernel APIs, process structures, file systems, shells, standard utilities.
  • Provide patches and tests to get the Unicode APIs implemented.
  • Provide patches and tests to get Perl and Module code ported to the Unicode APIs.
  • If you can not do that, make people talk about the problems. Try to get them in a room and let them find a solution.
  • Or find a sponsor that pays someone to solve a problem. I wrote my Unicode patch during my work hours, simply because my work project needed it. After a short discussion with my boss ("We took so much from the community, now let's pay back a little by publishing that patch - it does not harm anybody and does not expose any of our secrets"), I got the permission to publish it.

We won't be able to make a big jump forward, flip a switch and have all Unicode problems solved. But we can make small steps. Every journey begins with a single step.

Expect a few more years until Unicode has truly become universal, and a few more years for all code writers to keep up. I think that the major problems at the O/S and network level need to be solved first, before we can change Perl. Windows could be a good test environment, because it already has Unicode APIs.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)


Re^8: any use of 'use locale'? (source encoding)
by wanradt (Scribe) on Nov 22, 2009 at 22:46 UTC

    Thank you, Alexander! It broadened my view, but it did not answer my most important questions. You answered as if I wanted to push Unicode on everyone and everywhere. That's not my goal.

    I'd like to have a sandbox in Perl, where unicode were treated naturally.

    To make it clearer: I am not very familiar with Perl's history, but let me assume that at some stage there was no strict pragma. OK? Then someone thought it might be a good idea and found a way to implement it. Did that break any earlier code? I don't think so. But it made the strict pragma widely available.

    That is what I am talking about now. As far as I can see, module authors have no way to tell whether the module's caller uses utf8 or not. Am I correct? And would it break any earlier code if they had such a possibility? That would be a single step, IMHO :)

    What operating system can currently provide perl with a complete Unicode environment (%ENV, @ARGV, STDIN, STDOUT, STDERR, open, opendir, mkdir, rmdir, unlink, ...)?

    I have not deeply investigated how Unicode-proof Linux is by now, but at the system level I haven't had any complaints for years (Debian and Kubuntu). If you could give me some hints on how to determine Unicode use, I'd like to test it.

    Nõnda, WK
      You answered as if I wanted to push Unicode on everyone and everywhere. That's not my goal.

      No, and I did not understand you like that. I just wanted to explain why Unicode support still sucks so much. It's not just a Perl problem; some big problems with Unicode are outside our control. I would be happy if I could use utf8_for_everything;, but that cannot work today. We could end up implementing use utf8_where_possible; plus the same manual fiddling to turn on Unicode in subsystems that do not (yet) understand use utf8_where_possible; or that need workarounds.

      I'd like to have a sandbox in Perl, where unicode were treated naturally.

      Unicode is largely treated naturally in Perl (at least since 5.8.1). You put binary data, a string in a legacy encoding, or a string in a Unicode encoding into a scalar and everything works as expected. All the magic happens behind the scenes. Things become ugly as soon as you start interfacing with the outside world, e.g. STDIN, STDOUT, STDERR, %ENV, @ARGV, and external libraries (XS code). DBI and the DBD::* modules currently gain more and more Unicode support, simply by reviewing and changing every place in the code where Perl scalars become C strings and vice versa to respect or to set the internal Unicode flag of the Perl scalar. Sometimes by passing a Perl scalar further down into the code instead of passing a C string (this happened with some internal DBI APIs). Sometimes by converting Perl's idea of Unicode to and from what the operating system or a library expects. (This happens in DBD::ODBC.)

      To make it clearer: I am not very familiar with Perl's history, but let me assume that at some stage there was no strict pragma. OK? Then someone thought it might be a good idea and found a way to implement it. Did that break any earlier code? I don't think so. But it made the strict pragma widely available.

      This is part of your problem with understanding the problems of Unicode support. Perl has a long history and culture of NOT breaking old code. use strict is an example of this. The inventors of strict could have turned strict on by default, and forced people to update the legacy code by either adding no strict or by cleaning up the old junk. This would perhaps have reduced the amount of bad Perl code a lot, and would have forced newbies to write cleaner code. But many people would have gotten very angry because millions of lines of code would have stopped working from one day to the next, just because the latest f*** Perl update started bean counting instead of getting the job done.

      The same thing happened with Unicode support, and you will find some good explanations inside the Perl documentation why Unicode support is largely OFF by default. Turning it on by default would have broken even more code that assumes a character is a byte.

      That is what I am talking about now. As far as I can see, module authors have no way to tell whether the module's caller uses utf8 or not. Am I correct? And would it break any earlier code if they had such a possibility? That would be a single step, IMHO :)

      Wrong problem. The module caller may use a mix of Unicode, legacy encoding and binary data at any time. For any function or method in a module or class, it is completely irrelevant if "the caller uses utf8" or not.

      Modules (or better: their authors) must no longer assume that scalars contain bytes; they contain arbitrarily large characters. length returns the number of characters in a scalar. If the internal Unicode flag on a scalar is turned off, the module may safely assume that the scalar contains bytes, either binary data or a legacy encoding. When it is on, it must correctly handle large characters. When interfacing with the outside world (O/S, network, database, ...), it may be necessary to convert the large characters to a different encoding (and back, of course). Whenever scalars are returned, they may either have the Unicode flag set and may contain large characters, or they have the flag cleared and must not contain large characters, not even as a UTF byte stream. (Except, of course, the purpose of the function is to generate UTF byte streams.)
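A small sketch of that boundary rule, using the core Encode module. The function names to_wire and from_wire are made up; the "outside world" is assumed to speak UTF-8 here, which in real code depends on the O/S, protocol, or database in question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical boundary functions: characters inside the module,
# UTF-8 bytes towards the outside world (assumption: the other side wants UTF-8).
sub to_wire   { my ($chars) = @_; return encode('UTF-8', $chars); }
sub from_wire { my ($bytes) = @_; return decode('UTF-8', $bytes); }

my $chars = "\x{263A}\x{444}";     # smiling face and Cyrillic ef: 2 large characters
my $bytes = to_wire($chars);       # the same text as a UTF-8 byte stream

# utf8::is_utf8() reports the internal Unicode flag; it needs no "use utf8".
printf "chars: length %d, flag %s\n", length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
printf "bytes: length %d, flag %s\n", length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "round trip %s\n", from_wire($bytes) eq $chars ? 'ok' : 'BROKEN';
```

The character scalar has length 2 with the flag on, the encoded one has length 5 with the flag off. Only the second form should ever cross the module boundary towards a byte-oriented API.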

      Many modules do not need changes, because they did not assume byte==character from the beginning, and so Perl automatically does the right thing. Some modules tried to handle Unicode all by themselves even before Perl had Unicode support; Template::Toolkit seems to be such a module. They mostly work, and as long as you don't mix them with really Unicode-capable modules, nothing wrong happens. Their only problem is that their scalars contain UTF byte streams instead of large characters. This can only be solved by either dropping support for legacy Perl versions (i.e. use 5.008_001) or by having the module code behave differently for old and new Perl versions.
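What goes wrong when the two kinds of scalars are mixed can be shown in a few lines. This sketch does not use any real module's output; the two variables merely stand in for a value from a module that returns UTF byte streams and a value from a Unicode-aware module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for a module that returns a UTF-8 byte stream (flag off):
my $byte_stream = encode('UTF-8', "\x{E9}");   # e-acute as the two bytes C3 A9

# Stand-in for a Unicode-aware module that returns characters (flag on):
my $characters = "\x{263A}";                   # smiling face, one large character

# Concatenation makes Perl treat each of the two bytes as its own character.
my $mixed = $byte_stream . $characters;

# Encoding for output then encodes those two bytes a second time.
my $broken  = encode('UTF-8', $mixed);
my $correct = encode('UTF-8', "\x{E9}\x{263A}");

printf "mixed: %d characters, broken: %d bytes, correct: %d bytes\n",
    length($mixed), length($broken), length($correct);
```

The mixed string has 3 characters instead of 2, and its output is 7 bytes instead of 5: the classic double-encoded garbage, produced without a single warning.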

      I have not deeply investigated how Unicode-proof Linux is by now, but at the system level I haven't had any complaints for years (Debian and Kubuntu). If you could give me some hints on how to determine Unicode use, I'd like to test it.

      OK, some simple problems. "Unicode string" here means a string containing characters outside the ASCII and ISO range, e.g. the smiling face, Cyrillic letters, or the like. See http://cpansearch.perl.org/src/MJEVANS/DBD-ODBC-1.23/t/40UnicodeRoundTrip.t for examples. "Legacy string" means a string in any non-Unicode encoding, like ASCII, ISO-8859-x, the various Asian encodings, and so on.

      • Create two environment variables named FOO and BAR, one with a Unicode string of exactly 10 characters as value, the other one with a legacy string of exactly 10 characters. Choose randomly which variable gets the Unicode string. fork() and exec() some child processes (Scripts in bash, perl, ash, ksh, python, ruby, lua, ..., and perhaps some compiled programs written in C, C++, Assembler, Fortran, ...) and let each process report the number of characters in both FOO and BAR, without(!) telling the child processes which of the two variables actually contains Unicode characters.
      • Create a Perl script that writes randomly either a legacy string or a Unicode string to STDOUT, both containing exactly 10 characters. You may use binmode STDOUT,':utf8' and the like to get rid of all warnings and errors. Create a second program (in Perl or any other language) that reads its STDIN and reports the number of characters it read from STDIN. Connect both programs using a pipe, like this: perl writer.pl | perl reader.pl.
      • Create a Perl script that randomly selects either a legacy string or a Unicode string of exactly 10 characters and passes that string as only argument to child processes written in various languages. Each child process must report the number of characters passed as arguments.
      • Create a file whose name is a Unicode string. Does ls display it correctly? On both the console and via telnet and ssh from different other operating systems? Can rm remove it without resorting to rm -rf *? Can you copy and move it inside Midnight Commander? Does your preferred X environment display the name correctly? On the desktop and in the file manager? Even inside File-Open dialogs? Even in programs that are not part of the X environment (like Firefox)? What about other X environments (Gnome, KDE, xfce, ...)? Can you pass the file as a command line argument to arbitrary programs and can they open it? Does the filename still look ok when you share the file via FTP, HTTP, SMB, NFS, rsync? Can you still open it over the network from Linux, Windows, *BSD, Solaris, MacOS? Can you overwrite it over the network?
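A rough sketch of the first test, reduced to a single child written in Perl. The helper name child_env_length is made up, a Unix-like system is assumed (list-form pipe open), and the child deliberately does no decoding at all, since it is not told which variable holds the Unicode string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: start a child perl that reports length($ENV{NAME})
# exactly as it received it. $^X is the path of the running perl; the list
# form of open avoids the shell.
sub child_env_length {
    my ($name, $bytes) = @_;
    local $ENV{$name} = $bytes;
    open my $child, '-|', $^X, '-e', 'print length $ENV{$ARGV[0]}', $name
        or die "cannot start child: $!";
    my $count = <$child>;
    close $child or die "child failed: $?";
    return $count;
}

my $unicode = encode('UTF-8',      "\x{263A}" x 10);   # 10 characters as UTF-8
my $legacy  = encode('ISO-8859-1', "\x{E9}"   x 10);   # 10 characters as ISO-8859-1

printf "FOO (Unicode string): child counts %d\n", child_env_length(FOO => $unicode);
printf "BAR (legacy string):  child counts %d\n", child_env_length(BAR => $legacy);
```

Both strings have exactly 10 characters, but this child counts 30 for FOO and 10 for BAR. A child that blindly decoded both as UTF-8 would get FOO right and BAR wrong instead, so neither strategy passes the test.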

      Yes, these are stupid little tests, except for the last one. They are very similar to what the UnicodeRoundTrip test linked above does. You would not believe how often that simple test broke. And before I added the test, I had even more problems with data that was modified somewhere between Perl and the database engines, causing other tests to break very mysteriously or even to fail silently.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        I'd like to have a sandbox in Perl, where unicode were treated naturally.
        Unicode is largely treated naturally in Perl (at least since 5.8.1).

        Agreed. But getting "sand" (characters) into this box is painful. AFAIU, Perl already has today all (or at least most) of the pieces needed to control the bits coming from the outside world. (And let us assume that the OS is perfect, because it is out of our control.) Those pieces are OUT THERE, but not together.

        I can use them together, but it took too much time to put the puzzle together. I am still not sure whether the pieces are now correctly in place, and the whole thing is too fragile. If something changes in Perl development, I may end up with broken code as well. For example: for me, using -C on the first line was a great solution, but from 5.10 on it was deprecated. What should I do?

        Instead of such a puzzle, I'd like to have something that puts those techniques together correctly, so a beginner or any Unicode user could just say something like:

        use utf8_everywhere;

        How could it break any older code?
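A rough sketch of what such a pragma could bundle from pieces that already exist. The package name My::Utf8Everywhere is made up for this sketch, UTF-8 is assumed for the source code, the standard handles, and the command line, and %ENV and file names are not covered at all.

```perl
use strict;
use warnings;

# Hypothetical pragma-like package; the name is made up for this sketch.
package My::Utf8Everywhere;
use utf8 ();                 # load utf8.pm without enabling it here
use Encode qw(decode);

sub import {
    # Lexical effect: string literals in the caller's source are UTF-8.
    utf8->import;
    # Global effect: the standard handles encode and decode UTF-8.
    binmode $_, ':encoding(UTF-8)' for \*STDIN, \*STDOUT, \*STDERR;
    # Global effect: the command line is assumed to be UTF-8.
    @ARGV = map { decode('UTF-8', $_) } @ARGV;
    return;
}

package main;

# In a separate file this would simply be "use My::Utf8Everywhere;".
BEGIN { My::Utf8Everywhere->import }

print "STDOUT layers: @{[ PerlIO::get_layers(STDOUT) ]}\n";
```

Only the first step is limited to the calling file. The layers on the standard handles and the decoded @ARGV are global, so any other code in the same program that prints already encoded bytes or expects raw bytes in @ARGV sees the change too.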

        Btw, I am still working on your last test block (the first 3 depend on Perl, which I don't completely trust itself). No major problems so far with files (named 'zzzⲊфӨ✺☻.txt' and 'zzzⲊфӨ✺☻.svg'), but I have limited network possibilities for now and no OSes other than Kubuntu. One tiny problem so far: Padre (!) file dialogs do not use my locale to sort files. So far I am pretty sure that the major Linux distros are Unicode-ready in their core (system level), even if we can find some apps or other OSes which can't work together.

        Nõnda, WK
