http://www.perlmonks.org?node_id=809405


in reply to Re^9: any use of 'use locale'? (source encoding)
in thread any use of 'use locale'?

I'd like to have a sandbox in Perl, where Unicode would be treated naturally.
Unicode is largely treated naturally in Perl (at least since 5.8.1).

Agreed. But getting "sand" (characters) into this box is painful. AFAIU, Perl today already has all (or at least most of) the pieces needed to control the bits coming from the outside world. (And let us assume the OS is perfect, because it is out of our control.) Those pieces are OUT THERE, but not together.

I can use them together, but it took too much time to put the puzzle together. I am still not sure whether the pieces are now correctly in place, and the whole thing is too fragile. If something changes in Perl development, my code may break as well. For example: using -C on the first line was a great solution for me, but as of 5.10 it is no longer accepted there. What should I do?
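
To illustrate the kind of change I mean (the fatal message is the one 5.10 gives for -C on the #! line; the in-script lines below are only one rough replacement, not an exact equivalent of -CSDA):

    #!/usr/bin/perl
    # "#!/usr/bin/perl -CSDA" worked under 5.8.x, but since 5.10.0 perl dies
    # with  Too late for "-C" option  unless -C is also on the command line.
    # Roughly the same effect, requested from inside the script:
    use open ':std', ':encoding(UTF-8)';         # UTF-8 on STD* and as default layer
    use Encode qw(decode);
    @ARGV = map { decode('UTF-8', $_) } @ARGV;   # the -CA part has no pragma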

Instead of such a puzzle, I'd like to have something which puts those techniques together correctly, so that a beginner or any Unicode user could just say something like:

use utf8_everywhere;

How could it break any older code?
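
A minimal sketch of what I have in mind, assuming the pragma does nothing more than bundle the existing pieces ("use utf8", "use open", binmode) mentioned above; the name and implementation are only for illustration:

    package utf8_everywhere;    # hypothetical pragma, not a real module
    use strict;
    use warnings;
    use utf8 ();                # load without importing
    use open ();

    sub import {
        # import() runs at the caller's compile time, so the lexical hints
        # set here apply to the scope that says "use utf8_everywhere".
        utf8->import;                                  # source code is UTF-8
        'open'->import(':std', ':encoding(UTF-8)');    # STD* handles plus
                                                       # default layers for open()
        # @ARGV and %ENV still arrive as bytes; guessing their encoding is
        # exactly the hard part.
    }

    1;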

Btw, I am still testing your last test block (the first 3 depend on Perl, which I don't completely trust itself). No major problems so far with files (named 'zzzⲊфӨ✺☻.txt' and 'zzzⲊфӨ✺☻.svg'), but I have limited network possibilities for now and no OSes other than Kubuntu. One tiny problem so far: Padre (!) file dialogs do not use my locale to sort files. So far I am pretty sure that the main Linux distros are Unicode-ready at the core (system level), even if we can find some apps or other OSes which can't play along.

Nõnda, WK

Re^11: any use of 'use locale'? (source encoding)
by afoken (Chancellor) on Nov 29, 2009 at 00:59 UTC
    Btw, I am still testing your last test block (the first 3 depend on Perl, which I don't completely trust itself)

    The first test does not depend on perl; you can use any language. And for the next two tests, I was just lazy. It is OK to use some other language you trust instead. Use C if you have no better idea, or Ruby, Lua, FORTRAN, Pascal, Modula, bash, ksh, PostScript, whatever you like that has the required features.

    [...] so that a beginner or any Unicode user could just say something like: use utf8_everywhere; How could it break any older code?

    Pretending it would currently be possible to implement "use utf8_everywhere", it would break every single piece of code that assumes characters are bytes. Just look at people who write perl scripts on Unix-like systems that read binary data. open-read-close and open-write-close worked fine for the last few decades; there never was a need to insert a binmode statement. You could even read binary data from STDIN and write it to STDOUT and STDERR. Of course, if you wanted to be extra-sure or thought of porting the script to classic MacOS, the Microsoft world, or some strange IBM machines, you would insert binmode. But in the real world, binmode is not used everywhere it should have been used, and the code works flawlessly.

    Forcing UTF-8 semantics on STDIN, STDOUT, STDERR and all filehandles you open until you explicitly turn it off (using binmode), and even forcing UTF-8 semantics on code you did not write (scripts and modules using your module or sourcing your script), will break all binary I/O severely. Remember, you defined "use utf8_everywhere" to work for the entire process and without exceptions.
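
    A sketch of that breakage: the same byte-oriented write, once through a plain handle and once through a forced UTF-8 layer (file names are placeholders):

        use strict;
        use warnings;

        my $data = "\x89PNG\r\n\x1a\n";           # 8 bytes, one of them > 0x7F

        open my $raw, '>', 'plain.bin' or die $!;
        print {$raw} $data;                       # the decades-old idiom, no binmode
        close $raw;

        open my $forced, '>:encoding(UTF-8)', 'forced.bin' or die $!;
        print {$forced} $data;                    # 0x89 is written as the two bytes C2 89
        close $forced;

        printf "%d vs %d bytes\n", -s 'plain.bin', -s 'forced.bin';   # 8 vs 9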

    As I tried to explain, it is currently impossible to implement "use utf8_everywhere", simply because the world outside of perl(.exe) is not yet ready to handle Unicode. The first three tests will clearly demonstrate that.

    Assume you read four bytes 0x42 0xC3 0x84 0x48, e.g. from STDIN, a file you opened, an inherited file handle, a command line argument or an environment variable. How many characters do these bytes represent? Explain why.

    Possible answers:

    • 3, because it's a UTF-8 encoded string, containing the letters B, Ä, and H.
    • 4, because it's a legacy encoded string, containing the letters B and H and two non-ASCII letters between them.
    • 4, because it's an EBCDIC-273 encoded string, containing the letters â, C, d, and ç.
    • 0, because it's binary data from a larger stream, encoding the 32-bit integer 0x4884C342.
    • 0, because it's binary data from a larger stream, encoding the 32-bit integer 0x42C38448.
    • 0, because it's binary data from a larger stream, encoding two 16-bit integers.
    • 42, because it's a 32-bit handle of a GUI resource string of 42 characters.
    • 2, because it's a legacy encoded string using two bytes per character.
    • 1.33333, because it's a string encoded in a future 24-bit encoding.
    • 0.5, because it's a string encoded in an ancient martian charset ;-)
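
    To make the ambiguity concrete, here are the same four bytes run through a few of these interpretations (a sketch using the core Encode module; cp1047 stands in for EBCDIC-273, which may not be available in every Encode build):

        use strict;
        use warnings;
        use Encode qw(decode);

        my $bytes = "\x42\xC3\x84\x48";

        for my $enc ('UTF-8', 'ISO-8859-1', 'cp1047') {
            my $str = decode($enc, $bytes);
            printf "%-10s: %d characters (%s)\n", $enc, length($str),
                join ' ', map { sprintf 'U+%04X', ord } split //, $str;
        }

        # The very same bytes as one 32-bit integer, in both byte orders:
        printf "little-endian: 0x%08X\n", unpack 'V', $bytes;   # 0x4884C342
        printf "big-endian:    0x%08X\n", unpack 'N', $bytes;   # 0x42C38448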

    As you see, it depends on context. For most cases, the operating system does not give you any context information. And for most operating systems, APIs, command line parameters, environment variables, and of course, file I/O are defined in terms of bytes, not in terms of characters. For the environment and command line parameters it is relatively safe to assume that the bytes represent some characters, but you don't have any idea which encoding is used. It could be a legacy encoding, it could be UTF-8, or something completely different, like EBCDIC. If you compile a program to run on Windows using the wide API ("Unicode application"), environment and command line are encoded as UCS-2 (or UTF-16LE, if Microsoft updated the API spec since the last time I read parts of it).
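
    On POSIX systems the usual (and only) hint is the codeset announced by the locale; here is a sketch of asking for it and decoding @ARGV under that assumption (with the default substitution behaviour, so malformed bytes become U+FFFD instead of dying):

        use strict;
        use warnings;
        use POSIX qw(setlocale LC_CTYPE);
        use I18N::Langinfo qw(langinfo CODESET);
        use Encode qw(decode);

        setlocale(LC_CTYPE, '');             # adopt the locale from the environment
        my $codeset = langinfo(CODESET);     # e.g. "UTF-8" or "ISO-8859-15"

        # Still only a convention -- nothing guarantees the bytes really match it.
        my @args = map { decode($codeset, $_) } @ARGV;

        printf "codeset %s, %d argument(s) decoded\n", $codeset, scalar @args;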

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      I have more thinking to do about your answer, but I can already go over the main (for me) questions.

      Pretending it would currently be possible to implement "use utf8_everywhere", it would break every single piece of code that assumes characters are bytes.

      Do you mean that it would break any code which I incorporate into my project if I declare "use utf8_everywhere"? Then why does it not break now, when I use the same things (like "use utf8", "use open", binmode...) separately?

      Or do you mean it will break "characters are bytes" code if that code uses "use utf8_everywhere"? Then I agree, but why would such code need that declaration?

      Assume you read four bytes 0x42 0xC3 0x84 0x48, e.g. from STDIN, a file you opened, an inherited file handle, a command line argument or an environment variable. How many characters do these bytes represent? Explain why

      If I have declared a fully UTF-8 environment, I treat them as 3 UTF-8 characters. If I need some other behaviour, I ask for it explicitly for the source those bytes come from. Where is the problem?
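
      As a sketch of that behaviour (the file name 'image.png' is only a placeholder): UTF-8 is the declared default, and the one binary source asks for raw bytes explicitly:

          use strict;
          use warnings;
          use utf8;                              # source code is UTF-8
          use open ':std', ':encoding(UTF-8)';   # default for STD* and open()

          # The one place that really wants bytes says so explicitly:
          open my $bin, '<:raw', 'image.png' or die $!;
          read $bin, my $buf, 4;                 # four raw bytes, no decoding
          close $bin;
          printf "read %d byte(s)\n", length $buf;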

      Nõnda, WK