http://www.perlmonks.org?node_id=809959


in reply to Re^10: any use of 'use locale'? (source encoding)
in thread any use of 'use locale'?

Btw, I am still testing your last test block (the first 3 depend on Perl, which I don't completely trust itself)

The first test does not depend on Perl; you can use any language. For the next two tests, I was just lazy. It is OK to use some other language you trust instead. Use C if you have no better idea, or Ruby, Lua, FORTRAN, Pascal, Modula, bash, ksh, PostScript, whatever you like that has the required features.

[...] so a beginner or any Unicode user could just say something like: use utf8_everywhere; How could it break any older code?

Pretending it were currently possible to implement "use utf8_everywhere", it would break every single piece of code that assumes characters are bytes. Just look at people who write Perl scripts on Unix-like systems that read binary data. open-read-close and open-write-close have worked fine for the last few decades; there never was a need to insert a binmode statement. You could even read binary data from STDIN and write it to STDOUT and STDERR. Of course, if you wanted to be extra sure, or thought of porting the script to classic MacOS, the Microsoft world, or some strange IBM machines, you would insert binmode. But in the real world, binmode is not used everywhere it should have been, and the code still works flawlessly. Forcing UTF-8 semantics on STDIN, STDOUT, STDERR and all filehandles you open until you explicitly turn it off (using binmode), and even forcing UTF-8 semantics on code you did not write (scripts and modules using your module or sourcing your script), will severely break all binary I/O. Remember, you defined "use utf8_everywhere" to work for the entire process and without exceptions.
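To make the breakage concrete, here is a minimal sketch of the kind of byte-oriented copy that works today on Unix-like systems without any binmode; the file names are made up for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Classic byte-oriented copy, no binmode anywhere -- works fine on Unix.
    my ($src, $dst) = ('photo.jpg', 'photo-copy.jpg');   # made-up names

    open my $in,  '<', $src or die "open $src: $!";
    open my $out, '>', $dst or die "open $dst: $!";

    local $/;                     # slurp the whole file at once
    my $data = <$in>;
    print {$out} $data;

    close $out or die "close $dst: $!";
    close $in;

    # If some "use utf8_everywhere" pragma had silently pushed an
    # :encoding(UTF-8) layer onto these handles, reading the JPEG would
    # warn or die on byte sequences that are not valid UTF-8, and the
    # written copy would be re-encoded and therefore corrupted.

Run against any non-text file, this copy is byte-for-byte identical; the same script with a forced UTF-8 layer is not.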

As I tried to explain, it is currently impossible to implement "use utf8_everywhere", simply because the world outside of perl(.exe) is not yet ready to handle Unicode. The first three tests will clearly demonstrate that.

Assume you read four bytes 0x42 0xC3 0x84 0x48, e.g. from STDIN, a file you opened, an inherited file handle, a command line argument or an environment variable. How many characters do these bytes represent? Explain why.

Possible answers:

- 4 characters, if the bytes are in a single-byte encoding such as ISO-8859-1 or Windows-1252 (B, Ã, a C1 control or „, H)
- 3 characters (B, Ä, H), if the bytes are UTF-8
- 2 characters, if the bytes are UTF-16 (either endianness)
- 4 completely different characters, if the bytes are EBCDIC
- no characters at all, if the bytes are just binary data, e.g. part of a JPEG image
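A small sketch with the core Encode module shows how the count changes with the assumed encoding (the byte string is hard-coded here just for the demonstration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = "\x42\xC3\x84\x48";    # the four bytes from above

    for my $enc ('ISO-8859-1', 'UTF-8', 'UTF-16BE') {
        # decode() interprets the bytes according to $enc and returns
        # a string of characters (code points).
        my $string = decode($enc, $bytes);
        printf "%-10s: %d character(s): %s\n", $enc, length($string),
            join ' ', map { sprintf 'U+%04X', ord } split //, $string;
    }

This prints 4 characters for ISO-8859-1, 3 for UTF-8 (B, Ä, H) and 2 for UTF-16BE -- the same bytes, three different answers.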

As you see, it depends on context. In most cases, the operating system does not give you any context information. And for most operating systems, APIs, command line parameters, environment variables, and of course file I/O are defined in terms of bytes, not in terms of characters. For the environment and command line parameters it is relatively safe to assume that the bytes represent some characters, but you have no idea which encoding was used. It could be a legacy encoding, it could be UTF-8, or something completely different, like EBCDIC. If you compile a program to run on Windows using the wide API (a "Unicode application"), the environment and command line are encoded as UCS-2 (or UTF-16LE, if Microsoft has updated the API spec since I last read parts of it).
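On a POSIX system the best a script can do for @ARGV and %ENV is to ask the locale and decode explicitly; a sketch, assuming the caller really used the locale's encoding (nothing enforces that):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(decode);
    use I18N::Langinfo qw(langinfo CODESET);

    # Ask the current locale which encoding it claims to use,
    # e.g. "UTF-8" or "ISO-8859-1". This is a claim, not a guarantee.
    my $codeset = langinfo(CODESET);
    print "locale codeset: $codeset\n";

    # @ARGV arrives as raw bytes; decode it by hand.
    for my $i (0 .. $#ARGV) {
        my $chars = decode($codeset, $ARGV[$i]);
        printf "arg %d: %d byte(s) -> %d character(s)\n",
            $i, length($ARGV[$i]), length($chars);
    }

The same script started by a caller that used a different encoding (or from a cron job running in the C locale) will happily decode the wrong way, which is exactly the point of the tests above.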

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)