Btw, I am still testing your last test block (the first 3 depend on Perl, which I don't completely trust itself)

The first test does not depend on perl; you can use any language. And for the next two tests, I was just lazy. It is OK to use some other language you trust instead. Use C if you have no better idea, or Ruby, Lua, FORTRAN, Pascal, Modula, bash, ksh, Postscript, whatever you like that has the required features.

[...] so beginner or any Unicode user could just say something like: use utf8_everywhere; How could it break any older code?

Pretending it were currently possible to implement "use utf8_everywhere", it would break every single piece of code that assumes characters are bytes. Just look at people who write perl scripts on Unix-like systems that read binary data. open-read-close and open-write-close have worked fine for the last few decades; there never was a need to insert a binmode statement. You could even read binary data from STDIN and write it to STDOUT and STDERR. Of course, if you wanted to be extra sure or thought of porting the script to classic MacOS, the Microsoft world, or some strange IBM machines, you would insert binmode. But in the real world, binmode is not used everywhere it should be, and the code still works flawlessly. Forcing UTF-8 semantics on STDIN, STDOUT, STDERR and all filehandles you open until you explicitly turn it off (using binmode), and even forcing UTF-8 semantics on code you did not write (scripts and modules using your module or sourcing your script), will break all binary I/O severely. Remember, you defined "use utf8_everywhere" to work for the entire process and without exceptions.
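
To make the breakage concrete, here is a minimal sketch in Python rather than Perl (any language will do for this point). It is only an analogy for what a forced UTF-8 layer does to binary data, not a model of Perl's actual I/O layers; the helper names read_as_utf8 and write_as_utf8 are made up for illustration.

```python
# Arbitrary binary payload: every possible byte value exactly once.
payload = bytes(range(256))


def read_as_utf8(raw: bytes) -> str:
    """Decode strictly as UTF-8, as a forced UTF-8 input layer would have to."""
    return raw.decode("utf-8")


def write_as_utf8(raw: bytes) -> bytes:
    """Treat each byte as a character (code point 0..255) and emit it as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


# Input side: arbitrary bytes are usually not valid UTF-8 at all.
try:
    read_as_utf8(payload)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Output side: every byte above 0x7F becomes a two-byte sequence.
written = write_as_utf8(payload)

print("strict UTF-8 decode failed:", decode_failed)
print("bytes in:", len(payload), "bytes out:", len(written))
```

Either way the binary data does not survive: on input it is rejected or mangled, on output it silently grows.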

As I tried to explain, it is currently impossible to implement "use utf8_everywhere", simply because the world outside of perl(.exe) is not yet ready to handle Unicode. The first three tests will clearly demonstrate that.

Assume you read four bytes 0x42 0xC3 0x84 0x48, e.g. from STDIN, a file you opened, an inherited file handle, a command line argument or an environment variable. How many characters do these bytes represent? Explain why.

Possible answers:

  • 3, because it's a UTF-8 encoded string, containing the letters B, Ä, and H.
  • 4, because it's a legacy encoded string, containing the letters B and H and two non-ASCII letters between them.
  • 4, because it's an EBCDIC-273 encoded string, containing the letters â, C, d, and ç.
  • 0, because it's binary data from a larger stream, encoding the 32-bit integer 0x4884C342
  • 0, because it's binary data from a larger stream, encoding the 32-bit integer 0x42C38448
  • 0, because it's binary data from a larger stream, encoding two 16-bit integers
  • 42, because it's a 32-bit handle of a GUI resource string of 42 characters
  • 2, because it's a legacy encoded string using two bytes per character
  • 1.33333, because it's a string encoded in a future 24-bit encoding.
  • 0.5, because it's a string encoded in an ancient martian charset ;-)

As you see, it depends on context. In most cases, the operating system does not give you any context information. And for most operating systems, APIs, command line parameters, environment variables, and of course, file I/O are defined in terms of bytes, not in terms of characters. For the environment and command line parameters it is relatively safe to assume that the bytes represent some characters, but you don't have any idea which encoding is used. It could be a legacy encoding, it could be UTF-8, or something completely different, like EBCDIC. If you compile a program to run on Windows using the wide API ("Unicode application"), environment and command line are encoded as UCS-2 (or UTF-16LE, if Microsoft updated the API spec since the last time I read parts of it).
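
Several of the possible answers can be reproduced with a short sketch, again in Python since the language does not matter here. It decodes the same four bytes under a few different assumptions. Latin-1 stands in for "a legacy encoding" and UTF-16LE for "two bytes per character"; both are my picks for illustration, not the only candidates.

```python
import struct

raw = bytes([0x42, 0xC3, 0x84, 0x48])

# The same four bytes, read as text under different assumed encodings.
as_utf8 = raw.decode("utf-8")          # UTF-8: 0xC3 0x84 is one character
as_latin1 = raw.decode("latin-1")      # assumed legacy encoding: one char per byte
as_ebcdic = raw.decode("cp273")        # EBCDIC code page 273
as_utf16le = raw.decode("utf-16-le")   # assumed two-bytes-per-character encoding

# The same four bytes, read as binary integers instead of text.
(le32,) = struct.unpack("<I", raw)     # one little-endian 32-bit integer
(be32,) = struct.unpack(">I", raw)     # one big-endian 32-bit integer
le16_pair = struct.unpack("<2H", raw)  # two little-endian 16-bit integers

for name, text in [("utf-8", as_utf8), ("latin-1", as_latin1),
                   ("cp273", as_ebcdic), ("utf-16-le", as_utf16le)]:
    # ascii() keeps the output printable on any terminal encoding.
    print(f"{name}: {len(text)} characters, {ascii(text)}")

print(f"32-bit LE: 0x{le32:08X}, 32-bit BE: 0x{be32:08X}")
print("16-bit LE pair:", [f"0x{v:04X}" for v in le16_pair])
```

Nothing in the bytes themselves says which of these readings is the right one; that information has to come from outside.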


Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to Re^11: any use of 'use locale'? (source encoding) by afoken
in thread any use of 'use locale'? by wanradt
