Re^9: Standard handles inherited from a utf-8 enabled shell
by repellent (Priest) on Mar 22, 2012 at 09:59 UTC
You mentioned (emphasis mine):
I agree with this, but I believe we have different assumptions about what is meant by "interpretation". Look, I need a way to refer to that number, because that is fundamental. I call that number a "character". The value of that number is what I call the "codepoint value". Bear with me: forget "Unicode" for now, and grant me the use of those words. At any time, you may s/character|codepoint/_that_number_/gi.
Before that sentence, you mentioned:
Well, that number is 255 == ord(pack 'B8', '11111111'). Saying it's a (single) byte means you've established the number of bits for it is 8. That, to me, is giving the number an interpretation(*). This observation is very important when it comes to the subject of encoding, especially when we're to print that character (i.e. that number).
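To make that concrete, here is a minimal sketch, assuming the core Encode module is available (the variable names are mine, purely for illustration). The number stays 255 throughout; how many octets it takes only becomes a fact once an encoding is picked:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The character is just a number; nothing here fixes how many bits it "has".
my $char      = pack 'B8', '11111111';
my $codepoint = ord $char;

# The octet count only appears once an encoding has been chosen.
my $latin1 = encode('ISO-8859-1', $char);
my $utf8   = encode('UTF-8',      $char);

# Illustrative helper: render a string of octets as hex pairs.
sub hex_octets { join ' ', map { sprintf '%02X', ord } split //, shift }

printf "codepoint value: %d\n", $codepoint;                              # 255
printf "ISO-8859-1: %d octet(s): %s\n", length($latin1), hex_octets($latin1);  # 1: FF
printf "UTF-8:      %d octet(s): %s\n", length($utf8),   hex_octets($utf8);    # 2: C3 BF
```

Same character, same codepoint value, two different octet sequences. Calling it "a byte" up front amounts to picking the first encoding before being asked.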
If you want to print a string, you should avoid any preconceived notion of how many bits the string "has" before deciding which encoding to use. I find that thinking in terms of characters (i.e. those numbers) and their codepoint values (i.e. the number values) helps tremendously in my handling of strings, right up to the point where they are encoded on their way out through print. That is my thought process, and the message I was trying to deliver.
(*) I am aware of the details of how perl stores that number in memory, though I am not as well versed in them as you are. I would like to reiterate that this discussion is about print and encoding, and that the ordinal of the character is what matters here.
And that's the thing: the concept of encoding alone does not make sense without the concept of characters (what we're encoding). And those characters can only exist within the process (e.g. as numbers in a Perl "string"). Our computer "systems" (web browser, text editor, terminal, program, etc.) do this decode-incoming-octets-then-output-octets-already-encoded dance between each other to hand off characters.
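That dance can be sketched in a few lines. This is only an illustration, assuming the core Encode module; the incoming octets and the choice of UTF-16BE for output are made up for the example:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from another "system" (UTF-8 for codepoint 0x263A).
my $incoming = "\xE2\x98\xBA";

# Decode: octets in, characters (numbers) out. These exist only inside the process.
my $chars = decode('UTF-8', $incoming);
printf "characters: %d, codepoint value: 0x%04X\n", length($chars), ord($chars);

# Encode: characters in, octets out, in whatever encoding the next system expects.
my $outgoing = encode('UTF-16BE', $chars);
printf "outgoing octets: %s\n", join ' ', map { sprintf '%02X', ord } split //, $outgoing;
```

Three octets came in, one character lived in the process, and two octets went out. The character itself never had a size; only its encoded forms did.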
When Perl warns you about "Wide character in print", what it's really saying is: Please be explicit about the encoding so that I can tell the next "system" about my characters accurately, using only octets.
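Here is a small sketch of both situations, using in-memory filehandles so that it does not depend on any terminal settings (the handles and variable names are illustrative only). The first print goes to a handle with no declared encoding, the second to one with an explicit :encoding(UTF-8) layer:

```perl
use strict;
use warnings;

# A character whose codepoint value does not fit in 8 bits.
my $char = chr 0x263A;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No encoding declared on the handle: perl is left to guess, and says so.
my $implicit = '';
open my $fh1, '>', \$implicit or die "open: $!";
print {$fh1} $char;
close $fh1;
my $warned_implicit = scalar @warnings;

# Encoding declared: the characters are handed off as UTF-8 octets, no guessing.
my $explicit = '';
open my $fh2, '>:encoding(UTF-8)', \$explicit or die "open: $!";
print {$fh2} $char;
close $fh2;
my $warned_explicit = @warnings - $warned_implicit;

printf "warnings without a layer: %d\n", $warned_implicit;
printf "warnings with :encoding(UTF-8): %d\n", $warned_explicit;
```

For a real STDOUT the same explicitness would usually come from binmode(STDOUT, ':encoding(UTF-8)') or from calling encode yourself before print.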