|laziness, impatience, and hubris|
Re: ASCII, Unicode, use utf8: My Story of Discoveryby John M. Dlugosz (Monsignor)
|on Nov 01, 2002 at 19:45 UTC||Need Help??|
In your first listing, the regex engine will match on every byte, so if you feed it a UTF-8-encoded file it will report multiple-byte sequences as its component values. Meanwhile, you are unpacking 'C', which also "only does bytes". So the program is consistant, but the output labeling is wrong: it's not "Unicode Value:", it's "byte value:".
On your second listing, the regex is in the scope of use utf8, so the dot will match a multi-byte character as one character. But then you use unpack 'C' again which ignores the fact that $1 might have a multi-byte character in it, and just returns the value of the first byte.
Now UTF8-encoding is designed to overcome the headaches of past variable-length encoding systems. It's very easy because a character that's represented by a single byte always has the high bit cleared, and all bytes that are part of a multi-byte sequence all have their high bits set.
So, when you test for your range of 32..126 inclusive, you are indeed going to test for ASCII graphics characters because (by design) UTF8 is a proper superset of ASCII. You are picking out those bytes that are single-byte characters that also are not control codes (<32) or the DEL character (127).
The unpack is a roundabout way of doing that. If you just used ord(), you would respect the multibyte nature of what's in $1, and get numbers >256 as applicable. This would work about the same but would not rely on this artifact of UTF8 encoding.
But you can skip the while loop completely!
will also return false if the $column contains any character outside of that range, true if it contains only characters in that range.
Meanwhile, ASCII doesn't have a character 231, since it only goes up to 127. Windows displays an ANSI character in your current code page, which varies based on what country you are in. The command window is using the OEM character set to be compatible with old DOS text programs, which is why it interprets 231 as a different character!