Re: ASCII, Unicode, use utf8: My Story of Discovery

by John M. Dlugosz (Monsignor)
on Nov 01, 2002 at 19:45 UTC (#209823)

in reply to ASCII, Unicode, use utf8: My Story of Discovery

In your first listing, the regex engine matches on every byte, so if you feed it a UTF-8-encoded file it will report each multi-byte sequence as its component byte values. Meanwhile, you are unpacking with 'C', which also "only does bytes". So the program is consistent, but the output labeling is wrong: it's not "Unicode Value:", it's "byte value:".

In your second listing, the regex is in the scope of use utf8, so the dot will match a multi-byte character as one character. But then you use unpack 'C' again, which ignores the fact that $1 might hold a multi-byte character and just returns the value of the first byte.
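To make the byte-versus-character distinction concrete, here is a minimal sketch. It uses the Encode module to decode explicitly rather than relying on the scope of use utf8, and the sample string is my own assumption, not taken from your listings.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "e with acute" (U+00E9) followed by "A", as raw UTF-8 bytes.
my $bytes = "\xC3\xA9A";
my $chars = decode('UTF-8', $bytes);

# Matching undecoded data: the dot sees three separate bytes.
my @byte_values = map { ord($_) } $bytes =~ /(.)/gs;

# Matching the decoded string: the dot sees two characters.
my @char_values = map { ord($_) } $chars =~ /(.)/gs;

print "byte values: @byte_values\n";   # 195 169 65
print "char values: @char_values\n";   # 233 65
```

The first line of output is what the first listing is really reporting; the second is what the "Unicode Value:" label promises.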

Now, UTF-8 was designed to avoid the headaches of earlier variable-length encodings. It's easy to work with because a character represented by a single byte always has its high bit clear, while every byte that is part of a multi-byte sequence has its high bit set.
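A rough way to see that high-bit property for yourself: the sketch below encodes a string to UTF-8 and labels each byte. The helper name classify_bytes and the sample string are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: label each byte of the UTF-8 encoding by its high bit.
sub classify_bytes {
    my ($text) = @_;
    my @report;
    for my $byte (unpack 'C*', encode('UTF-8', $text)) {
        my $kind = ($byte & 0x80)
            ? 'part of a multi-byte sequence'
            : 'single-byte (ASCII)';
        push @report, sprintf('%3d %s', $byte, $kind);
    }
    return @report;
}

# "A" followed by U+00E9, which needs two bytes in UTF-8.
print "$_\n" for classify_bytes("A\x{E9}");
```

Here unpack 'C*' is the right tool, because the string has been explicitly encoded and really is a sequence of bytes.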

So, when you test for your range of 32..126 inclusive, you are indeed going to test for ASCII graphics characters because (by design) UTF-8 is a proper superset of ASCII. You are picking out those bytes that are single-byte characters and are neither control codes (<32) nor the DEL character (127).

The unpack is a roundabout way of doing that. If you just used ord(), you would respect the multi-byte nature of what's in $1 and get numbers above 255 where applicable. This would work about the same but would not rely on this artifact of the UTF-8 encoding.
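As a small sketch of the ord() approach, assuming the data has been decoded into characters first (here with Encode, and with a sample character of my own choosing):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the euro sign (U+20AC) arriving as three UTF-8 bytes.
my $char = decode('UTF-8', "\xE2\x82\xAC");

printf "length: %d\n", length($char);   # 1 character
printf "ord:    %d\n", ord($char);      # 8364, i.e. 0x20AC

# The same 32..126 test as the original loop, written with ord().
my $is_ascii_graphic = (ord($char) >= 32 && ord($char) <= 126) ? 1 : 0;
print "in 32..126? $is_ascii_graphic\n";   # 0
```

The range test gives the same answer as before, but now because the code point really is outside 32..126, not because the first byte happens to have its high bit set.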

But you can skip the while loop completely!

use utf8; return ! ($column =~ /[^\x20-\x7e]/);
returns false if $column contains any character outside the range 32..126, and true if it contains only characters in that range.
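Wrapped up as a complete subroutine, that might look like the sketch below. The name is_printable_ascii is hypothetical, and the character class assumes the intended range is the 32..126 discussed above, i.e. \x20 through \x7E.

```perl
use strict;
use warnings;

# Hypothetical helper: true when every character of $column is in 32..126.
# Note that an empty string also counts as true, since nothing is out of range.
sub is_printable_ascii {
    my ($column) = @_;
    return $column !~ /[^\x20-\x7E]/;
}

for my $sample ("plain text", "caf\x{E9}", "tab\there") {
    printf "%-12s %s\n",
        is_printable_ascii($sample) ? 'ok' : 'rejected',
        join ' ', map { ord($_) } split //, $sample;
}
```

The negated character class lets the regex engine stop at the first offending character, so no explicit loop is needed.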

Meanwhile, ASCII doesn't have a character 231, since it only goes up to 127. Windows displays byte 231 as an "ANSI" character from your current code page, which varies with your regional settings. The command window uses the OEM character set instead, for compatibility with old DOS text programs, which is why it interprets 231 as a different character!
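To illustrate, the sketch below decodes byte 231 under two code pages. The choice of cp1252 for "ANSI" and cp437 for "OEM" is an assumption on my part; your machine may well use a different pair.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same byte, 231 (0xE7), read under two legacy code pages.
# Both code page choices are assumptions, not taken from the original post.
my $ansi = decode('cp1252', chr(231));   # a common Windows "ANSI" code page
my $oem  = decode('cp437',  chr(231));   # the original IBM PC / DOS code page

printf "cp1252: U+%04X\n", ord($ansi);   # U+00E7, c with cedilla
printf "cp437:  U+%04X\n", ord($oem);    # U+03C4, Greek small letter tau
```

The byte never changes; only the table used to interpret it does.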

