PerlMonks
Re^3: Proper Unicode handling in Perl
by haj (Vicar) on Sep 05, 2019 at 02:23 UTC ( [id://11105646]=note )
Short answer: I can reproduce your strings '╨£╨╛╨╣ ╤é╨╡╤ü╤é' and '╨æ╨░╨▒╨║╨╕╨╜╤è ╨£╨╕╤à╨░╨╕╨╗╤è' from the correct ones 'Мой тест' and 'Бабкинъ Михаилъ'. It looks like you print strings which are intended for a UTF-8-enabled browser to a terminal which doesn't understand UTF-8. I guess you are using a Windows terminal with a CP437-compatible codepage. Please retry your test after entering the command chcp 65001.

Long answer follows.

I wrote: "Obviously, You need to save your source code UTF-8 encoded."

You ask: "Is this a thing? To my understanding, it is the opinion of the software which opens the file as to what its encoding is."

Yes, it is a thing, which I keep explaining to people who seem to be unaware of the many places where encoding and decoding take place behind the scenes.

If you see a Cyrillic character on your editor screen, then you see a glyph which looks like, say, Б. The Unicode consortium has assigned the codepoint U+0411 and the name CYRILLIC CAPITAL LETTER BE to this character. When such a character is written to a file, the editor doesn't paint the glyph, nor does it write the codepoint number. Instead, it converts the character into a sequence of bytes according to some encoding. In UTF-8, a Б is represented by the (hexadecimal) sequence D091, in Windows Codepage 1251 it is represented by the single byte C1, and in DOS Codepage 866 it is represented by the single byte 81. About 20 years ago, Roman Czyborra collected these and other encodings of the Cyrillic alphabets under The Cyrillic Charset Soup.

Your editor has to choose one of the encodings. It does so according to some system or user preferences, but every editor is different, and some might not even provide decent information about their choice. If you use, for example, Emacs (available on Windows, too), then the buffer's default encoding is displayed on-screen, but you can override it when saving the file.
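The byte sequences quoted above for Б can be checked with a few lines of Perl. This is only a minimal sketch using the core Encode module; the helper name hex_bytes is made up for this illustration, and the character is written as an escape so that the script does not depend on how its own source file was saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0411 CYRILLIC CAPITAL LETTER BE, written as an escape so that this
# script works regardless of the encoding of its own source file.
my $be = "\x{411}";

# Hypothetical helper: encode a string and return the resulting bytes
# as uppercase hex digits.
sub hex_bytes {
    my ($encoding, $string) = @_;
    return uc(unpack('H*', encode($encoding, $string)));
}

# One character, three different byte sequences on disk.
for my $encoding (qw(UTF-8 cp1251 cp866)) {
    printf "%-8s %s\n", $encoding, hex_bytes($encoding, $be);
}
# UTF-8    D091
# cp1251   C1
# cp866    81
```

The same character ends up as two bytes in one encoding and as two different single bytes in the others, which is why a file's bytes alone don't tell you which character was meant.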
Editors which claim to support Unicode ought to be able to save files under at least one of the UTF encodings. Maybe other monks have current data, but I recall times where Windows editors like Notepad and Notepad++ saved "Unicode" files as UTF-16, which is not UTF-8. In its big-endian form, UTF-16 represents a Б by 0411. This looks like the codepoint number. This is no coincidence, but it led some software engineers to the wrong conclusion that this is "the Unicode encoding".

Now what happens if an editor opens an existing file? Where does it derive its opinion from? Well, in general, it can't. The byte C1 could mean a Б if the file was meant to be read as Windows Codepage 1251, or an Á if it was meant to be read as Windows Codepage 1252. A sequence D091 renders as Ð‘ under Windows Codepage 1252, as Р‘ under Windows Codepage 1251, and as Б under UTF-8.

But there is a special case: In UTF-16 encodings, there are two possible ways to write 0411 to disk, depending on whether your hardware architecture is "little endian" or "big endian". To distinguish between these two, the standards use the special Unicode character U+FEFF 'ZERO WIDTH NO-BREAK SPACE'. A space which doesn't break words and has no width is pretty invisible, so it doesn't do any harm. Big endian systems write this character as the bytes FEFF, while little endian systems swap the bytes and write FFFE. So, whenever a file starts with either FEFF or FFFE, the editor can with some confidence assume that the encoding is UTF-16-BE or UTF-16-LE, respectively. If that invisible space is the first character of a file, it is called a Byte Order Mark, BOM. For UTF-8, the BOM is optional and rarely used, some programs don't like it if it is there, and it has the byte sequence EFBBBF. There is no such thing as a BOM for any of the one-byte encodings. If a file does not start with a BOM, you have nothing.

Similar things happen when the Perl interpreter reads a file. Per default, Perl 5 expects ISO-8859-1 encoding for its source code, which has no BOM.
So, if your source code contains the byte C1 in a literal, then Perl interprets it as the letter Á, and if it contains the bytes D091 in a literal, then Perl interprets them as a Ð followed by a non-printable character, because 91 maps to a control character in ISO-8859-1. To allow human-readable Unicode characters in Perl 5 sources, so that you can write Б instead of "\x{411}", the pragma use utf8; was introduced. This, however, requires that your editor has written the Б to disk as the sequence D091, that is, as UTF-8. I've already said this: There is no pragma for UTF-16 or any other Unicode (or Cyrillic) encoding.

You wrote: "To my eye, he has all of the russian on the hook with his data queries; it's just not getting represented correctly on the terminal that Strawberry Perl gives you."

This is another wrong assumption. It isn't Strawberry Perl which gives you the terminal, it is the Windows operating system. And - truth hurts - if you spit out UTF-8 encoded strings to a Windows terminal, then it might or might not create the correct glyphs. The terminal is, like your editor, a piece of software which receives a bunch of bytes and tries to create the correct character glyphs for you, according to some encoding. The default encoding of the Windows terminal is not UTF-8, but instead some codepage defined in the regional settings of the operating system (I'm currently on a Linux box, so doing that research is up to you). You can learn which codepage is active with the chcp command, and you can also switch your Windows terminal to UTF-8 with the command chcp 65001.
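The garbling from the short answer can be reproduced without a Windows box. The following is only a sketch under assumptions: it simulates the terminal by decoding the UTF-8 bytes with a given codepage, the sub name as_seen_on_terminal is made up for this illustration, and the choice of cp437 is a guess at what the terminal in question uses. The file itself must be saved UTF-8 encoded, because it relies on use utf8; for its literals.

```perl
use strict;
use warnings;
use utf8;                       # this source file must be saved as UTF-8
use Encode qw(encode decode);

# Hypothetical helper: simulate a terminal. The program prints UTF-8
# bytes, and the terminal interprets each byte under its own codepage.
sub as_seen_on_terminal {
    my ($text, $codepage) = @_;
    my $bytes = encode('UTF-8', $text);   # what the program writes
    return decode($codepage, $bytes);     # what the terminal displays
}

# Encode our own output as UTF-8, so run this on a UTF-8 terminal
# (on Windows: after chcp 65001).
binmode STDOUT, ':encoding(UTF-8)';

# cp437 is an assumption about the terminal's active codepage.
for my $text ('Мой тест', 'Бабкинъ Михаилъ') {
    print "$text => ", as_seen_on_terminal($text, 'cp437'), "\n";
}
```

With cp437 this yields '╨£╨╛╨╣ ╤é╨╡╤ü╤é' and '╨æ╨░╨▒╨║╨╕╨╜╤è ╨£╨╕╤à╨░╨╕╨╗╤è', the strings from the original question. Nothing is wrong with the data, only with the codepage used to display it.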