PerlMonks
Re^3: Is there some universal Unicode+UTF8 switch?
by daxim (Curate) on Sep 02, 2019 at 11:02 UTC ( [id://11105426]=note )
Your understanding of the topic is still inadequate. I am convinced that you already know "enough to be dangerous", but not enough to arrive at the correctly modelled solution that most other Perl programmers would implement. It is quite difficult to suss out where you still need education or have a misunderstanding that needs clearing up, but I will try anyway.
One half of what the utf8 pragma does is allow non-ASCII character literals in the source code. For example, you could simply write

    use utf8; my $aufrom = "Барсучелло";

instead of the much more tedious

    my $aufrom = "\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER CHE}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER O}";

or the unreadable

    my $aufrom = "\N{U+0411}\N{U+0430}\N{U+0440}\N{U+0441}\N{U+0443}\N{U+0447}\N{U+0435}\N{U+043B}\N{U+043B}\N{U+043E}";

The other half of the utf8 pragma is that Perl expects the source code to be encoded in UTF-8. To be clear, that's the encoding option in the text editor. A hex dump of the file with the first example from above would correctly look like:

    00000000  75 73 65 20 75 74 66 38 3b 20 6d 79 20 24 61 75  use utf8; my $au
    00000010  66 72 6f 6d 20 3d 20 22 d0 91 d0 b0 d1 80 d1 81  from = "Ð.Ð°Ñ.Ñ.
    00000020  d1 83 d1 87 d0 b5 d0 bb d0 bb d0 be 22 3b 0a     Ñ.Ñ.ÐµÐ»Ð»Ð¾";␊

If you mistakenly saved the text as Windows-1251, the program would no longer work correctly:

    00000000  75 73 65 20 75 74 66 38 3b 20 6d 79 20 24 61 75  use utf8; my $au
    00000010  66 72 6f 6d 20 3d 20 22 c1 e0 f0 f1 f3 f7 e5 eb  from = "Áàðñó÷åë
    00000020  eb ee 22 3b 0a                                   ëî";␊

    # Malformed UTF-8 character: \xc1\xe0 (unexpected non-continuation byte 0xe0, immediately after start byte 0xc1; need 2 bytes, got 1) at …

Therefore the name utf8 for the pragma really is appropriate, and unicode is not.

> > what encoding is it in?
> It is in Unicode.

Unicode is not a valid encoding name for a JSON file or JSON HTTP response body. Again, if you look at the hex dumps for both formatversion=1 and formatversion=2, you will see that both are octets encoded in UTF-8.
    # formatversion=2
    00000000  7b 22 62 61 74 63 68 63 6f 6d 70 6c 65 74 65 22  {"batchcomplete"
    00000010  3a 74 72 75 65 2c 22 63 6f 6e 74 69 6e 75 65 22  :true,"continue"
    00000020  3a 7b 22 61 75 66 72 6f 6d 22 3a 22 d0 91 d0 b0  :{"aufrom":"Ð.Ð°
    00000030  d1 80 d1 81 d1 83 d1 87 d0 b5 d0 bb d0 bb d0 be  Ñ.Ñ.Ñ.Ñ.ÐµÐ»Ð»Ð¾
    # formatversion=1
    00000000  7b 22 62 61 74 63 68 63 6f 6d 70 6c 65 74 65 22  {"batchcomplete"
    00000010  3a 22 22 2c 22 63 6f 6e 74 69 6e 75 65 22 3a 7b  :"","continue":{
    00000020  22 61 75 66 72 6f 6d 22 3a 22 5c 75 30 34 31 31  "aufrom":"\u0411
    00000030  5c 75 30 34 33 30 5c 75 30 34 34 30 5c 75 30 34  \u0430\u0440\u04
    00000040  34 31 5c 75 30 34 34 33 5c 75 30 34 34 37 5c 75  41\u0443\u0447\u

To be more precise, formatversion=2 is normal/boring/genuine UTF-8, and formatversion=1 is US-ASCII (which is a proper subset of UTF-8; this means any file in US-ASCII is also valid UTF-8). formatversion=1 achieves this by not using character literals in the encoded JSON; instead it uses the \u character escapes (which by design use octets from the US-ASCII repertoire only: the backslash, the small letter u, and the hexadecimal digits 0 to 9 and a to f).

Got that? There are two layers of encoding at play here. Both files/responses are UTF-8, therefore the use of the decode_json function, which expects UTF-8 octets as input, is appropriate for both. The function first decodes the UTF-8 octets into Perl characters, and then also decodes the JSON \u character escapes into Perl characters.
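To see the two layers in action, here is a minimal sketch. It assumes the core JSON::PP module (JSON::XS and Cpanel::JSON::XS export a decode_json with the same contract), and the two octet strings are hand-shortened stand-ins for the real responses, not actual API output. Both inputs should decode to the same ten-character Perl string.

```perl
use strict;
use warnings;
use utf8;                        # this file is saved as UTF-8, see $expected below
use JSON::PP qw(decode_json);    # core module since Perl 5.14

# Shortened stand-in for a formatversion=2 body: the Cyrillic letters
# are raw UTF-8 octets, written with \x escapes so they stay bytes.
my $octets_v2 = qq({"continue":{"aufrom":"\xd0\x91\xd0\xb0\xd1\x80\xd1\x81\xd1\x83\xd1\x87\xd0\xb5\xd0\xbb\xd0\xbb\xd0\xbe"}});

# Shortened stand-in for a formatversion=1 body: pure US-ASCII octets,
# the letters are JSON \u escapes (q() keeps the backslashes literal).
my $octets_v1 = q({"continue":{"aufrom":"\u0411\u0430\u0440\u0441\u0443\u0447\u0435\u043b\u043b\u043e"}});

# decode_json expects UTF-8 octets and returns Perl character strings.
my $v2 = decode_json($octets_v2)->{continue}{aufrom};
my $v1 = decode_json($octets_v1)->{continue}{aufrom};

# A character literal; works only because of "use utf8" above.
my $expected = "Барсучелло";

# Encode characters back to UTF-8 octets on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
printf "v1: %s (%d characters)\n", $v1, length($v1);
printf "v2: %s (%d characters)\n", $v2, length($v2);
print  "identical\n" if $v1 eq $v2 and $v1 eq $expected;
```

Note that the input to decode_json is octets in both cases (20 octets for the name in the first string, 60 in the second), while the output is characters. The binmode layer is what turns the characters back into UTF-8 octets for the terminal.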