PerlMonks  

Re^2: Is there some universal Unicode+UTF8 switch?

by VK (Novice)
on Sep 02, 2019 at 05:49 UTC (#11105420)


in reply to Re: Is there some universal Unicode+UTF8 switch?
in thread Is there some universal Unicode+UTF8 switch?

> So your script itself is in UTF-8?
Well, not really. Others have already reminded us that UTF-8 is merely a "transport protocol" for multibyte chars, not what you see and get in the text. So if I have in my code something like my $Cyrillic_literal = 'here some Cyr letters but perlmonks.org replaces it with HTML-escapes'; it is not UTF-8, it is Unicode. If you select and copy that here some Cyr letters but perlmonks.org replaces it with HTML-escapes, your buffer will contain a Unicode string, not a UTF-8 string. So the proper wording would be use unicode; and not use utf8; But it is not me who chose the module name, I'm just an end user. It could be called utf8 or even foobar: as long as it does what is needed (Unicode handling), I do not care too much.

> If the JSON Cyrillic is not UTF-8, what encoding is it in?
It is in Unicode. If you want UTF-8, call it like this (note formatversion changed to 1): https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=1&list=allusers&auactiveusers&aufrom=Б
The major problem of Perl as I see it (see the module name question above) is that it treats UTF-8 and Unicode as things of the same kind, while they are two completely different things. All its (de|en)coding oops come from here. IMHO.

Replies are listed 'Best First'.
Re^3: Is there some universal Unicode+UTF8 switch?
by daxim (Curate) on Sep 02, 2019 at 07:02 UTC
    Your understanding of the topic is still inadequate. I am convinced that you already know "enough to be dangerous", but not enough to arrive at the correctly modelled solution that most other Perl programmers would implement. It's quite difficult to suss out the parts where you still need education or clearing up a misunderstanding, but I will try anyway.

    One half of what the utf8 pragma does is to allow non-ASCII character literals in the source code. For example, you could simply write

    use utf8; my $aufrom = "Барсучелло";
    instead of the much more tedious
    my $aufrom = "\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER CHE}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER O}";
    or unreadable
    my $aufrom = "\N{U+0411}\N{U+0430}\N{U+0440}\N{U+0441}\N{U+0443}\N{U+0447}\N{U+0435}\N{U+043B}\N{U+043B}\N{U+043E}";

    The other half of the utf8 pragma is that Perl expects the source code to be encoded in UTF-8. To be clear, that's the encoding option in the text editor. A hex dump of the file with the first example from above would correctly look like:

    00000000  75 73 65 20 75 74 66 38  3b 20 6d 79 20 24 61 75 use utf8; my $au
    00000010  66 72 6f 6d 20 3d 20 22  d0 91 d0 b0 d1 80 d1 81 from = "Барс
    00000020  d1 83 d1 87 d0 b5 d0 bb  d0 bb d0 be 22 3b 0a    учелло";␊
    
    If you mistakenly saved the text as Windows-1251, the program would no longer work correctly:
    00000000  75 73 65 20 75 74 66 38  3b 20 6d 79 20 24 61 75 use utf8; my $au
    00000010  66 72 6f 6d 20 3d 20 22  c1 e0 f0 f1 f3 f7 e5 eb from = "
    00000020  eb ee 22 3b 0a                                   ";␊
    # Malformed UTF-8 character: \xc1\xe0 (unexpected non-continuation byte 0xe0, immediately after start byte 0xc1; need 2 bytes, got 1) at 
    

    Therefore the name utf8 for the pragma really is appropriate, and unicode is not.
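The two halves of the pragma can be checked with a small script. This is only a sketch and assumes the script file itself is saved as UTF-8; the variable names are made up for illustration. It compares the same literal with and without the pragma, and then feeds the first two octets of the Windows-1251 dump to a strict UTF-8 decoder.

```perl
use strict;
use warnings;
use utf8;                        # literals below are decoded from UTF-8 source
use Encode qw(encode decode);

# With the pragma active, the literal is a string of 10 characters.
my $chars = "Барсучелло";

# Without it, the same literal is the 20 raw octets stored in the file.
my $octets = do {
    no utf8;
    "Барсучелло";
};

printf "characters: %d, octets: %d\n", length($chars), length($octets);

# Encoding the character string should reproduce the octets from the hex dump.
my $same = encode('UTF-8', $chars) eq $octets ? 'yes' : 'no';
print "encode(chars) eq octets: $same\n";

# The first two octets of the Windows-1251 dump are not valid UTF-8,
# so a strict decode croaks; eval catches that.
my $cp1251 = "\xc1\xe0";
my $ok = eval {
    decode('UTF-8', $cp1251, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
print "cp1251 octets decode as UTF-8: ", ($ok ? 'yes' : 'no'), "\n";
```

The script prints only counts and yes/no answers, so no output layer is needed for the Cyrillic text.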


    > > what encoding is it in?
    > It is in Unicode.
    Unicode is not a valid encoding name for a JSON file or JSON HTTP response body. Again, if you look at the hex dump for both formatversion=1 and formatversion=2, you will see that both are octets encoded in UTF-8.
    # formatversion=2
    00000000  7b 22 62 61 74 63 68 63  6f 6d 70 6c 65 74 65 22 {"batchcomplete"
    00000010  3a 74 72 75 65 2c 22 63  6f 6e 74 69 6e 75 65 22 :true,"continue"
    00000020  3a 7b 22 61 75 66 72 6f  6d 22 3a 22 d0 91 d0 b0 :{"aufrom":"Ба
    00000030  d1 80 d1 81 d1 83 d1 87  d0 b5 d0 bb d0 bb d0 be рсучелло
    

    # formatversion=1
    00000000  7b 22 62 61 74 63 68 63  6f 6d 70 6c 65 74 65 22 {"batchcomplete"
    00000010  3a 22 22 2c 22 63 6f 6e  74 69 6e 75 65 22 3a 7b :"","continue":{
    00000020  22 61 75 66 72 6f 6d 22  3a 22 5c 75 30 34 31 31 "aufrom":"\u0411
    00000030  5c 75 30 34 33 30 5c 75  30 34 34 30 5c 75 30 34 \u0430\u0440\u04
    00000040  34 31 5c 75 30 34 34 33  5c 75 30 34 34 37 5c 75 41\u0443\u0447\u
    
    To be more precise, formatversion=2 is normal/boring/genuine UTF-8, and formatversion=1 is US-ASCII (which is a proper subset of UTF-8; this means any file in US-ASCII is also valid UTF-8). formatversion=1 achieves this by not using character literals in the encoded JSON. Instead it uses the \u character escapes, which by design use octets from the US-ASCII repertoire only: the backslash, the small letter u, and the hexadecimal digits 0 to 9 and a to f.

    Got that? There are two layers of encoding at play here. Both files/responses are UTF-8, therefore the use of the decode_json function, which expects UTF-8 octets as input, is appropriate for both. The function first decodes UTF-8 octets into Perl characters, but it then also decodes JSON \u character escapes into Perl characters.
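A minimal sketch of the two layers, using decode_json from the core module JSON::PP. The two octet strings below are shortened stand-ins for the API responses (only the aufrom key is kept), not the real response bodies.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);
use Encode qw(encode);

# The ten characters of the user name, written as code points.
my $expected = "\x{0411}\x{0430}\x{0440}\x{0441}\x{0443}"
             . "\x{0447}\x{0435}\x{043B}\x{043B}\x{043E}";

# formatversion=2 style: the characters appear directly, as UTF-8 octets.
my $v2_octets = '{"aufrom":"' . encode('UTF-8', $expected) . '"}';

# formatversion=1 style: pure US-ASCII, the characters are \u escapes.
# Single quotes keep the backslashes literal.
my $v1_octets =
    '{"aufrom":"\u0411\u0430\u0440\u0441\u0443\u0447\u0435\u043b\u043b\u043e"}';

# decode_json expects UTF-8 octets and handles both layers.
my $from_v2 = decode_json($v2_octets)->{aufrom};
my $from_v1 = decode_json($v1_octets)->{aufrom};

printf "v2: %d characters, v1: %d characters, identical: %s\n",
    length($from_v2), length($from_v1),
    ($from_v2 eq $from_v1 ? 'yes' : 'no');
```

Both calls should give the same ten-character Perl string, even though the input octets differ.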

      > I am convinced that you already know "enough to be dangerous", but not enough to arrive at the correctly modelled solution that most other Perl programmers would implement.
      I have to agree on that. The last time I extensively programmed something fully by myself and in Perl, it was one week before the capitulation in the Browser War, November 1998. I am actually surprised to find myself able to write a working program in 2 days, 20 years later. It is amazing how much stuff can be kept at the back of the mind... I'm fully fluent in Javascript though.
      I'll break the code into minimal test cases to check all the advice and corrections given.
      It is rather off-topic for the initial question "Is there some universal Unicode+UTF8 switch?", but if it's ok to continue in the same thread then I will continue here.

        > if it's ok to continue in the same thread then I will continue here
        Thread drift is allowed. For good netiquette, also change the title in the reply form.
Re^3: Is there some universal Unicode+UTF8 switch?
by haj (Chaplain) on Sep 02, 2019 at 08:08 UTC

    > > If the JSON Cyrillic is not UTF-8, what encoding is it in?

    > It is in Unicode. If you want UTF-8, call it like this (note formatversion changed to 1): https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=1&list=allusers&auactiveusers&aufrom=Б The major problem of Perl as I see it (see the module name question above) is that it treats UTF-8 and Unicode as things of the same kind, while they are two completely different things. All its (de|en)coding oops come from here. IMHO.

    I'm sorry, but this is just plain wrong. Perl knows what UTF-8 is (a representation of Unicode in bytes) and what Unicode (a mapping of characters to numbers) is. This is not a problem of Perl.

    If you write that your JSON "is in Unicode" then this makes sense for a Perl string which has been properly decoded. Text strings in Perl can contain Unicode characters. For these strings the term "encoding" doesn't make any sense. Internally Perl might store them as UTF-8 encoded, but this is irrelevant for Perl users and has occasionally led users of Devel::Peek to wrong conclusions about the nature of their data. But you cannot store a file (source or data) "in Unicode", and you cannot get an HTTP response "in Unicode". Whenever data enter or leave the Perl program or your source code editor, you need to decide on an encoding for Unicode strings. Whenever you "see" a Unicode character in an editor window, a console or a web page, some software had to do the mapping from encoded octets to the Unicode code point, and from there to the glyph which is displayed on your screen.
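The "decode on the way in, encode on the way out" rule could be sketched like this with the core Encode module. The octets are hard-coded here as a stand-in for data read from a file or a socket, and UTF-8 is simply assumed as the encoding on both sides.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they might arrive from outside the program:
# the UTF-8 encoding of the four characters U+0411 U+0430 U+0440 U+0441.
my $incoming = "\xd0\x91\xd0\xb0\xd1\x80\xd1\x81";

# Boundary in: octets become a string of characters.
my $text = decode('UTF-8', $incoming);

# Inside the program, operations work on characters, not octets.
my $upper = uc $text;

# Boundary out: characters become octets again.
my $outgoing = encode('UTF-8', $upper);

printf "in: %d octets, text: %d characters, out: %s\n",
    length($incoming), length($text), unpack('H*', $outgoing);
```

Between the two boundaries the string has no encoding to speak of; only $incoming and $outgoing do.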

    That said, some Microsoft programs offer to store text "in Unicode" and then write the data in UTF-16LE encoding. This often leads to confusion, as does their use of "ANSI encoding" when they mean Windows code page 1252. There is no Perl pragma to tell Perl that source files are encoded in UTF-16 or Windows CP 1252.
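To illustrate why "stored in Unicode" says nothing about the octets on disk, here is a small sketch that encodes one and the same character in the three encodings mentioned above. The choice of U+00E9 is arbitrary; it was picked only because code page 1252 can represent it.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E9}";   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# One character, three encodings, three different octet sequences (as hex).
my %octets = map { $_ => unpack('H*', encode($_, $char)) }
             qw(UTF-8 UTF-16LE cp1252);

printf "%-8s %s\n", $_, $octets{$_} for sort keys %octets;
# UTF-8 gives c3a9, UTF-16LE gives e900, cp1252 gives e9
```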

Re^3: Is there some universal Unicode+UTF8 switch?
by Anonymous Monk on Sep 02, 2019 at 06:25 UTC
    here some Cyr letters but perlmonks.org replaces it with HTML-escapes

    It does that to code tags for some reason, use pre instead:

    просто другой хакер жемчуга
    
      Thank you. Actually this <code> behavior might be another illustration of Perl's misunderstanding of "UTF-8 literal" and "Unicode literal" :-)

        Except that there are no "Unicode literals".

        This has nothing to do with Perl. What we see here are different rules for the acceptable character sets of PerlMonks markup. BTST: Today I was surprised when I noticed that I could write "Rīga" in the text of an article, but not in its title.
