PerlMonks  

Is there some universal Unicode+UTF8 switch?

by VK (Novice)
on Sep 01, 2019 at 16:37 UTC ( #11105382 )

VK has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone.
My CGI script works with Cyrillic. "Works" means that:

  • it contains unicoded Cyrillic literals
  • it outputs unicoded Cyrillic
  • it queries for unicoded JSON Cyrillic
  • it receives and handles unicoded JSON Cyrillic (not UTF8-encoded sequences) like this: https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&list=allusers&auactiveusers&aufrom=Б
It is all fine and working by now, but to achieve it I had to treat my script like a drunk buddy I have to walk home from the bar :-) - the moment my attention lapses, he tries to fall to the ground and sleep.
  • I have use utf8; for script literals
  • I have binmode STDOUT, ':utf8'; for "wide character" warnings
  • I have JSON->new->utf8(0)->decode($response->content) for LWP query results, because a plain decode_json from the JSON module mangles things somewhere along the way
  • I have my $unicode_literal = decode('utf-8', $data->{result}) from Encode module or else chars get jammed
  • And a couple more things like this in other places
So my question is: does modern Perl have some "global super-mode use utf8"? So it would be like use utf8 - once placed at the beginning. But the effect would be to kill in the program any memories of any encoding but Unicode. So literally would mean "in this program there is nothing but Unicode. You get Unicode, you output Unicode. There is nothing in this world but Unicode." Something like this.

Replies are listed 'Best First'.
Re: Is there some universal Unicode+UTF8 switch?
by daxim (Curate) on Sep 01, 2019 at 17:17 UTC
    Try utf8::all. It's not universal, because it handles only the core functionality, not libraries. Your use case can be much simplified, though. I strongly suspect you have too much code. Consider:
    • HTTP::Response provides both content (returns octets) and decoded_content (returns characters, appropriately decoded from Content-Type header).
    • decode_json wants to consume octets.
    This means the following DWYW:
    use LWP::UserAgent qw();
    use JSON::MaybeXS qw(decode_json);

    my $ua  = LWP::UserAgent->new;
    my $res = $ua->get('https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&list=allusers&auactiveusers&aufrom=%D0%91');
    die $res->status_line unless $res->is_success;

    my $json_OCTETS                = $res->content;
    my $all_users_CHARACTERS       = decode_json $json_OCTETS;
    my $continue_aufrom_CHARACTERS = $all_users_CHARACTERS->{continue}{aufrom};
    Your CGI script's templating system should take care to produce UTF-8 encoded octets. If you don't have one, then either one of
    • use Encode qw(encode);
      my $continue_aufrom_OCTETS = encode('UTF-8', $continue_aufrom_CHARACTERS, Encode::FB_CROAK);
      STDOUT->print($continue_aufrom_OCTETS);
    • binmode STDOUT, ':encoding(UTF-8)';
      STDOUT->print($continue_aufrom_CHARACTERS);
    is appropriate. The first variant is more robust.

      use utf8::all; sounds the most promising, thank you and I'll try it. The reason I didn't use it yet is that the main doc https://perldoc.perl.org/utf8.html doesn't have a single mention of this option - so either you know about utf8::all in advance, or you are out of luck.

      The JSON function shortcut decode_json has UTF-8 decoding hardcoded to "on". To turn it "off" and avoid double decoding I had to use the full call JSON->new->utf8(0)->decode($response->content). If utf8::all solves this problem as well, then I can use the function shortcut. I will check everything later today.
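      For what it's worth, the difference between the shortcut and the full call can be seen without any network access. A minimal sketch, using the core JSON::PP module as a stand-in for JSON (the sample data is made up):

```perl
use strict;
use warnings;
use utf8;                      # the literal below is a character string
use JSON::PP qw(decode_json);  # core stand-in for the JSON module
use Encode qw(encode);

my $json_chars  = '{"name":"Б"}';               # characters (thanks to use utf8)
my $json_octets = encode('UTF-8', $json_chars); # the octets LWP's ->content returns

# decode_json expects octets and does the UTF-8 decoding itself ...
my $from_octets = decode_json($json_octets);

# ... while utf8(0) declares the input to be characters already.
my $from_chars = JSON::PP->new->utf8(0)->decode($json_chars);

# Both roads lead to the same Perl character string.
print $from_octets->{name} eq $from_chars->{name} ? "same\n" : "different\n";  # same
```

      So the shortcut is not "broken"; it just insists on being fed octets.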

      (Update) Nope, I rechecked - only the current long code works reliably for non-ASCII. For the sample URL above I make the LWP call to get my $response and then

      1. my $data1 = JSON->new->utf8(0)->decode($response->content);
      2. my $data2 = decode_json($response->content);
      3. my $data3 = $response->decoded_content;
      and then my $test = $data1->{query}->{allusers}[0]->{name};

      1) always works for my needs. 2) works if called in some obvious scalar context; try to slice a referenced array or anything complex and it falls into the "Perl branded jam" of mangled characters and the like. 3) is stably DOA (dead on arrival) - the same as 2), but right away.

      So utf8::all should be rewritten and extended into some utf8::all_throughout. "Written" means brought to the level of reliability and stability needed for inclusion in the prominent Perl distributions. Until then, the answer to my initial question seems to be negative.

        slice referenced array or anything complex - it falls to the "Perl branded jam"
        I'm sceptical about that claim. Show your code.
      Just stylin:
      STDOUT->binmode(':encoding(UTF-8)');
      STDOUT->print($continue_aufrom_CHARACTERS);
Re: Is there some universal Unicode+UTF8 switch?
by davido (Cardinal) on Sep 01, 2019 at 20:56 UTC

    There may be a switch or two, but none that absolve the programmer of the need to think things through. Witness another post from tchrist (another one that leaves me with a feeling of smallness and inadequacy, despite its good intentions) dealing with the things one must take into consideration for Unicode to be more or less "all on."

    Go Thou and Do Likewise


    Dave

Re: Is there some universal Unicode+UTF8 switch?
by haj (Chaplain) on Sep 02, 2019 at 02:03 UTC
    So my question is: does modern Perl have some "global super-mode use utf8"? So it would be like use utf8 - once placed at the beginning. But the effect would be to kill in the program any memories of any encoding but Unicode. So literally would mean "in this program there is nothing but Unicode. You get Unicode, you output Unicode. There is nothing in this world but Unicode." Something like this.

    Such a general switch is likely to break things. But first, let me clarify: Unicode is not utf8. Unicode is a concept that assigns a number ("code point") to every character considered worthy by the Unicode consortium. UTF-8 is a recipe to map strings built from those characters to sequences of octets - and interpret sequences of octets as a Unicode string.

    This distinction is important because it makes clear that the concept of "Unicode", and of its encodings, only applies to text. Binary data in a Perl program are not Unicode strings, and binary data in files or read with LWP are not valid UTF-8 (most of the time).

    Your code JSON->new->utf8(0)->decode($response->content) looks just wrong. You should investigate how the correct invocation somehow mangles things. Step by step:

    • $response->content from LWP provides octets. The web server must specify the content type and encoding, and this is the encoding which must be used to interpret these octets, regardless of any settings in the program.
    • The combination JSON->new->utf8(0)->decode(...) expects a Unicode string, but you're feeding it bytes. So those raw bytes end up in your JSON structure.
    • This would explain why you need to run $unicode_literal = decode('utf-8', $data->{result}) on your JSON fields.
    These are two errors cancelling each other. You need to get rid of both of them, and I'm sure the program gets easier to understand after that.
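    A hypothetical minimal reconstruction of those cancelling errors, with the response body simulated so only core JSON::PP and Encode are needed:

```perl
use strict;
use warnings;
use JSON::PP ();
use Encode qw(decode encode);

# Simulated $response->content: UTF-8 octets, as LWP delivers them.
my $octets = encode('UTF-8', qq({"result":"\x{411}"}));   # Cyrillic Be

# Error 1: utf8(0) promises characters, but we feed octets,
# so the field comes out holding two raw bytes, \xD0\x91 ...
my $mangled = JSON::PP->new->utf8(0)->decode($octets)->{result};

# Error 2: ... which then "needs" a second decode to look right.
my $repaired = decode('UTF-8', $mangled);

# The correct single step: let the decoder consume the octets directly.
my $correct = JSON::PP->new->utf8(1)->decode($octets)->{result};

print $repaired eq $correct ? "the two errors cancelled out\n" : "mismatch\n";
```

    Dropping both the utf8(0) and the per-field decode leaves just the one correct line.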
Re: Is there some universal Unicode+UTF8 switch?
by Anonymous Monk on Sep 01, 2019 at 17:12 UTC

      woops... The most recent Strawberry Perl doesn't have it. use utf8 - OK; use utf8::all - gives "Can't locate utf8/all.pm in @INC (you may need to install the utf8::all module)". The same with all my current hosting providers (500 Internal Server Error).

      So the idea behind utf8::all looks attractive but either it was never properly implemented or had to be removed from all default sets.

        The most recent Strawberry Perl doesn't have it

        You are right: Strawberry Perl doesn't include that module. It's not part of core Perl, nor of Strawberry's default list of bundled non-core modules, but it is available from CPAN. Fortunately, Strawberry includes a variety of CPAN clients, including cpan and cpanm (I personally prefer cpanm). You can use one of those tools to install it with cpanm utf8::all (or the appropriate syntax for your client).

        You'll find that most modules are available this way, rather than pre-bundled with your perl distribution, so keep this in mind as you hear about other modules that you might want to use in this or future projects.

Re: Is there some universal Unicode+UTF8 switch?
by jcb (Deacon) on Sep 01, 2019 at 23:25 UTC
    • it contains unicoded Cyrillic literals

      So your script itself is in UTF-8?

    • it outputs unicoded Cyrillic

      And you are selecting UTF-8 output... are you correctly declaring the output to be UTF-8 in the HTTP headers?

    • it queries for unicoded JSON Cyrillic

    • it receives and handles unicoded JSON Cyrillic (not UTF8-encoded sequences) like this: https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&list=allusers&auactiveusers&aufrom=Б

      This last item raises a big question: If the JSON Cyrillic is not UTF-8, what encoding is it in? You may need the Encode module to perform character set conversion prior to JSON decoding.
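      On the HTTP-header point above: a CGI script must both declare the charset and actually encode the body. A minimal sketch with core Encode (the body content is a made-up example):

```perl
use strict;
use warnings;
use utf8;                          # the Cyrillic literal below is characters
use Encode qw(encode);

# Declare the encoding in the HTTP response header ...
print "Content-Type: text/html; charset=UTF-8\r\n\r\n";

# ... and make sure the body really is UTF-8 octets, not characters.
my $body = "<p>Барсучелло</p>";
print encode('UTF-8', $body, Encode::FB_CROAK);
```

      Declaring charset=UTF-8 while printing unencoded wide characters (or the reverse) is a classic source of the "jam".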

Re: Is there some universal Unicode+UTF8 switch?
by Anonymous Monk on Sep 02, 2019 at 05:11 UTC
    Perhaps it will help to systematize things a bit. Text can come to your program from various sources:

    1) String literals in the source code. Solution: put use utf8 on top of all your files.

    2) Files/sockets that you open (or that are opened for you, like STDIN). Solution: binmode, or open my $fh, '<:encoding(utf-8)', $name.

    3) Various system messages, errors and the like, such as $! (as in open my $fh, '<', $name or die $!). The $! gives you the output of C's strerror(), a string that contains who-knows-what in an unspecified encoding... Once upon a time perl liked to double-encode such things, but I think it doesn't anymore?

    4) Modules, written by other people. This is the real problem. There is no solution. You need to read the documentation (or the source code) and do the right thing.

    Thus, there is no super-switch. It would be good if there were, but there isn't. utf8::all helps with the first 3 problems, but it doesn't do anything about the 4th. Use Devel::Peek for debugging; unfortunately, that's still the best thing there is.
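    Points 1) and 2) together fit in a few lines. A sketch (the file itself must be saved as UTF-8 for use utf8 to hold up its end):

```perl
use strict;
use warnings;
use utf8;                              # 1) string literals in this file are UTF-8
binmode STDOUT, ':encoding(UTF-8)';    # 2) encode characters on the way out

my $name = 'Барсучелло';               # a 10-character string, not 20 bytes
printf "%s: %d characters\n", $name, length $name;   # prints 10, not 20
```

    Without use utf8 the same literal would be a 20-byte octet string and length would report 20.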

      A good compilation, to which I'd like to add (or maybe in the cases 6 and 7 just expand on your fourth entry):

      5) @ARGV and %ENV. utf8::all does convert @ARGV but does not touch %ENV. Good luck when you're on Windows, where by default the terminal doesn't use UTF-8 encoding.

      6) Database fields. Encoding of these is often defined outside of the Perl world. The driver docs should tell you how to handle encoding.

      7) Evaluating binary data in your program: Unzipping compressed data, decrypting secret stuff, and parsing ASN.1 or image metadata may all return (encoded) texts.

      I'm also not too happy with using Devel::Peek for debugging encoding issues. It provides too much useless information, is sometimes misleading, and is difficult to read (like PV = 0x5629d24aaa30 "\303\244"\0 [UTF8 "\x{e4}"] versus PV = 0x5629d2414060 "\344"\0). I'd rather write suspicious strings to a file, using UTF-8 encoding, and examine this file with an editor capable of UTF-8 and hex display.

      I'm also using some regular expressions in debugging:

      my $utf8_decodable_regex = qr/[\xC0-\xDF][\x80-\xBF]    |  # 2-byte unicode char
                                    [\xE0-\xEF][\x80-\xBF]{2} |  # 3-byte unicode char
                                    [\xF0-\xFF][\x80-\xBF]{3}    # 4-byte unicode char
                                   /x;

      sub contains_decodable_utf8 {
          $_[0] =~ /$utf8_decodable_regex/;
      }

      sub is_utf8_decodable {
          $_[0] =~ /\A($utf8_decodable_regex|[[:ascii:]])*\z/;
      }
      • If contains_decodable_utf8($string) is false, then you should be fine.
      • If is_utf8_decodable($string) is true, then you can (and should) decode the string.
      • If contains_decodable_utf8($string) is true but is_utf8_decodable($string) is false, then you either have binary data (which might be just fine) or you have already mixed up encodings. Go back in your code and check what you did to $string before.
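      To illustrate how these helpers classify a string before and after encoding, here is a self-contained sketch (it repeats the regex from above so it runs on its own; "café" is just an example string):

```perl
use strict;
use warnings;
use Encode qw(encode);

# The regex from above, repeated so this snippet is self-contained.
my $utf8_decodable_regex = qr/[\xC0-\xDF][\x80-\xBF]    |
                              [\xE0-\xEF][\x80-\xBF]{2} |
                              [\xF0-\xFF][\x80-\xBF]{3}/x;
sub is_utf8_decodable {
    $_[0] =~ /\A(?:$utf8_decodable_regex|[[:ascii:]])*\z/;
}

my $chars  = "caf\x{E9}";                  # "café" as a character string
my $octets = encode('UTF-8', $chars);      # the same text as UTF-8 octets

print is_utf8_decodable($octets) ? "octets: decode them\n" : "octets: leave alone\n";
print is_utf8_decodable($chars)  ? "chars: decode them\n"  : "chars: already decoded\n";
```

      The encoded form tests decodable; the already-decoded form does not, which is exactly the signal described in the bullets above.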

      > So your script itself is in UTF-8?
      Well, not really. Others have already reminded us that UTF-8 is merely a "transport protocol" for multibyte chars - not what you see and get in the text. So if I have in my code something like my $Cyrillic_literal = 'here some Cyr letters but perlmonks.org replaces it with HTML-escapes'; - it is not UTF-8, it is Unicode. If you select and copy that here some Cyr letters but perlmonks.org replaces it with HTML-escapes - your buffer will contain a Unicode string, not a UTF-8 string. So the proper name would be use unicode; and not use utf8;. But it is not me who chose the module name; I'm just an end user. It can be called utf8 or even foobar - as long as it does what's needed (Unicode handling) I do not care too much.

      > If the JSON Cyrillic is not UTF-8, what encoding is it in?
      It is in Unicode. If you want UTF-8, call it like this (note formatversion changed to 1): https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=1&list=allusers&auactiveusers&aufrom=Б
      The major problem of Perl, as I see it (see the module-name question above), is that it treats UTF-8 and Unicode as things of the same kind, while these are two completely different things. From there come all its (de|en)coding mishaps. IMHO.

        Your understanding of the topic is still inadequate. I am convinced that you already know "enough to be dangerous", but not enough to arrive at the correctly modelled solution that most other Perl programmers would implement. It's quite difficult to suss out the parts where you still need education or clearing up a misunderstanding, but I will try anyway.

        One half of what the utf8 pragma does is to allow non-ASCII character literals in the source code; for example, you could simply write

        use utf8;
        my $aufrom = "Барсучелло";
        instead of the much more tedious
        my $aufrom = "\N{CYRILLIC CAPITAL LETTER BE}\N{CYRILLIC SMALL LETTER A}\N{CYRILLIC SMALL LETTER ER}\N{CYRILLIC SMALL LETTER ES}\N{CYRILLIC SMALL LETTER U}\N{CYRILLIC SMALL LETTER CHE}\N{CYRILLIC SMALL LETTER IE}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER EL}\N{CYRILLIC SMALL LETTER O}";
        or unreadable
        my $aufrom = "\N{U+0411}\N{U+0430}\N{U+0440}\N{U+0441}\N{U+0443}\N{U+0447}\N{U+0435}\N{U+043B}\N{U+043B}\N{U+043E}";

        The other half of the utf8 pragma is that Perl expects the source code to be encoded in UTF-8. To be clear, that's the encoding option in the text editor. A hex dump of the file with the first example from above would correctly look like:

        00000000  75 73 65 20 75 74 66 38  3b 20 6d 79 20 24 61 75 use utf8; my $au
        00000010  66 72 6f 6d 20 3d 20 22  d0 91 d0 b0 d1 80 d1 81 from = "Барс
        00000020  d1 83 d1 87 d0 b5 d0 bb  d0 bb d0 be 22 3b 0a    учелло";␊
        
        If you mistakenly saved the text as Windows-1251, the program would no longer work correctly.
        00000000  75 73 65 20 75 74 66 38  3b 20 6d 79 20 24 61 75 use utf8; my $au
        00000010  66 72 6f 6d 20 3d 20 22  c1 e0 f0 f1 f3 f7 e5 eb from = "........
        00000020  eb ee 22 3b 0a                                   ..";␊
        # Malformed UTF-8 character: \xc1\xe0 (unexpected non-continuation byte 0xe0, immediately after start byte 0xc1; need 2 bytes, got 1) at 
        

        Therefore the name utf8 for the pragma really is appropriate, and unicode is not.


        > > what encoding is it in?
        > It is in Unicode.
        Unicode is not a valid encoding name for a JSON file or JSON HTTP response body. Again, if you look at the hex dump for both formatversion=1 and formatversion=2, you will see that both are octets encoded in UTF-8.
        # formatversion=2
        00000000  7b 22 62 61 74 63 68 63  6f 6d 70 6c 65 74 65 22 {"batchcomplete"
        00000010  3a 74 72 75 65 2c 22 63  6f 6e 74 69 6e 75 65 22 :true,"continue"
        00000020  3a 7b 22 61 75 66 72 6f  6d 22 3a 22 d0 91 d0 b0 :{"aufrom":"Ба
        00000030  d1 80 d1 81 d1 83 d1 87  d0 b5 d0 bb d0 bb d0 be рсучелло
        

        # formatversion=1
        00000000  7b 22 62 61 74 63 68 63  6f 6d 70 6c 65 74 65 22 {"batchcomplete"
        00000010  3a 22 22 2c 22 63 6f 6e  74 69 6e 75 65 22 3a 7b :"","continue":{
        00000020  22 61 75 66 72 6f 6d 22  3a 22 5c 75 30 34 31 31 "aufrom":"\u0411
        00000030  5c 75 30 34 33 30 5c 75  30 34 34 30 5c 75 30 34 \u0430\u0440\u04
        00000040  34 31 5c 75 30 34 34 33  5c 75 30 34 34 37 5c 75 41\u0443\u0447\u
        
        To be more precise, formatversion=2 is normal/boring/genuine UTF-8, and formatversion=1 is US-ASCII (which is a proper subset of UTF-8; this means any file in US-ASCII is also valid UTF-8). formatversion=1 achieves this by not using character literals in the encoded JSON; instead it uses \u character escapes (which by design use octets from the US-ASCII repertoire only: the backslash, the small letter u, and the hexadecimal digits 0-9 and a-f).

        Got that? There are two layers of encoding at play here. Both files/responses are UTF-8, therefore the use of the decode_json function, which expects UTF-8 octets as input, is appropriate for both. The function first decodes UTF-8 octets into Perl characters, but it then also decodes JSON \u character escapes into Perl characters.
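        That claim is easy to verify with core JSON::PP: both wire formats, lifted from the hex dumps above, decode to the identical Perl character.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

my $fv2 = qq({"aufrom":"\xd0\x91"});   # formatversion=2: raw UTF-8 octets
my $fv1 = '{"aufrom":"\u0411"}';       # formatversion=1: ASCII \u escape

# decode_json peels both layers: UTF-8 octets and JSON \u escapes.
my $a = decode_json($fv2)->{aufrom};
my $b = decode_json($fv1)->{aufrom};

print $a eq $b ? "both decode to U+0411\n" : "they differ\n";   # both decode to U+0411
```

        So decode_json is the right tool for either formatversion; the difference exists only on the wire.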

        > If the JSON Cyrillic is not UTF-8, what encoding is it in?

        It is in Unicode. You want UTF-8 - call like this (see formatversion changed to 1): https://ru.wikipedia.org/w/api.php?action=query&format=json&formatversion=1&list=allusers&auactiveusers&aufrom=Б The major problem of Perl as I see it (see the module name question higher) that it thinks of UTF-8 and Unicode as something of the same kind while these are two completely different things. From here all its (de|en)coding oops. IMHO.

        I'm sorry, but this is just plain wrong. Perl knows what UTF-8 is (a representation of Unicode in bytes) and what Unicode is (a mapping of characters to numbers). This is not a problem of Perl.

        If you write that your JSON "is in Unicode", then this makes sense for a Perl string which has been properly decoded. Text strings in Perl can contain Unicode characters, and for these strings the term "encoding" doesn't make any sense. Internally Perl might store them UTF-8 encoded, but this is irrelevant for Perl users and has occasionally led users of Devel::Peek to wrong conclusions about the nature of their data. But you can not store a file (source or data) "in Unicode", and you can not get a HTTP response "in Unicode". Whenever data enters or leaves the Perl program or your source-code editor, you need to decide on an encoding for your Unicode strings. Whenever you "see" a Unicode character in an editor window, a console or a web page, some software had to do the mapping from encoded octets to the Unicode code point, and from there to the glyph which is displayed on your screen.

        That said, some Microsoft programs allow you to store text "in Unicode" and then write the data in UTF-16-LE encoding. This often leads to confusion, as does their use of "ANSI encoding" when they mean Windows codepage 1252. There is no Perl pragma to tell Perl that source files are encoded in UTF-16 or Windows-1252.
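        What that Microsoft "Unicode" option actually produces can be reproduced with core Encode; a minimal sketch:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "Unicode" in such save dialogs means UTF-16-LE octets:
my $utf16le = encode('UTF-16LE', "\x{411}");           # Cyrillic Be
printf "%d octets per character\n", length $utf16le;   # 2 octets per character

# Reading such data back needs the matching decode step:
my $char = decode('UTF-16LE', $utf16le);
print $char eq "\x{411}" ? "round trip OK\n" : "mismatch\n";   # round trip OK
```

        Feeding those octets to anything expecting UTF-8 fails immediately, which is why the two "Unicode" formats must never be confused.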

        here some Cyr letters but perlmonks.org replaces it with HTML-escapes

        It does that to code tags for some reason, use pre instead:

        просто другой хакер жемчуга
        

Node Type: perlquestion [id://11105382]
Front-paged by Corion