PerlMonks  

Re^2: text encodings and perl

by andal (Hermit)
on Nov 13, 2010 at 21:06 UTC


in reply to Re: text encodings and perl
in thread text encodings and perl

Just like concatenation imposes string context, and multiplication numeric context, print imposes "binary" context (and encodes and warns if necessary), and uc imposes "character" context (and decodes with Latin-1 if the string holds undecoded octets).

Well, my point was different. It is correct that perl does certain conversions behind the scenes, and that certain warnings are given because perl has to produce a result. But my point was that without the help of the developer, perl cannot do 100% correct work. It just does what works most of the time. The context is imposed, but if the string is not in the proper internal form, then the "characters" that perl works with might be quite wrong from the developer's standpoint.

Here are examples of the kind of confusion I had in mind.

The module MP3::Tag::ID3v2 provides the method "get_frame", which returns a string as a sequence of octets. So to convert the encoding, the developer has to use "Encode::from_to". But the method "change_frame" of the same module expects a string in "internal form", because internally it calls Encode::encode on the input. So the developer can't pass the string returned by "get_frame" as input to "change_frame" without first calling "Encode::decode" on it.
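The mismatch described above can be sketched without the module itself. In this minimal example, get_octets and put_characters are hypothetical stand-ins (they are not the real MP3::Tag::ID3v2 methods): one hands back raw UTF-8 octets, the other expects decoded text and encodes it internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-ins for the two kinds of API described above.
# They are NOT the real MP3::Tag::ID3v2 methods.
sub get_octets {
    return "\xd0\xa7";    # UTF-8 octets of U+0427 (Cyrillic capital letter Che)
}

sub put_characters {
    my ($text) = @_;
    return encode('UTF-8', $text);    # expects decoded text, encodes it itself
}

my $octets = get_octets();

# Wrong: the two octets are taken as two Latin-1 characters and encoded again.
my $double = put_characters($octets);

# Right: decode first, so the setter receives a single character.
my $ok = put_characters(decode('UTF-8', $octets));

printf "without decode: %d octets\n", length $double;
printf "with decode:    %d octets\n", length $ok;
```

Passing the octets straight through produces four octets (a double-encoded string), while decoding first round-trips to the original two.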

Another example: the DBD modules may return strings from databases either as octets or in "internal form". But if you pass these strings to, say, the Gtk2 modules, they must be in "internal form" only. So the developer has to take care to know what kind of output he/she gets from the DBD modules.

I believe that part of the confusion lies in badly written modules. Since perl provides the function "is_utf8", it is very easy to check what kind of input the user has provided and use the appropriate "Encode::encode" or "Encode::decode" to get the desired form. But we have what we have, so developers have to watch out for the type of strings they work with.

Replies are listed 'Best First'.
Re^3: text encodings and perl
by moritz (Cardinal) on Nov 14, 2010 at 09:59 UTC
    I believe that part of the confusion lies in badly written modules. Since perl provides the function "is_utf8", it is very easy to check what kind of input the user has provided and use the appropriate "Encode::encode" or "Encode::decode" to get the desired form.

    You seem to equate "is a text string" with "is_utf8 returns true". That's wrong.

    Perl has two possible internal formats: Latin-1 and UTF-8. It is perfectly fine for a decoded text string to be stored in Latin-1. Decoding it again just because is_utf8 returns false is simply wrong.

    Example (on a UTF-8 terminal; note that -CS sets the :encoding(UTF-8) layer on STDOUT, among other things):

    $ perl -CS -Mstrict -wE 'say "\x{ab}oo\x{bb}"'
    «oo»
    $ perl -CS -Mstrict -wE 'say utf8::is_utf8 "\x{ab}oo\x{bb}"'

    $ # let's convince ourselves that lc() works properly:
    $ perl -CS -Mstrict -wE 'say "\x{C6}"'
    Æ
    $ perl -CS -Mstrict -wE 'say lc "\x{C6}"'
    æ
    $ perl -CS -Mstrict -wE 'say utf8::is_utf8 "\x{C6}"'

    $

    Summary: Strings internally stored as Latin 1 can be perfectly fine text strings. Trying to use is_utf8 to determine whether a string holds characters or octets is wrong.

    In fact, every string can be seen as a text string (which is what functions like lc and uc do), though if you forgot to decode the input data, the user will be surprised by the result.
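The point about the two internal formats can be checked in a script as well. This is a minimal sketch, assuming perl 5.12 or later so that the unicode_strings feature is available (the one-liners above get it implicitly through -E); the variable names are only illustrative.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # character semantics for lc/uc, perl 5.12+

my $text = "\x{C6}";      # U+00C6, stored internally as a single Latin-1 byte
my $copy = $text;
utf8::upgrade($copy);     # same character, now stored internally as UTF-8

my $flag_plain    = utf8::is_utf8($text) ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($copy) ? 1 : 0;
my $same          = $text eq $copy ? 1 : 0;
my $lc_ord        = ord lc $text;

print "flag on original: $flag_plain\n";
print "flag on upgraded: $flag_upgraded\n";
print "strings equal:    $same\n";
printf "lc gives U+%04X\n", $lc_ord;
```

The flag differs between the two copies, yet they compare equal and lc maps U+00C6 to U+00E6 either way, so the flag says nothing about whether the string is text.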

      You seem to equate "is a text string" with "is_utf8 returns true". That's wrong.

      No, I don't. I just said that when "is_utf8" returns true, perl thinks it knows the encoding of the text in the string. If the return value is "false", then perl does not know which encoding is really used, so it may use Latin1, or whatever is found working "most of the time". Of course, a text string stays a text string independent of the is_utf8 flag. What differs is what perl can do with the string.

      Just to illustrate: in your examples, try using Russian letters written as UTF-8 octets and then apply "lc" to those strings, i.e.

      $ perl -CS -Mstrict -wE 'say lc "\xd0\xa7"'
      $ perl -CS -Mstrict -wE 'say "\xd0\xa7"'
      
      This displays garbage instead of the Russian letter "Ч". Change it to
      $ perl -CS -Mstrict -MEncode -wE 'say lc Encode::decode("UTF-8", "\xd0\xa7")'
      $ perl -CS -Mstrict -MEncode -wE 'say Encode::decode("UTF-8", "\xd0\xa7")'
      
      And you'll get the correct output. In fact, in your examples perl effectively calls the Encode::decode but with parameter "Latin1" instead of "UTF-8", that is why your letter (from Latin1) is displayed correctly, but mine (UTF-8) is not.
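The same comparison can be made without a terminal by looking at lengths and code points. This is a minimal sketch, assuming perl 5.12 or later for the unicode_strings feature (which the -E one-liners enable implicitly); the variable names are only illustrative.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # same lc semantics as the -E one-liners
use Encode qw(decode);

my $octets = "\xd0\xa7";                  # UTF-8 encoding of U+0427
my $chars  = decode('UTF-8', $octets);    # one decoded character

# Without decoding, each octet is treated as its own character.
my $garbled = lc $octets;

printf "undecoded: %d characters\n", length $octets;
printf "decoded:   %d character, U+%04X\n", length $chars, ord $chars;
printf "lc on decoded text: U+%04X\n", ord lc $chars;
printf "lc on raw octets:   %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $garbled;
```

Only the decoded string yields U+0447, the lowercase form of "Ч". The undecoded one is processed as the two characters U+00D0 and U+00A7.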

        If the return value is "false", then perl does not know which encoding is really used, so it may use Latin1, or whatever is found working "most of the time"

        I'm curious how you came to that conclusion. For any text operation, perl has to assume an encoding. It uses UTF-8 if the utf8 flag is present on the string, and Latin 1 otherwise (assuming you didn't mess with locales). It has no notion of "working" and "most of the time".

        I'm well aware of when I need to decode, and when not. And my point was that deciding this question based on the return value of is_utf8 is wrong.

        In fact, in your examples perl effectively calls the Encode::decode but with parameter "Latin1" instead of "UTF-8"

        It doesn't. Because Latin-1 strings themselves can be perfectly fine text strings.

        If you don't believe me, add a warn to the Encode::decode() function of your local perl installation. You'll observe no such call.
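Why the flag is a poor basis for deciding whether to decode can be shown with a short sketch. Here naive_to_text is a hypothetical helper written only for this illustration; it implements the "decode when is_utf8 is false" idea and is applied to a string that is already text but happens to be stored as Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, for illustration only: decode whenever the flag is off.
sub naive_to_text {
    my ($s) = @_;
    return utf8::is_utf8($s) ? $s : decode('UTF-8', $s);
}

my $text   = "caf\x{e9}";            # already text, stored internally as Latin-1
my $result = naive_to_text($text);   # gets decoded a second time

# Print code points rather than the characters themselves, so the output
# does not depend on the terminal encoding.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, $_[0] }

print "input:  ", codepoints($text),   "\n";
print "output: ", codepoints($result), "\n";
```

The lone byte 0xE9 is not valid UTF-8, so the second decode damages the last character (with the default CHECK value it is typically replaced by U+FFFD) even though the input was a correct text string.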

      Summary: Strings internally stored as Latin 1 can be perfectly fine text strings. Trying to use is_utf8 to determine whether a string holds characters or octets is wrong.

      Well, I never said anything against that. I guess the misunderstanding comes from the terms "characters" and "octets". These terms are used by perlunicode, so I used them here as well. In no way am I implying that strings without the utf8 flag will never hold "characters". Of course perl will find "characters" in those strings in contexts where it has to find "characters". The opposite is also true: perl will find "octets" in strings with the utf8 flag set when the context demands it.

      In my original post, the word "character" stood for CORRECT characters, not just some deduced characters. So if the developer called Encode::decode, the character values will be correct; otherwise they'll be correct only if the octets happen to be Latin1-encoded. I hope this clarifies things.
