PerlMonks  

Re: text encodings and perl

by moritz (Cardinal)
on Nov 12, 2010 at 13:56 UTC (#871057)


in reply to text encodings and perl

Thanks for sharing your thoughts. I have a small nit, however:

If the string is in "internal form" then perl attempts to find "characters" in it. Otherwise, perl simply works with "octets".

This is true for the length function, but for most operations it is not. For functions like uc and print, it is the operation that sets the context.

Just as concatenation imposes string context and multiplication imposes numeric context, print imposes "binary" context (and encodes and warns if necessary), and uc imposes "character" context (and decodes as Latin-1 if the string holds undecoded octets).
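
A minimal sketch of the "operation sets the context" point, assuming a perl with the unicode_strings feature (5.12 or later) so that uc applies character semantics to every string. The euro sign and the variable names are illustrative choices, not taken from the thread. uc on the undecoded UTF-8 octets treats each octet as a Latin-1 character and changes one of them, while uc on the decoded string sees a single character and leaves it alone.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # assumption: perl 5.12+; uc/lc use character semantics on every string
use Encode qw(decode);

my $octets = "\xe2\x82\xac";             # UTF-8 octets of U+20AC (euro sign), not decoded
my $text   = decode('UTF-8', $octets);   # one character, U+20AC

# uc sees three Latin-1 characters; \xe2 (a with circumflex) is uppercased to \xc2
my $uc_octets = uc $octets;

# uc sees one character that has no uppercase mapping and leaves it unchanged
my $uc_text = uc $text;

# %vX prints the ordinal of every character in hex, joined by dots
printf "octets: %vX -> %vX\n", $octets, $uc_octets;
printf "text:   %vX -> %vX\n", $text,   $uc_text;
```

The altered octet sequence is no longer the UTF-8 encoding of the euro sign, which is the kind of surprise that decoding the input first avoids.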

(self promotion: I've written a similar document on encodings and Unicode in Perl, though a bit longer. I hope you find it useful).


Re^2: text encodings and perl
by andal (Friar) on Nov 13, 2010 at 21:06 UTC
    Just as concatenation imposes string context and multiplication imposes numeric context, print imposes "binary" context (and encodes and warns if necessary), and uc imposes "character" context (and decodes as Latin-1 if the string holds undecoded octets).

    Well, my point was different. It is correct that perl does certain conversions behind the scenes, and that certain warnings are given because perl has to produce a result. But my point was that, without the help of the developer, perl cannot do 100% correct work. It just does what works most of the time. The context is imposed, but if the string is not in proper internal form, then the "characters" that perl works with might be quite wrong from the developer's standpoint.

    I can give you examples of the kind of confusion I had in mind.

    The module MP3::Tag::ID3v2 provides the method "get_frame", which returns a string as a sequence of octets. So to convert the encoding, the developer has to use "Encode::from_to". But the method "change_frame" of the same module expects a string in "internal form", because internally it calls Encode::encode on the input. So the developer can't pass the string returned by "get_frame" as input to "change_frame" unless he calls "Encode::decode" on it first.
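
A minimal sketch of that mismatch. The subs fetch_octets and store_text are hypothetical stand-ins written for this illustration; they are not the MP3::Tag::ID3v2 API and only mimic the behavior described above (one side hands out octets, the other calls Encode::encode on whatever it receives). The sample value is assumed to be UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-ins for illustration only; NOT the MP3::Tag::ID3v2 API.
sub fetch_octets { return "\xd0\xa7" }      # returns UTF-8 octets of U+0427
sub store_text {                            # encodes its input, as a consumer of "internal form" would
    my ($text) = @_;
    return encode('UTF-8', $text);
}

my $raw = fetch_octets();

# Passing the octets straight through: each octet is encoded again as if it were a character
my $double = store_text($raw);

# Decoding first gives the consumer the character string it expects
my $clean = store_text(decode('UTF-8', $raw));

printf "without decode: %vX\n", $double;
printf "with decode:    %vX\n", $clean;
```

Without the decode step the two original octets come back as four, which is the classic double-encoding symptom.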

    Another example: the DBD modules may return strings from databases either as octets or in "internal form". But if you pass these strings to, say, the Gtk2 modules, then they must be in "internal form" only. So the developer has to take care to know what kind of output he/she gets from the DBD modules.

    I believe that part of the confusion lies in badly written modules. Since perl provides the function "is_utf8", it is very easy to check what kind of input the user has provided and use the appropriate "Encode::encode" or "Encode::decode" to get the desired form. But we have what we have, so developers have to watch out for the type of strings they work with.

      I believe that part of the confusion lies in badly written modules. Since perl provides the function "is_utf8", it is very easy to check what kind of input the user has provided and use the appropriate "Encode::encode" or "Encode::decode" to get the desired form.

      You seem to equate "is a text string" with "is_utf8 returns true". That's wrong.

      Perl has two possible internal formats: Latin-1 and UTF-8. It is perfectly fine for a decoded text string to be stored as Latin-1. Decoding it again just because is_utf8 returns false is simply wrong.

      Example (on a UTF-8 terminal; note that -CS sets the :encoding(UTF-8) layer on STDOUT, among other things):

      $ perl -CS -Mstrict -wE 'say "\x{ab}oo\x{bb}"'
      «oo»
      $ perl -CS -Mstrict -wE 'say utf8::is_utf8 "\x{ab}oo\x{bb}"'

      $ # let's convince ourselves that lc() works properly:
      $ perl -CS -Mstrict -wE 'say "\x{C6}"'
      Æ
      $ perl -CS -Mstrict -wE 'say lc "\x{C6}"'
      æ
      $ perl -CS -Mstrict -wE 'say utf8::is_utf8 "\x{C6}"'

      $

      Summary: Strings internally stored as Latin-1 can be perfectly fine text strings. Trying to use is_utf8 to determine whether a string holds characters or octets is wrong.

      In fact, every string can be seen as a text string (which is what functions like lc and uc do), though if you forgot to decode the input data, the user will be surprised by the result.
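
The same point as a small script rather than one-liners, a sketch with illustrative variable names: two strings holding identical characters, one stored internally as Latin-1 and one as UTF-8. utf8::upgrade changes only the internal storage, so the strings compare equal and have the same length, yet utf8::is_utf8 reports different values for them.

```perl
use strict;
use warnings;

my $latin1_stored = "\x{e4}bc";     # text string, stored internally as Latin-1
my $utf8_stored   = $latin1_stored;
utf8::upgrade($utf8_stored);        # same characters, now stored internally as UTF-8

my $flag_latin1 = utf8::is_utf8($latin1_stored) ? 1 : 0;
my $flag_utf8   = utf8::is_utf8($utf8_stored)   ? 1 : 0;
my $same_text   = ($latin1_stored eq $utf8_stored) ? 1 : 0;

printf "is_utf8: %d vs %d\n", $flag_latin1, $flag_utf8;
printf "equal as text: %d\n", $same_text;
printf "length: %d vs %d\n", length($latin1_stored), length($utf8_stored);
```

The flag describes storage only, so it cannot say whether the programmer has decoded the data.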

        You seem to equate "is a text string" with "is_utf8 returns true". That's wrong.

        No, I don't. I just said that when "is_utf8" returns true, perl thinks it knows the encoding of the text in the string. If the return value is false, then perl does not know which encoding is really used, so it may use Latin-1, or whatever is found to work "most of the time". Of course, a text string stays a text string regardless of the is_utf8 flag. What differs is what perl can do with the string.

        Just to illustrate it, try using Russian letters written as UTF-8 octets in your examples and then apply "lc" to those strings, i.e.

        $ perl -CS -Mstrict -wE 'say lc "\xd0\xa7"'
        $ perl -CS -Mstrict -wE 'say "\xd0\xa7"'
        
        This displays garbage instead of the Russian letter "Ч". Change it to
        $ perl -CS -Mstrict -MEncode -wE 'say lc Encode::decode("UTF-8", "\xd0\xa7")'
        $ perl -CS -Mstrict -MEncode -wE 'say Encode::decode("UTF-8", "\xd0\xa7")'
        
        And you'll get the correct output. In fact, in your examples perl effectively calls Encode::decode, but with the parameter "Latin1" instead of "UTF-8". That is why your letter (from Latin-1) is displayed correctly, but mine (UTF-8) is not.
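
A minimal sketch of that claim, with illustrative variable names. Decoding the two octets as ISO-8859-1 yields exactly the characters perl already sees when it treats the undecoded string as text, whereas decoding them as UTF-8 yields the single intended character, on which lc then does the right thing.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xd0\xa7";                          # UTF-8 octets of U+0427

my $as_latin1 = decode('ISO-8859-1', $octets);    # two characters: U+00D0, U+00A7
my $as_utf8   = decode('UTF-8',      $octets);    # one character:  U+0427

# The undecoded string, read as characters, equals the Latin-1 decoding
my $implicit_matches_latin1 = ($as_latin1 eq $octets) ? 1 : 0;

# lc on the properly decoded string maps U+0427 to U+0447
my $lower = lc $as_utf8;

printf "Latin-1 reading: %vX\n", $as_latin1;
printf "UTF-8 reading:   %vX, lowercased: %vX\n", $as_utf8, $lower;
printf "implicit reading equals Latin-1 decoding: %d\n", $implicit_matches_latin1;
```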

        Summary: Strings internally stored as Latin-1 can be perfectly fine text strings. Trying to use is_utf8 to determine whether a string holds characters or octets is wrong.

        Well, I never said anything against this. I guess the misunderstanding comes from the use of the terms "characters" and "octets". These terms are used by perlunicode, so I've used them here as well. In no way am I implying that strings with utf8 flags will never have "characters". Of course perl will find "characters" in those strings in contexts where it should find "characters". The opposite is also true: perl will find "octets" in strings with the utf8 flag set when the context demands it.

        In my original post, the word "character" stood for CORRECT characters, not just some deduced characters. So if the developer called Encode::decode, then the character values will be correct; otherwise they'll be correct only if the octets happen to use the Latin-1 encoding. I hope this clarifies things.
