text encodings and perl

by andal (Friar)
on Nov 12, 2010 at 13:28 UTC ( #871052=perlmeditation )

Text in different languages can be represented in a computer by different sequences of bytes. This is (hopefully) a well-known fact.

Quite often this introduces confusion. Below is an attempt to describe the basics of encoding handling in perl, as I see it after struggling through various documents.

First of all, one should be aware of the difference between the "internal form" of a string and a sequence of octets. If the string is in "internal form", then perl attempts to find "characters" in it. Otherwise, perl simply works with "octets". The data for some string can be obtained from different sources. If those sources are external to perl, then perl can't know how to identify the characters there (the encoding is not known to perl). So the programmer may help here by using the Encode module.

The function Encode::decode is used to tell perl which encoding the "octets" use, so that perl can construct the "internal form" of the string.

The function Encode::encode is used for the reverse operation. When you need to store some "internal form" string in external storage, you must convert it back to octets in the desired encoding.
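
A minimal, self-contained sketch of both directions (the Latin-1 sample data is just an assumption for the example):

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Octets as they might arrive from a file or a socket: "héllo" in Latin-1.
    my $octets = "h\xE9llo";

    # Tell perl which encoding the octets use; the result is the "internal form".
    my $string = decode('ISO-8859-1', $octets);
    printf "%d characters\n", length $string;                 # 5

    # Going back out: convert the internal form to octets in the desired encoding.
    my $out = encode('UTF-8', $string);
    printf "%d octets when encoded as UTF-8\n", length $out;  # 6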

To see whether some string is in the "internal form", one can use the Encode::is_utf8 function. Note that it is not so important which encoding is used by the "internal form". It can be any. What is important is only that it is "internal", so it shouldn't be passed to external entities.
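
A quick illustration of the flag (note that, as the discussion further down points out, the flag only tells you how perl stores the string internally, not whether it has been decoded):

    use Encode qw(decode is_utf8);

    my $octets = "h\xE9llo";
    print is_utf8($octets) ? "internal\n" : "octets\n";    # octets

    my $string = decode('ISO-8859-1', $octets);
    print is_utf8($string) ? "internal\n" : "octets\n";    # internal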

These are the basics. There are a few "short-cuts" for going from "octets" to "internal form" strings and back, like binmode($fh, ":encoding(foo)") or "use utf8", but they essentially do to strings what Encode does.
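
A short sketch of the I/O-layer short-cut; the file name and its assumed Latin-1 encoding are only for illustration:

    use strict;
    use warnings;

    # The :encoding() layer decodes on read and encodes on write,
    # so no explicit Encode::decode/Encode::encode calls are needed.
    binmode STDOUT, ':encoding(UTF-8)';

    open my $in, '<:encoding(ISO-8859-1)', 'input.txt' or die $!;
    while (my $line = <$in>) {    # $line arrives already in "internal form"
        print uc $line;           # encoded back to UTF-8 on the way out
    }
    close $in;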

These basics helped me to understand things written in "perldoc perlunicode", "perldoc utf8", "perldoc encoding" etc. I hope that this might be of some help to others as well.

Re: text encodings and perl
by moritz (Cardinal) on Nov 12, 2010 at 13:56 UTC

    Thanks for sharing your thoughts. I have a small nit however:

    If the string is in "internal form" then perl attempts to find "characters" in it. Otherwise, perl simply works with "octets".

    This is true for the length function, but for most other operations it is not. For functions like uc and print it is the operation that sets the context.

    Just like concatenation imposes string context, and multiplication numeric context, print imposes "binary" context (and encodes and warns if necessary), and uc imposes "character" context (and decodes with Latin-1 if the string holds undecoded octets).
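
    A small demonstration of those two contexts (the unicode_strings feature, available from perl 5.12 on, is assumed here so that uc applies Latin-1 rules to an undecoded byte string):

        use strict;
        use warnings;
        use feature 'unicode_strings';   # perl 5.12+: Unicode rules for byte strings

        # 0xE9 is "é" in Latin-1; uc imposes character context and upcases it
        # even though the string was never explicitly decoded.
        printf "%vX\n", uc "\xE9";       # C9 ("É")

        # print imposes "binary" context: a code point above 0xFF cannot be
        # written as a single octet, so perl warns ("Wide character in print")
        # and falls back to writing the string as UTF-8.
        print "\x{263A}\n";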

    (Self-promotion: I've written a similar document on encodings and Unicode in Perl, though a bit longer. I hope you find it useful.)

      Just like concatenation imposes string context, and multiplication numeric context, print imposes "binary" context (and encodes and warns if necessary), and uc imposes "character" context (and decodes with Latin-1 if the string holds undecoded octets).

      Well, my point was different. It is correct that perl does certain conversions behind the scenes, and certain warnings are given out because perl has to produce a result. But my point was that, without the help of the developer, perl cannot always do the correct thing. It just does what works most of the time. The context is imposed, but if the string is not in the proper internal form, then the "characters" that perl works with might be quite wrong from the developer's standpoint.

      I can give a couple of examples of the kind of confusion I had in mind.

      The module MP3::Tag::ID3v2 provides the method "get_frame", which returns its string as a sequence of octets. So to convert the encoding, the developer has to use "Encode::from_to". But the method "change_frame" of the same module expects the string in "internal form", because internally it calls Encode::encode on its input. So the developer cannot pass the string returned by "get_frame" as input to "change_frame" unless he calls "Encode::decode" on it first.
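
      A sketch of the round trip described above; the file name, the frame ID and the assumed Latin-1 encoding are purely illustrative, not taken from the module's documentation:

          use strict;
          use warnings;
          use MP3::Tag;
          use Encode qw(decode);

          my $mp3 = MP3::Tag->new('song.mp3');
          $mp3->get_tags;
          my $id3v2 = $mp3->{ID3v2};

          # get_frame hands back octets ...
          my ($info) = $id3v2->get_frame('TIT2');

          # ... but change_frame wants the "internal form", so decode first.
          my $title = decode('ISO-8859-1', $info);
          $id3v2->change_frame('TIT2', $title);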

      Another example: the DBD modules may return strings from databases either as octets or in "internal form". But if you pass these strings to, say, the Gtk2 modules, they must be in "internal form" only. So the developer has to take care about what kind of output he or she gets from the DBD modules.
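
      For example (the SQLite connection and the assumption that the driver hands back undecoded UTF-8 octets are both placeholders):

          use strict;
          use warnings;
          use DBI;
          use Encode qw(decode);

          my $dbh = DBI->connect('dbi:SQLite:dbname=test.db', '', '',
                                 { RaiseError => 1 });
          my ($name) = $dbh->selectrow_array('SELECT name FROM users WHERE id = 1');

          # Decode once, at the boundary, before handing the value to Gtk2 and friends.
          my $text = decode('UTF-8', $name);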

      I believe that part of the confusion lies in badly written modules. Since perl provides the function "is_utf8", it is very easy to check what kind of input the user has provided and to use "Encode::encode" or "Encode::decode" as appropriate to get the desired form. But we have what we have, so developers have to watch out for the type of strings they work with.

        I believe that part of the confusion lies in badly written modules. Since perl provides the function "is_utf8", it is very easy to check what kind of input the user has provided and to use "Encode::encode" or "Encode::decode" as appropriate to get the desired form.

        You seem to equate "is a text string" with "is_utf8 returns true". That's wrong.

        Perl has two possible internal formats: Latin-1 and UTF-8. It is perfectly fine for a decoded text string to be stored as Latin-1. Decoding it again just because is_utf8 returns false is simply wrong.

        Example (on a UTF-8 terminal; note that -CS sets the :encoding(UTF-8) layer on STDOUT, among other things):

        $ perl -CS -Mstrict -wE 'say "\x{ab}oo\x{bb}"'
        «oo»
        $ perl -CS -Mstrict -wE 'say utf8::is_utf8 "\x{ab}oo\x{bb}"'
        $ # let's convince ourselves that lc() works properly:
        $ perl -CS -Mstrict -wE 'say "\x{C6}"'
        Æ
        $ perl -CS -Mstrict -wE 'say lc "\x{C6}"'
        æ
        $ perl -CS -Mstrict -wE 'say utf8::is_utf8 "\x{C6}"'
        $

        Summary: strings internally stored as Latin-1 can be perfectly fine text strings. Trying to use is_utf8 to determine whether a string holds characters or octets is wrong.

        In fact, every string can be seen as a text string (which is how functions like lc and uc treat it), though if you forgot to decode the input data, the user will be surprised by the result.
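
        A small sketch of that last point (unicode_strings, perl 5.12+, is used so that lc on the byte string behaves predictably): lowercasing works on both strings, but only the decoded one gives the intended result.

            use strict;
            use warnings;
            use feature 'unicode_strings';
            use Encode qw(encode);

            my $text   = "\x{C9}sope";            # "Ésope", a decoded text string
            my $octets = encode('UTF-8', $text);  # the same text, left undecoded

            printf "%vX\n", lc $text;    # E9.73.6F.70.65     -- "ésope"
            printf "%vX\n", lc $octets;  # E3.89.73.6F.70.65  -- not the UTF-8 for "ésope"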

Re: text encodings and perl
by graff (Chancellor) on Nov 13, 2010 at 02:45 UTC
    I'm sure that anyone getting started with unicode in perl will find your explanation useful -- nice post. But I think this part is a bit misleading:
    Note, it is not so important which encoding is used by the "internal form". It can be any. Important is only that it is "internal", so it shouldn't be passed to external entities.

    First, it actually is important that the "internal form" is (very much like) utf8 unicode. This means that ASCII characters actually are ASCII (single-byte) characters, while everything really is Unicode (*), so that:

    • the Unicode character properties work as expected in regular expressions
    • Unicode code point numerics (e.g. "\x{abcd}") can be used in regexes or double-quoted strings
    • character normalization works according to Unicode specifications (cf. Unicode::Normalize),
    • normal string sorting works according to the established Unicode code-point order
    • other collations (e.g. character sort ordering for particular languages) implement Unicode-based specifications (see various Unicode::Collate modules on CPAN).
    All that stuff tends to make multi-language string processing a lot easier.
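
    A couple of those points in action (only core modules; the sample characters are arbitrary):

        use strict;
        use warnings;
        use Unicode::Normalize qw(NFC);

        # Unicode properties and code-point escapes in regexes and literals:
        my $word = "\x{4F60}\x{597D}";
        print "contains Han characters\n" if $word =~ /\p{Han}/;

        # Normalization: composed and decomposed "é" compare equal after NFC.
        my $composed   = "\x{E9}";
        my $decomposed = "e\x{301}";
        print "equal after NFC\n" if NFC($composed) eq NFC($decomposed);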

    Second, as for passing "internal format" strings to "external entities", this isn't necessarily a problem. A "perl-internal" utf8 string can be passed for insertion into a database table via DBI without further ado, or printed directly to a file handle if the file was opened for output with the ":utf8" IO layer.

    (* Update: well, the characters in the range U+0080 - U+00FF have some "special behaviors", but they really can be treated just like any other non-ASCII character.)
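
    A minimal sketch of the second point: printing an "internal form" string through an output layer, so no explicit Encode::encode call is needed (the file name is made up; the ":utf8" layer mentioned above is used here, with ":encoding(UTF-8)" being the stricter alternative):

        use strict;
        use warnings;

        my $string = "\x{263A} smiley";    # internal-form text

        open my $fh, '>:utf8', 'out.txt' or die $!;
        print {$fh} $string, "\n";         # encoded on the way out by the layer
        close $fh;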

      it actually is important that the "internal form" is (very much like) utf8 unicode.
      I'm not sure I can follow your arguments. Which of those desirable properties wouldn't be possible if Perl had a different internal unicode string representation? Other languages like Java or Python have chosen different internal representations, yet they are perfectly capable of doing regex matches or parsing string literals (analogous to "\x{abcd}" in Perl) into their internal form.

      It's just a matter of how things are implemented. Of course, different implementations have different pros and cons with respect to performance (speed/memory) or ease of implementation, but I don't see why utf8 would be required as the internal form to realize the properties you mentioned.

Re: text encodings and perl
by sundialsvc4 (Abbot) on Nov 15, 2010 at 19:21 UTC

    Text encodings can be very difficult to work with, partly because there is no absolutely-reliable way to detect what kind of string you are dealing with.   There are at least three different strategies in common use (my terms...)

    • Straight bytes:   “A character is a byte, and a byte is a character.”   But you do not necessarily know what printable character corresponds to a particular byte ... particularly for values beyond 127.
    • Double-byte character sets (DBCS):   Most characters are “straight bytes,” but there are a few “lead-in/lead-out characters” which introduce exceptions to that rule.   When a lead-in is seen, subsequent characters are represented by two bytes until a lead-out is seen.   (The person who devised this scheme should be drawn and quartered... but disk-drives and RAM chips were so much smaller then.)
    • n-byte encodings:   A character corresponds to n bytes, and each character corresponds to the same number of bytes.   Unicode is such a system.

    Each of these schemes requires some amount of knowledge that may not be determinable by examining just the data itself.
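
    For instance, the same pair of octets decodes to quite different text depending on which encoding you assume, and nothing in the data itself settles the question:

        use strict;
        use warnings;
        use Encode qw(decode);

        my $octets = "\xC3\xA9";
        printf "as UTF-8:      %d character(s)\n", length decode('UTF-8',      $octets);  # 1 ("é")
        printf "as ISO-8859-1: %d character(s)\n", length decode('ISO-8859-1', $octets);  # 2 ("Ã©")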

      ... Unicode is such a system.
      This is just so wrong. For one, Unicode is not an encoding. Rather, UTF-8, UTF-16 etc. are encodings. And a rather common one of them - UTF-8 - is variable-width, i.e. not the same number of bytes per character...

        Thank you for the clarification.   I have revised the post, humbly eating my own words.

        For one, Unicode is not an encoding. Rather, UTF-8, UTF-16 etc. are encodings. And a rather common one of them — UTF-8 — is variable-width, i.e. not same number of bytes per character.

        Both UTF‑8 and also UTF‑16 as well are variable‐width encodings. The essential difference is the size of the code units. There is an infinitude of Java and Windows code (but not necessarily both) out there that screws this up, thinking that UTF‑16 is UCS‑2. It very much is not so.

        Plus UCS‑2 isn’t even a valid Unicode encoding in the first place. UTF‑8, UTF‑16, and UTF‑32 are, and of those, only the last uses fixed‐width code units. UTF‑16 is problematic and annoying in several ways that do not affect either UTF‑8 or UTF‑32, but that doesn’t make it fixed width.

        So the same statement as you’ve made about UTF‑8 applies equally well, mutatis mutandis, to UTF‑16: “UTF‑16 is also a variable‐width encoding, i.e. not the same number of 16‑bit code units per character.” It would be a very, very good idea to remain ever conscious of this, given how much harm has been done by negligent programmers who have not done so.
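
        A quick way to see this from Perl (Encode takes care of the surrogate pair):

            use strict;
            use warnings;
            use Encode qw(encode);

            my $bmp    = "\x{20AC}";    # EURO SIGN, inside the BMP
            my $beyond = "\x{1F600}";   # an emoji, outside the BMP

            # UTF-16 needs one code unit for the first, a surrogate pair for the second.
            printf "U+20AC  as UTF-16BE: %d octets (1 code unit)\n",  length encode('UTF-16BE', $bmp);     # 2
            printf "U+1F600 as UTF-16BE: %d octets (2 code units)\n", length encode('UTF-16BE', $beyond);  # 4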

      Each of these schemes requires some amount of knowledge that may not be determinable by examining just the data itself.

      Just to make sure: I didn't imply anywhere that the developer should determine the encoding by examining the data. Personally, I believe that guessing the encoding is a sin. It should be done only if there is no other choice. It is much better to force the user to provide the information about the encoding if it is not already known.
