http://blog.360.yahoo.com/blog-8_S91Lc7dKj4rz8iueEaVlexawc_ZFVpd4JK?p=18

Perl Unicode in Practice

**In my previous post, I talked about some reasons for common problems I've had with Unicode text in Perl. Here, I'll discuss how to deal with them.

This all applies to Perl 5.8 and up. Unicode in Perl 5.6 is a bitch. Before 5.6, don't even try.

In Perl, a string is either a sequence of raw bytes, or a proper Unicode string. An internal flag (I call it the UTF8 flag) indicates a Unicode string. If you don't like pain, then you should make sure that all your strings are Unicode strings.

Unless you tell Perl otherwise, input from files, sockets, databases, etc. is not made into Unicode strings, but is rather treated as a latin1-encoded byte stream. Here's how to make sure your input is correctly being treated as UTF8, and that all your strings are Unicode strings.

- First, make sure that you put your IO streams, including STDIN and STDOUT, in UTF8 mode, using binmode(HANDLE, ":utf8"). If your input is not UTF8, then read "perldoc encoding".

- Insert statements at various points in your program that verify that your string is a valid Unicode string containing the data you think it contains. Perform these verifications after receiving data from an external source such as a database or a socket, which will often return raw UTF8 bytes and not a Unicode string; or anywhere else your fine-tuned debugging instincts indicate.

    use Carp;    # provides confess
    use Encode;

    my $string = ...get string from somewhere...;
    Encode::is_utf8($string)
        or confess "This string is not a Unicode string (doesn't have the UTF8 flag on)!";

- If your string is a Unicode string, find out if it's valid. It's pretty common for the UTF8 flag to be accidentally turned on in a string of, say, latin1 bytes. If this happens, you'll get a corrupt string, but Perl may not complain about it. So do:

    utf8::valid($string) or confess "Corrupt UTF8 string!";

- If your string is not a Unicode string, convert it!
Encode::decode will take a string of bytes (without the UTF8 flag), interpret it according to the given encoding, and produce a valid Unicode string.

    my $unicode_string = Encode::decode( 'latin1', $string, 1);

Note it's extremely common for a string to be encoded in UTF8, /even though the UTF8 flag is not on/. This is because, unless you explicitly tell Perl that your data is UTF8, it won't turn the UTF8 flag on. If you are reading UTF8 bytes from a socket, a database, etc., then Perl won't know they are UTF8 unless you do this:

    my $unicode_string = Encode::decode( 'utf8', $string, 1);

The last argument tells Encode to die if the string can't be interpreted as the given encoding. Any sequence of bytes can be interpreted as latin1 (and as any other single-byte encoding that assigns a character to every byte value), so decode will never die for these.

- Don't know what encoding your string is using? Try all of them:

    # This should already have been done
    # Your terminal must support UTF8!
    binmode(STDOUT, ":utf8");

    my @all_encodings = Encode->encodings(":all");

    ENCODING:
    foreach my $enc (@all_encodings) {
        # Try interpreting bytes under this encoding.
        # Decode a copy of the original string, because decode may
        # modify its argument when the check argument is set.
        # Dies or returns blank on error.
        my $decoded = eval { Encode::decode($enc, my $copy = $string, 1) };
        next ENCODING if $@ or not $decoded;
        print "Interpreted under $enc: [$decoded]\n";
    }

*Putting it all together*

Before the Unicode days, the standard encoding for Perl was latin1, so you may want your code to support latin1 byte strings, which are perfectly valid, in addition to Unicode strings.

The following subroutine will verify that you have either a valid Unicode or a valid latin1 string, and will output a valid Unicode string. It will also tell you if you suffer from the common situation of having a sequence of UTF8 bytes without the UTF8 flag on.
    use Carp;
    use Encode;
    use utf8;

    sub unicode_string {
        my ($string, %args) = @_;
        return unless defined $string;

        if( Encode::is_utf8( $string ) ) {
            # string has utf8 flag on
            croak "Not a valid Unicode string: the UTF8 flag is set, but the string doesn't appear to be valid UTF8 data: [$string]"
                if not utf8::valid($string);
            return $string;
        }
        else {
            # Decode a copy, so that $string itself is left untouched
            my $as_utf8 = eval { Encode::decode( 'utf8', my $copy = $string, 1); };

            # If there is a valid UTF8 interpretation that is different from the latin1 interpretation,
            # it can only mean that the string contains UTF8 continuation sequences
            if(!$@ and $as_utf8 ne $string) {
                warn <<"EOC";
    This string contains valid UTF8 continuation sequences, but Perl thinks it's latin1.
    About one in a zillion latin1 strings can be interpreted as valid UTF8 with
    continuation sequences, and those that do look like garbage.
    Interpreted as latin1, your string is: [$string].
    Interpreted as utf8, your string is: [$as_utf8].
    If the latter is correct, use Encode::decode( 'utf8', \$string ) to convert
    the raw utf8 bytes into a valid Unicode string.
    EOC
            }

            return Encode::decode( 'latin1', $string );
        }
    }

So in sum:

* Put all IO streams in UTF8 mode, unless your IO is not UTF8
* Sprinkle calls to the unicode_string subroutine throughout your code, both to identify problems and to normalize all your strings to Unicode strings
* When unicode_string dies, figure out what your string is by looking at it under many different encodings using the sample code above. Then use Encode::decode to convert your strings to Unicode strings

**#### Part 2 #############################################

*The Case of the Capital A Tilde*

If Unicode in Perl has ever caused you grief, then it is likely that you have often seen, whether you noticed it or not, this shady character, lurking about in the riffraff of munged text:

    Capital A Tilde>> Ã

Now, I've never dealt with text that actually uses a capital A with a squiggly over it on /purpose/ (it's used in Portuguese, Vietnamese, and Kashubian).
So if you are seeing it in your text (likely accompanied by one of his cronies ¡, ©, ³, º, ¨, ¬, ², and ¹, among others), and your text is not Portuguese, then probably:

*Somewhere, somehow, UTF8-encoded text has been mistaken for latin-1*

Latin1 and UTF8 are similar in that they are both supersets of ASCII. In both cases, bytes in the range 0-127 are interpreted just as they are in ASCII. But in UTF8, a byte greater than 127 is always part of a multi-byte sequence: a lead byte acts as an escape, and the continuation bytes that follow determine what the character is. In Latin1 these bytes just represent additional latin characters.

In UTF8, the character é is represented by the byte 0xC3 (195) followed by 0xA9 (169). In Latin1, the bytes 0xC3 and 0xA9 represent the characters 'Ã' and '©', respectively. So if a text editor or terminal that understands latin1 looks at your utf8-encoded text, it will see 'Ã' and '©', not an 'é'.

Most accented European characters are represented by 0xC3 followed by one other byte. So that's why Mr. Ã is popping up all the time.

*Double-Encoded UTF8*

In your battle with Mr. Ã, you may be baffled to find him infiltrating your text even when you are /positive/ that your terminal or text editor is in UTF8 mode. How can this happen? The conclusion can only be that, incredibly, somewhere, something has actually output the /UTF8/ byte sequence for an Ã. But why?

Let's say you have a file containing the word "décor", correctly UTF8-encoded. You open the file, but don't tell Perl that the file is UTF8 by calling binmode. Mistake.
Perl assumes the file contains latin1 text, and reads the file as

    dÃ©cor

The following code snippet illustrates:

    use Carp;    # provides croak

    open my $handle, "<", $filename or croak "Error opening file $filename: $!";
    my $text = <$handle>; # $text contains munged text if $filename was UTF8

Now, say you print the string to STDOUT, but this time STDOUT /is/ in UTF8 mode, meaning that it will encode strings as UTF8 before writing them to the stream:

    binmode(STDOUT, ':utf8');
    print "$text\n";

So Perl knows it must encode the string "dÃ©cor" as UTF8. So it finds the correct UTF8 encodings for Ã and for © and writes these to STDOUT. So if your terminal is in UTF8 mode, it will interpret the UTF8 sequences for Ã and © as, surprise, Ã and ©, and you will see, again

    dÃ©cor

So this is what I mean by double-encoded UTF8. Your original file was (singly) encoded as UTF8, with the é encoded as two bytes. But then these two bytes were treated as two separate characters, and then these were again encoded into UTF8.

So, when you see Mr. Ã, it means that your terminal or text editor is either

* interpreting /correct/ UTF8 text /incorrectly/ (as latin1)
* interpreting /incorrect/ UTF8 (double-encoded) text /correctly/

Now, you may think that, to fix the problem, you should take away the line binmode(STDOUT, ':utf8'). That will, indeed, /appear/ to fix the problem. Perl will read the text as latin-1 for "dÃ©cor", and then write out latin-1 for "dÃ©cor." Then, your terminal will interpret this as a utf8 byte sequence and show you "décor." Things look like they work. Here, what's happening is that your terminal or text editor is:

* interpreting /incorrect/ UTF8 text /incorrectly/

In other words, your Perl program incorrectly interpreted the UTF8 text as latin1, but then your terminal or text editor incorrectly interpreted this latin1 text as UTF8, reversing the first mistake. And so the results look correct.
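The double-encoding sequence described above can be reproduced without any files or terminals. What follows is only a minimal sketch: the helper hex_bytes and the variable names are made up for illustration, and it assumes the Encode module that ships with Perl 5.8 and up. It prints bytes as hex, so the result does not depend on how your terminal is configured.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

# "décor" as a Unicode string; \x{e9} is é.
my $word = "d\x{e9}cor";

# Correct, singly encoded UTF8: é becomes the two bytes c3 a9.
my $utf8_bytes = encode('UTF-8', $word);

# The mistake: the UTF8 bytes are read as if they were latin1,
# so c3 and a9 become the two separate characters Ã and ©.
my $misread = decode('latin1', $utf8_bytes);

# Writing that string to a UTF8 stream encodes Ã and © separately.
my $double_encoded = encode('UTF-8', $misread);

print 'single: ', hex_bytes($utf8_bytes), "\n";      # 64 c3 a9 63 6f 72
print 'double: ', hex_bytes($double_encoded), "\n";  # 64 c3 83 c2 a9 63 6f 72

# Undoing the steps in reverse order recovers the original word.
my $repaired = decode('UTF-8', encode('latin1', decode('UTF-8', $double_encoded)));
print "repaired matches original\n" if $repaired eq $word;
```

The single é ends up as four bytes (c3 83 c2 a9) in the double-encoded output, which is the signature to look for when inspecting suspect data in a hex dump.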
This is a beastly situation, and this sort of thing is happening in Perl programs around the world right now. Things look fine, but people don't realize all the bugs that are lurking: Perl can process UTF8 text as if it were latin1, as long as you don't do anything that requires Perl to actually understand the /semantics/ of the text, such as finding the length of a string, or sorting, or capitalizing. These require Perl to treat a multi-byte character as one character, not several, and will fail unless you've told Perl that your data is UTF8.

So, if you are lucky, in my next post I'll give you some pointers on how to track the lifeline of text in your Perl program and make sure that it's always correctly encoded, and that Perl is always treating it as UTF8, and has not somehow "forgotten" and started treating it as latin1, which is quite common.

But for now, some things to remember:

- The byte representing Ã in latin1 is equal to the UTF8 escape byte for most accented latin characters.
- Double-encoded UTF8 happens when UTF8 bytes are treated as latin1, and this garbled text is then encoded as UTF8.
- Perl (from 5.6 on) represents strings internally either using latin1 or UTF8, and uses an internal flag to designate which one. There are various reasons that the UTF8 flag might not get set, or might get unset.
- Encode::decode('utf8') will effectively turn on the UTF8 flag, by interpreting a sequence of raw UTF8 bytes (which by default are interpreted as latin1) as UTF8, and producing a new string with the UTF8 flag on.
- Perl IO streams treat bytes as Latin-1 (not UTF8 and not ASCII), unless you change the mode using binmode.

Monday January 2, 2006 - 10:00pm (PST)
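The earlier point about length and capitalization can be checked with a few lines. This is a sketch under stated assumptions rather than a recipe: the variable names are invented for illustration, and the raw bytes are written out by hand to stand in for data read from a file or socket that was never put in UTF8 mode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF8 bytes for "décor", standing in for undecoded input.
my $bytes = "d\xc3\xa9cor";

# The same data decoded into a proper Unicode string.
my $text = decode('UTF-8', $bytes);

# Without decoding, Perl counts the two bytes of é separately.
printf "length of raw bytes:      %d\n", length($bytes);   # 6
printf "length of Unicode string: %d\n", length($text);    # 5

# On the decoded string, uc() turns é (U+00E9) into É (U+00C9).
my $second_char = substr(uc($text), 1, 1);
printf "second character after uc: U+%04X\n", ord($second_char);   # U+00C9
```

The raw byte string reports a length of 6 because Perl has no way of knowing that c3 a9 belongs together; only the decoded string gives the 5 characters a reader would count.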