http://www.perlmonks.org?node_id=889260


in reply to utf8 writing

Your problem most likely is that Perl doesn't know that $final_content is encoded in UTF-8, because the content has never been decoded.

Compare the following 4 cases.  Let's say you have the character Ω (Omega), Unicode number U+03A9. The UTF-8 encoding of this character is the two bytes CE A9.  Let's also assume you have a terminal that expects characters to be encoded in UTF-8.

Case 1:

    my $text = "\xCE\xA9";                    # Omega, UTF-8 encoded
    print $text;                              # prints OK
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;                           # wrong: C3 8E C2 A9

This is what you have (presumably).  Perl doesn't know it is (or should be) handling an Omega, because it's never been told the two bytes CE A9 are supposed to represent an Omega.

Thus, it treats it as two separate bytes when printing them to the terminal. The terminal sees CE A9, and, as it expects text to be encoded in UTF-8, renders them correctly.

Not so, however, when you print to the file handle, which you've declared to be ":utf8". Here, Perl assumes the two bytes are two characters encoded in Latin-1 (the default assumption), and encodes them into UTF-8, producing the junk C3 8E (= 'Î'), and C2 A9 (= '©'), instead of the correct UTF-8 encoding for Omega, which would be CE A9.
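To see the double encoding without opening a real file, here is a minimal sketch that prints through a ":utf8" layer into an in-memory buffer and dumps the resulting bytes as hex. The helper name bytes_written_via_utf8_layer is made up for illustration, and the in-memory handle is only a stand-in for "myfile".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $text through a ":utf8" layer into an
# in-memory buffer (standing in for "myfile") and return the bytes
# that were written, as space-separated uppercase hex.
sub bytes_written_via_utf8_layer {
    my ($text) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die $!;
    print {$fh} $text;
    close $fh or die $!;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

my $raw = "\xCE\xA9";    # two undecoded bytes, as in case 1

# Undecoded: each byte is treated as a Latin-1 character and re-encoded.
print bytes_written_via_utf8_layer($raw), "\n";                    # C3 8E C2 A9

# Decoded first: Perl knows it is one character and writes it correctly.
print bytes_written_via_utf8_layer(decode('UTF-8', $raw)), "\n";   # CE A9
```

The second call already hints at the fix: decode the input before it reaches the encoding layer.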

Case 2:

    use Encode;
    my $text = decode("UTF-8", "\xCE\xA9");
    print $text;                              # wrong: "Wide character in print at..."
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;                           # OK

Here, we're telling Perl the input is UTF-8 encoded, by decoding it. So, Perl treats it as one character (Omega), and prints it correctly to the file. However, we've forgotten to tell Perl that the terminal expects UTF-8, so it warns "Wide character in print" when printing to STDOUT.  With that fixed, we get:

Case 3:

    use Encode;
    my $text = decode("UTF-8", "\xCE\xA9");
    binmode STDOUT, ":utf8";
    print $text;                              # OK
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;                           # OK

That's how everything is supposed to be — no errors or warnings.

But there's another one:

Case 4:

    my $text = "\xCE\xA9";
    print $text;                              # OK
    open FH, ">", "myfile" or die $!;         # no PerlIO encoding layer
    print FH $text;                           # OK

This also renders correctly in the terminal, and produces the right content in the file.  However, although this appears to be correct, it isn't, at least not if you want to treat the content as text. For example, if you wanted to match against Omega (i.e. \x{03A9})

print "is Omega" if $text =~ /\x{03A9}/; # doesn't match!

it wouldn't work, because Perl here (in case 4) internally handles two separate bytes, instead of one character.  The same line of code would work fine in case 3.
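The difference between bytes and characters can be checked directly. The following is a small sketch (the variable names $bytes and $chars are mine) comparing the undecoded string from case 4 with the decoded string from case 3:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xCE\xA9";                 # undecoded, as in case 4
my $chars = decode('UTF-8', $bytes);    # decoded, as in case 3

# Undecoded: Perl sees two separate bytes, and none of them is U+03A9.
printf "bytes: length %d, matches Omega: %s\n",
    length($bytes), ($bytes =~ /\x{03A9}/ ? 'yes' : 'no');   # length 2, no

# Decoded: Perl sees a single character, which is U+03A9.
printf "chars: length %d, matches Omega: %s\n",
    length($chars), ($chars =~ /\x{03A9}/ ? 'yes' : 'no');   # length 1, yes
```

The same applies to anything else that works on characters, such as length, substr, or character classes in a regex.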

Note that although I'm using Encode's decode() routine in the examples, there are several other ways to decode data.  E.g., when reading from a file, you'd normally use a PerlIO layer with open, such as "<:encoding(UTF-8)".
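As a rough sketch of the PerlIO variant: the code below writes the raw bytes CE A9 to a temporary file (created with File::Temp purely as a stand-in for your real input file) and reads them back through "<:encoding(UTF-8)", so the data arrives already decoded.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a temporary file holding the raw UTF-8 bytes for Omega.
# (A stand-in for real input; no encoding layer on the write side.)
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xCE\xA9\n";
close $out or die $!;

# The encoding layer decodes the bytes while reading.
open my $in, '<:encoding(UTF-8)', $path or die $!;
my $line = <$in>;
close $in or die $!;
chomp $line;

# One character, U+03A9, with no explicit decode() call needed.
printf "length %d, code point U+%04X\n", length($line), ord($line);
```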

(See UTF8 related proof of concept exploit released at T-DOSE for why you should use "<:encoding(UTF-8)", and not "<:utf8", as an input layer.)
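To give an idea of what strict validation means, here is a sketch that uses Encode's decode() with the FB_CROAK check, so that malformed input raises an error instead of being passed through. Note that this illustrates strict decoding in general rather than the internals of either PerlIO layer, and the helper name is_well_formed_utf8 is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns 1 if $bytes is well-formed UTF-8, 0 otherwise.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps decode()
# from modifying the source string while checking.
sub is_well_formed_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

print is_well_formed_utf8("\xCE\xA9"), "\n";   # 1: valid encoding of Omega
print is_well_formed_utf8("\xA9"),     "\n";   # 0: continuation byte without a lead byte
```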