Your problem most likely is that Perl doesn't know that $final_content is encoded in UTF-8, because the content has never been decoded.

Compare the following 4 cases.  Let's say you have the character Ω (Omega), Unicode number U+03A9. The UTF-8 encoding of this character is the two bytes CE A9.  Let's also assume you have a terminal that expects characters to be encoded in UTF-8.

Case 1:

    my $text = "\xCE\xA9";   # Omega, UTF-8 encoded
    print $text;             # prints OK

    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;          # wrong: C3 8E C2 A9

This is what you have (presumably).  Perl doesn't know it is (or should be) handling an Omega, because it's never been told the two bytes CE A9 are supposed to represent an Omega.

Thus, it treats the string as two separate bytes when printing to the terminal. The terminal sees CE A9 and, as it expects text to be encoded in UTF-8, renders them correctly as an Omega.

Not so, however, when you print to the file handle, which you've declared to be ":utf8". Here, Perl assumes the two bytes are two characters encoded in Latin-1 (the default assumption), and encodes them into UTF-8, producing the junk C3 8E (= 'Î'), and C2 A9 (= '©'), instead of the correct UTF-8 encoding for Omega, which would be CE A9.
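
To see the double encoding without touching a file, here is a minimal sketch. It assumes that calling Encode's encode() on the undecoded string is a fair stand-in for what the ":utf8" output layer does; hex_bytes is a helper name I made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a string's bytes as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $raw = "\xCE\xA9";                  # never decoded: two bytes

# Each byte is taken as a Latin-1 character and encoded to UTF-8.
my $double = encode("UTF-8", $raw);
print hex_bytes($double), "\n";        # C3 8E C2 A9

# Decoding first yields one character, which encodes back to CE A9.
my $omega  = decode("UTF-8", $raw);
my $single = encode("UTF-8", $omega);
print hex_bytes($single), "\n";        # CE A9
```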

Case 2:

    use Encode;
    my $text = decode("UTF-8", "\xCE\xA9");
    print $text;             # wrong: "Wide character in print at..."

    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;          # OK

Here, we're telling Perl the input is UTF-8 encoded, by decoding it. So, Perl treats it as one character (Omega), and prints it correctly to the file. However, we've forgotten to tell Perl that the terminal expects UTF-8, so it warns "Wide character in print".  With that fixed, we get

Case 3:

    use Encode;
    my $text = decode("UTF-8", "\xCE\xA9");
    binmode STDOUT, ":utf8";
    print $text;             # OK

    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;          # OK

That's how everything is supposed to be — no errors or warnings.

But there's another one:

Case 4:

    my $text = "\xCE\xA9";
    print $text;             # OK

    open FH, ">", "myfile" or die $!;   # no PerlIO encoding layer
    print FH $text;          # OK

This also renders correctly in the terminal, and produces the right content in the file.  However, although this appears to be correct, it isn't, at least not if you want to treat the content as text. For example, if you wanted to match against Omega (i.e. \x{03A9})

print "is Omega" if $text =~ /\x{03A9}/; # doesn't match!

it wouldn't work, because Perl here (in case 4) internally handles two separate bytes, instead of one character.  The same line of code would work fine in case 3.
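
A small side-by-side sketch of that difference, using the same two bytes; the variable names are just for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xCE\xA9";                  # case 4: never decoded
my $chars = decode("UTF-8", $bytes);     # case 3: decoded

print length($bytes), "\n";              # 2 (two bytes)
print length($chars), "\n";              # 1 (one character)

# Only the decoded string contains the character U+03A9.
print "bytes match\n" if $bytes =~ /\x{03A9}/;   # prints nothing
print "chars match\n" if $chars =~ /\x{03A9}/;   # prints "chars match"
```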

Note that although I'm using Encode's decode() routine in the examples, there are several other ways to decode data.  E.g., when reading from a file, you'd normally use a PerlIO layer with open, such as "<:encoding(UTF-8)".

(See "UTF8 related proof of concept exploit released at T-DOSE" for why you should use "<:encoding(UTF-8)" rather than "<:utf8" as an input layer: the latter doesn't check that the input is valid UTF-8.)
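
For completeness, a sketch of decoding on input with a PerlIO layer instead of decode(). The scratch file created with File::Temp is only there to make the example self-contained; in real code you'd open your own file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the two raw bytes CE A9 to a scratch file, no encoding layer.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xCE\xA9";
close $tmp or die $!;

# Read it back through a decoding layer: the data arrives as characters.
open my $in, "<:encoding(UTF-8)", $path or die $!;
my $text = <$in>;
close $in;

print length($text), "\n";                   # 1
print "is Omega\n" if $text eq "\x{03A9}";   # prints "is Omega"
```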

In reply to Re: utf8 writing by Eliya
in thread utf8 writing by lingaraj
