 
PerlMonks  


Your problem most likely is that Perl doesn't know that $final_content is encoded in UTF-8, because the content has never been decoded.

Compare the following 4 cases.  Let's say you have the character Ω (Omega), Unicode number U+03A9. The UTF-8 encoding of this character is the two bytes CE A9.  Let's also assume you have a terminal that expects characters to be encoded in UTF-8.

Case 1:

    my $text = "\xCE\xA9";                    # Omega, UTF-8 encoded
    print $text;                              # prints OK
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;                           # wrong: C3 8E C2 A9

This is what you have (presumably).  Perl doesn't know it is (or should be) handling an Omega, because it's never been told the two bytes CE A9 are supposed to represent an Omega.

Thus, it treats them as two separate bytes when printing to the terminal. The terminal sees CE A9 and, as it expects text to be encoded in UTF-8, renders them correctly as an Omega.

Not so, however, when you print to the file handle, which you've declared to be ":utf8". Here, Perl assumes the two bytes are two characters encoded in Latin-1 (the default assumption), and encodes them into UTF-8, producing the junk C3 8E (= 'Î'), and C2 A9 (= '©'), instead of the correct UTF-8 encoding for Omega, which would be CE A9.
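If you want to see the double encoding for yourself without creating a file, here is a minimal sketch. It assumes an in-memory filehandle (a scalar reference passed to open) is an acceptable stand-in for "myfile"; the hex_bytes helper and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: format a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join " ", map { sprintf("%02X", ord($_)) } split(//, $bytes);
}

my $text = "\xCE\xA9";    # two bytes that were never decoded

# Write through a ":utf8" layer into an in-memory buffer instead of a file.
my $buffer = "";
open my $fh, ">:utf8", \$buffer or die $!;
print $fh $text;
close $fh or die $!;

# Each of the two input bytes has been encoded as a separate character.
my $written = hex_bytes($buffer);
print "$written\n";
```

The buffer should end up holding four bytes rather than two, which is the junk described above.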

Case 2:

    use Encode;
    my $text = decode("UTF-8", "\xCE\xA9");
    print $text;                              # wrong: "Wide character in print at..."
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;                           # OK

Here, we're telling Perl the input is UTF-8 encoded, by decoding it. So, Perl treats it as one character (Omega), and prints it correctly to the file. However, we've forgotten to tell Perl that the terminal expects UTF-8, so it warns "Wide character in print".  With that fixed, we get:

Case 3:

    use Encode;
    my $text = decode("UTF-8", "\xCE\xA9");
    binmode STDOUT, ":utf8";
    print $text;                              # OK
    open FH, ">:utf8", "myfile" or die $!;
    print FH $text;                           # OK

That's how everything is supposed to be — no errors or warnings.

But there's another one:

Case 4:

    my $text = "\xCE\xA9";
    print $text;                              # OK
    open FH, ">", "myfile" or die $!;         # no PerlIO encoding layer
    print FH $text;                           # OK

This also renders correctly in the terminal, and produces the right content in the file.  However, although this appears to be correct, it isn't, at least not if you want to treat the content as text. For example, if you wanted to match against Omega (i.e. \x{03A9})

    print "is Omega" if $text =~ /\x{03A9}/;  # doesn't match!

it wouldn't work, because Perl here (in case 4) internally handles two separate bytes, instead of one character.  The same line of code would work fine in case 3.
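The difference between bytes and characters is easy to check directly. This is just a sketch comparing the undecoded string from case 4 with the decoded one from case 3; the variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xCE\xA9";                 # case 4: raw bytes, never decoded
my $chars = decode("UTF-8", $bytes);    # case 3: one decoded character

my $bytes_length = length $bytes;       # counts bytes
my $chars_length = length $chars;       # counts characters
my $bytes_match  = $bytes =~ /\x{03A9}/ ? 1 : 0;
my $chars_match  = $chars =~ /\x{03A9}/ ? 1 : 0;

print "bytes: length=$bytes_length match=$bytes_match\n";   # length=2 match=0
print "chars: length=$chars_length match=$chars_match\n";   # length=1 match=1
```

The same applies to anything else that works on characters, such as length, substr, or regex character classes.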

Note that although I'm using Encode's decode() routine in the examples, there are several other ways to decode data.  E.g., when reading from a file, you'd normally use a PerlIO layer with open, such as "<:encoding(UTF-8)".

(See the node "UTF8 related proof of concept exploit released at T-DOSE" for why you should use "<:encoding(UTF-8)", and not "<:utf8", as an input layer.)
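For completeness, here is a rough sketch of decoding on input with a PerlIO layer. It assumes a scratch file created with File::Temp stands in for your real input file; the file contents and variable names are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Set up a scratch file holding the raw UTF-8 bytes for Omega (CE A9).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                       # write raw bytes, no encoding layer
print $out "\xCE\xA9";
close $out or die $!;

# Decode while reading: the layer turns the two bytes into one character.
open my $in, "<:encoding(UTF-8)", $path or die $!;
my $text = do { local $/; <$in> };  # slurp the whole file
close $in or die $!;

my $length   = length $text;
my $is_omega = $text eq "\x{03A9}" ? 1 : 0;
print "length=$length is_omega=$is_omega\n";   # length=1 is_omega=1
```

With the layer in place there is no need for a separate decode() call; the string is already character data when it arrives.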


In reply to Re: utf8 writing by Eliya
in thread utf8 writing by lingaraj
