<?xml version="1.0" encoding="windows-1252"?>
<node id="868516" title="Re^3: Chicanery Needed to Handle Unicode Text on Microsoft Windows" created="2010-10-30 18:29:36" updated="2010-10-30 18:29:36">
<type id="11">
note</type>
<author id="961">
Anonymous Monk</author>
<data>
<field name="doctext">
&lt;blockquote&gt;
&lt;i&gt;
...to trust that
&lt;c&gt;
:raw:perlio:encoding(UTF-16LE):crlf
&lt;/c&gt;
is the best, right way to handle Unicode text in Perl on Windows.
&lt;/i&gt;
&lt;/blockquote&gt;
For older versions of Perl (&lt;= 5.8.8), you'd need an additional &lt;c&gt;:utf8&lt;/c&gt; layer at the end, i.e.
&lt;p&gt;
&lt;c&gt;
    :raw:perlio:encoding(UTF-16LE):crlf:utf8
&lt;/c&gt;
(although this isn't needed with newer versions, it doesn't do any harm either)
&lt;p&gt;
Without it, strings read from the handle would end up &lt;i&gt;without the utf8 flag&lt;/i&gt; set,
which means that Perl would treat them as byte strings rather than as
Unicode text in regex matches, etc.
The same applies when writing.
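&lt;p&gt;
To make the effect of the missing flag concrete, here is a minimal sketch. It is not part of the original test: the variable names are made up for illustration, and &lt;c&gt;Encode::decode&lt;/c&gt; merely stands in for what a correctly working layer stack would hand back. It compares the raw UTF-8 bytes of "\x{e4}bc" with the decoded character string:
&lt;p&gt;
&lt;c&gt;
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 encoded bytes of "\x{e4}bc", i.e. a string without the utf8 flag
my $bytes = "\303\244bc";

# Decoding yields a character string with the utf8 flag set
my $text = decode('UTF-8', $bytes);

# Without the flag, \303 and \244 are two non-word bytes;
# with it, \x{e4} is a single word character.
my $bytes_ok = $bytes =~ /^\w+$/ ? 'yes' : 'no';
my $text_ok  = $text  =~ /^\w+$/ ? 'yes' : 'no';

printf "bytes: length %d, all word chars: %s\n", length($bytes), $bytes_ok;
printf "text:  length %d, all word chars: %s\n", length($text),  $text_ok;
&lt;/c&gt;
Under the default regex semantics (no locale, no &lt;c&gt;/u&lt;/c&gt; modifier), this should report length 4 and "no" for the byte string, and length 3 and "yes" for the decoded one.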
&lt;p&gt;
&lt;c&gt;
$ hd Input.txt 
00000000  ff fe e4 00 62 00 63 00  0d 00 0a 00    |..ä.b.c.....|
&lt;/c&gt;
&lt;c&gt;
#!/usr/bin/perl -w
use strict;
use Devel::Peek;

open my $input_fh,
    '&lt;:raw:perlio:encoding(UTF-16):crlf', 'Input.txt'
    or die "Cannot open Input.txt: $!";

my $line = &lt;$input_fh&gt;;
chomp $line;
Dump $line;
&lt;/c&gt;
Output with 5.8.8 (wrong, the UTF8 flag is missing):&lt;p&gt;
&lt;c&gt;
SV = PV(0x69ae70) at 0x605000
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x6778e0 "\303\244bc"\0
  CUR = 4
  LEN = 80
&lt;/c&gt;
Output with newer versions (correct): &lt;p&gt;
&lt;c&gt;
SV = PV(0x750cb8) at 0x777cc8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x86a070 "\303\244bc"\0 [UTF8 "\x{e4}bc"]
  CUR = 4
  LEN = 80
&lt;/c&gt;
&lt;p&gt;
The missing utf8 flag seems to be the only thing that's been fixed in the meantime.
&lt;p&gt;
I think this only goes to prove your point that this is way too arcane
for mere mortals... And, even though there is a "solution" to the
issue, the current behavior of the &lt;c&gt;:crlf&lt;/c&gt; layer is definitely a
bug, IMHO. For one, it violates the principle of least surprise.
Instead, the following straightforward approach (which is what anyone in
their right mind would glean from the existing documentation) &lt;i&gt;should&lt;/i&gt; work:
&lt;p&gt;
&lt;c&gt;
open my $fh, '&lt;:encoding(UTF-16LE)', ...
&lt;/c&gt;
</field>
<field name="root_node">
868428</field>
<field name="parent_node">
868498</field>
<field name="reputation">
5</field>
</data>
</node>
