in reply to Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows
in thread Chicanery Needed to Handle Unicode Text on Microsoft Windows
...to trust that

    :raw:perlio:encoding(UTF-16LE):crlf

is the best, right way to handle Unicode text in Perl on Windows.

For older versions of Perl (<= 5.8.8), you'd need an additional :utf8 layer at the end, i.e.

    :raw:perlio:encoding(UTF-16LE):crlf:utf8

(although this isn't needed with newer versions, it doesn't do any harm either).
Without it, the strings would end up without the utf8 flag set (upon reading), which means that Perl wouldn't treat them as text/unicode strings in regex comparisons, etc., as it should. Similarly for writing.
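To make the effect of the flag concrete, here is a minimal sketch that does not involve any file I/O. It assumes the Encode module (core since 5.8) and uses a hand-built byte string as a stand-in for what an unflagged read would give you; the variable names are mine, purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "äbc" as plain bytes: no utf8 flag, so Perl sees
# four separate characters (0xC3, 0xA4, 'b', 'c').
my $bytes = "\xC3\xA4bc";

# The same data decoded into a character string: utf8 flag set, three characters.
my $text = decode('UTF-8', $bytes);

for my $pair ([bytes => $bytes], [text => $text]) {
    my ($name, $str) = @$pair;
    printf "%-5s length=%d flag=%s three word chars: %s\n",
        $name,
        length($str),
        utf8::is_utf8($str) ? 'on' : 'off',
        $str =~ /^\w{3}$/ ? 'yes' : 'no';
}
```

Only the decoded string is treated as the three-character word it actually is; the unflagged one is handled as four bytes, which is exactly the kind of mismatch that bites you in regex comparisons.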
    $ hd Input.txt
    00000000  ff fe e4 00 62 00 63 00  0d 00 0a 00  |..ä.b.c.....|

    #!/usr/bin/perl -w
    use strict;
    use Devel::Peek;

    open my $input_fh, '<:raw:perlio:encoding(UTF-16):crlf', 'Input.txt';
    my $line = <$input_fh>;
    chomp $line;
    Dump $line;

5.8.8 output (wrong):

    SV = PV(0x69ae70) at 0x605000
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x6778e0 "\303\244bc"\0
      CUR = 4
      LEN = 80

Output with newer versions (correct):

    SV = PV(0x750cb8) at 0x777cc8
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x86a070 "\303\244bc"\0 [UTF8 "\x{e4}bc"]
      CUR = 4
      LEN = 80
This seems to be the only thing that's been fixed in the meantime.
I think this only goes to prove your point that this is way too arcane for mere mortals... And, even though there is a "solution" to the issue, the current behavior of the :crlf layer is definitely a bug, IMHO. For one, it violates the principle of least surprise. Instead, the following straightforward approach (which is what anyone in their right mind would glean from the existing documentation) should work:
open my $fh, '<:encoding(UTF-16LE)', ...
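Until the straightforward form behaves that way, one option is to hide the explicit layer stack behind a small helper so the incantation lives in one place. This is only a sketch: open_utf16le is a hypothetical name of my own, it assumes a newer Perl where the trailing :utf8 is not needed, and it round-trips through a temporary file rather than a real Input.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of any module): open a UTF-16LE text file
# with the explicit layer stack discussed above. $mode is '<' or '>'.
sub open_utf16le {
    my ($mode, $path) = @_;
    open my $fh, "$mode:raw:perlio:encoding(UTF-16LE):crlf", $path
        or die "Cannot open '$path': $!";
    return $fh;
}

# Scratch file for the round trip; removed automatically at exit.
my (undef, $path) = tempfile(UNLINK => 1);

# Write one line; the :crlf layer turns "\n" into CR LF before encoding.
my $out = open_utf16le('>', $path);
print {$out} "\x{e4}bc\n";
close $out or die "Cannot close '$path': $!";

# Read it back through the same stack.
my $in = open_utf16le('<', $path);
my $line = <$in>;
close $in;
chomp $line;

printf "characters read: %d, utf8 flag: %s, bytes on disk: %d\n",
    length($line), utf8::is_utf8($line) ? 'on' : 'off', -s $path;
```

Note that UTF-16LE neither writes nor expects a BOM, so for input like the Input.txt above (which starts with ff fe) you would read with encoding(UTF-16) instead, as in the test script.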