Re^3: Chicanery Needed to Handle Unicode Text on Microsoft Windows

by Anonymous Monk
on Oct 30, 2010 at 22:29 UTC


in reply to Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows
in thread Chicanery Needed to Handle Unicode Text on Microsoft Windows

...to trust that
:raw:perlio:encoding(UTF-16LE):crlf
is the best, right way to handle Unicode text in Perl on Windows.
For older versions of Perl (<= 5.8.8), you'd need an additional :utf8 layer at the end, i.e.

:raw:perlio:encoding(UTF-16LE):crlf:utf8
(although this isn't needed with newer versions, it doesn't do any harm either)

Without it, the strings would end up without the utf8 flag set (upon reading), which means that Perl wouldn't treat them as text/unicode strings in regex comparisons, etc., as it should. Similarly for writing.
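A minimal sketch of what the missing flag means in practice. It assumes the unicode_strings feature is not enabled, and it uses Encode::decode only as a stand-in for what the layer stack does on a newer Perl; the variable names are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The four octets shown in the 5.8.8 dump: UTF-8 for a-umlaut, b, c,
# held in a string whose utf8 flag is off.
my $octets = "\xc3\xa4bc";

# The same data decoded into a character string; the utf8 flag is on.
my $chars = decode('UTF-8', $octets);

for my $pair ([octets => $octets], [chars => $chars]) {
    my ($name, $string) = @$pair;
    printf "%-6s length=%d flag=%-3s all-word-chars=%s\n",
        $name,
        length $string,
        (utf8::is_utf8($string) ? 'on' : 'off'),
        ($string =~ /^\w+$/ ? 'yes' : 'no');
}
# octets length=4 flag=off all-word-chars=no
# chars  length=3 flag=on  all-word-chars=yes
```

The unflagged string is treated as four separate bytes, so length and \w both give answers that are wrong for text.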

$ hd Input.txt
00000000 ff fe e4 00 62 00 63 00 0d 00 0a 00 |...b.c.....|

#!/usr/bin/perl -w
use strict;
use Devel::Peek;

open my $input_fh, '<:raw:perlio:encoding(UTF-16):crlf', 'Input.txt';
my $line = <$input_fh>;
chomp $line;
Dump $line;

5.8.8 output (wrong):

SV = PV(0x69ae70) at 0x605000
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x6778e0 "\303\244bc"\0
  CUR = 4
  LEN = 80

Output with newer versions (correct):

SV = PV(0x750cb8) at 0x777cc8
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x86a070 "\303\244bc"\0 [UTF8 "\x{e4}bc"]
  CUR = 4
  LEN = 80

This seems to be the only thing that's been fixed in the meantime.

I think this only goes to prove your point that this is way too arcane for mere mortals... And, even though there is a "solution" to the issue, the current behavior of the :crlf layer is definitely a bug, IMHO. For one, it violates the principle of least surprise. Instead, the following straightforward approach (as anyone sane in his mind would glean from the existing documentation) should work:

open my $fh, '<:encoding(UTF-16LE)', ...
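For comparison, the longer layer stack from the top of this post can be exercised with a small round trip. This is only a sketch: it assumes a Perl newer than 5.8.8 (so the trailing :utf8 is left off), and the temporary file and variable names are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Layer stack under discussion; append ":utf8" for Perl <= 5.8.8.
my $layers = ':raw:perlio:encoding(UTF-16LE):crlf';

# Scratch file that is removed when the program exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write one line of text: a-umlaut, b, c, newline.
open my $out, ">$layers", $path or die "Can't write $path: $!\n";
print {$out} "\x{e4}bc\n";
close $out or die "Can't close $path: $!\n";

# Read the file back as raw octets to see what actually hit the disk.
open my $raw, '<:raw', $path or die "Can't read $path: $!\n";
my $octets = do { local $/; <$raw> };
close $raw;
my $hex = join ' ', map { sprintf '%02x', ord } split //, $octets;
print "$hex\n";    # should print: e4 00 62 00 63 00 0d 00 0a 00

# Read it back as text through the same layers.
open my $in, "<$layers", $path or die "Can't read $path: $!\n";
my $line = <$in>;
close $in;
chomp $line;
printf "%d characters\n", length $line;    # should print: 3 characters
```

Note that UTF-16LE writes no byte order mark, so the output lacks the ff fe prefix seen in Input.txt.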


Re^4: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by Jim (Curate) on Oct 31, 2010 at 18:10 UTC
    For older versions of Perl (<= 5.8.8), you'd need an additional :utf8 layer at the end, i.e.

    :raw:perlio:encoding(UTF-16LE):crlf:utf8

    (although this isn't needed with newer versions, it doesn't do any harm either)

    So do the cognoscenti of the Perl community agree then? The canonical workaround to the Perl UTF-16-on-Windows defect is to use the following sequence of layers in the three-argument form of open for both input (<) and output (>).

    :raw:perlio:encoding(UTF-16LE):crlf:utf8

    Thus.

    open my $input_fh, '<:raw:perlio:encoding(UTF-16LE):crlf:utf8', $input_file
        or die "Can't open input file $input_file: $OS_ERROR\n";
    open my $output_fh, '>:raw:perlio:encoding(UTF-16LE):crlf:utf8', $output_file
        or die "Can't open output file $output_file: $OS_ERROR\n";

    I think this only goes to prove your point that this is way too arcane for mere mortals... And, even though there is a "solution" to the issue, the current behavior of the :crlf layer is definitely a bug, IMHO. For one, it violates the principle of least surprise. Instead, the following straightforward approach (as anyone sane in his mind would glean from the existing documentation) should work:

    open my $fh, '<:encoding(UTF-16LE)', ...

    Thank you! That's all I'm saying.

      Nope, you're wrong, Jim. It's badly broken; to call it arcane is flattery.
