http://www.perlmonks.org?node_id=868489


in reply to Chicanery Needed to Handle Unicode Text on Microsoft Windows

Can someone explain how this sequence of PerlIO layers works?

The default is

:perlio:crlf

If you add an encoding layer, it becomes

:perlio:crlf:encoding(UTF-16le)

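The order is backwards: CRLF processing is done before decoding on read and after encoding on write. In UTF-16LE a line ending is the four bytes 0D 00 0A 00, so a :crlf layer working on the raw bytes never sees an adjacent 0D 0A pair. That's a bug. The following is desired: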

:perlio:encoding(UTF-16le):crlf

:raw cleans the slate, allowing you to get the desired order.
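
To make the effect of the layer order visible, here is a minimal sketch that writes "a\n" through two different layer strings and dumps the bytes that reach the file. The helper name bytes_written and the use of File::Temp are my own choices for illustration, not something from the thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layer string and
# return the bytes that land on disk, as space-separated hex.
sub bytes_written {
    my ($layers, $text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open my $out, ">$layers", $path or die "Can't write $path: $!";
    print {$out} $text;
    close $out or die "Can't close $path: $!";

    # Read the file back with no translation at all.
    open my $in, '<:raw', $path or die "Can't read $path: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# Desired order: :crlf sits above the encoding layer, so it sees characters.
print bytes_written(':raw:perlio:encoding(UTF-16LE):crlf', "a\n"), "\n";
# 61 00 0d 00 0a 00

# Backwards order: :crlf sits below the encoding layer, so it sees bytes.
print bytes_written(':raw:perlio:crlf:encoding(UTF-16LE)', "a\n"), "\n";
```

With the backwards order I would expect a lone 0d byte to be inserted in front of the 0a, leaving an odd number of bytes that is no longer valid UTF-16LE.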

Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by Jim (Curate) on Oct 30, 2010 at 19:07 UTC

    Thank you, ++ikegami. I understand your explanation just enough to trust that

    :raw:perlio:encoding(UTF-16LE):crlf

    is the best, right way to handle Unicode text in Perl on Windows.

    Should one use the same layers in the same order for both input and output? Also, do you know why it doesn't work with the open pragma?

    I think you and others understand the point I'm making. If your text file is 40 years old and not EBCDIC, then it's ASCII, and writing a Perl script to handle it is easy. You're not forced to think about the character encoding of the text at all. But if you created the text file just now using Microsoft Notepad, writing a Perl script to do anything useful with the text in the file is beyond the capabilities of a neophyte Perl programmer. No one new to the language could arrive at this exceedingly arcane solution to the problem of handling a simple Unicode text file by reading any of the Perl documentation, especially PerlIO, or any books about the language. (PerlIO is incomprehensible to anyone who doesn't already know everything it documents.)

    UPDATE: The expert Perl programmers addressing this same problem at Stack Overflow never arrived at the correct solution proffered here.

      ...to trust that
      :raw:perlio:encoding(UTF-16LE):crlf
      is the best, right way to handle Unicode text in Perl on Windows.
      For older versions of Perl (<= 5.8.8), you'd need an additional :utf8 layer at the end, i.e.

      :raw:perlio:encoding(UTF-16LE):crlf:utf8
      (although this isn't needed with newer versions, it doesn't do any harm either)

      Without it, the strings would end up without the utf8 flag set (upon reading), which means that Perl wouldn't treat them as text/Unicode strings in regex comparisons, etc., as it should. Similarly for writing.

      $ hd Input.txt
      00000000  ff fe e4 00 62 00 63 00  0d 00 0a 00   |..ä.b.c.....|

      #!/usr/bin/perl -w
      use strict;
      use Devel::Peek;
      open my $input_fh, '<:raw:perlio:encoding(UTF-16):crlf', 'Input.txt';
      my $line = <$input_fh>;
      chomp $line;
      Dump $line;

      5.8.8 output (wrong):

      SV = PV(0x69ae70) at 0x605000
        REFCNT = 1
        FLAGS = (PADBUSY,PADMY,POK,pPOK)
        PV = 0x6778e0 "\303\244bc"\0
        CUR = 4
        LEN = 80

      Output with newer versions (correct):

      SV = PV(0x750cb8) at 0x777cc8
        REFCNT = 1
        FLAGS = (PADMY,POK,pPOK,UTF8)
        PV = 0x86a070 "\303\244bc"\0 [UTF8 "\x{e4}bc"]
        CUR = 4
        LEN = 80

      This seems to be the only thing that's been fixed in the meantime.
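
For anyone who wants to check their own Perl without reading a Devel::Peek dump, a lighter test is sketched below. It recreates the 12 bytes from the hex dump in a temporary file (File::Temp is my addition) and reports the character length and the utf8 flag of the line read back. I am assuming a Perl newer than 5.8.8 here; on older versions the result should differ, as described above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The bytes from the hex dump: BOM, "äbc", CR, LF, all in UTF-16LE.
my $raw = "\xff\xfe\xe4\x00b\x00c\x00\x0d\x00\x0a\x00";

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;                      # write the bytes untouched
print {$tmp_fh} $raw;
close $tmp_fh or die "Can't close $path: $!";

open my $in, '<:raw:perlio:encoding(UTF-16):crlf', $path
    or die "Can't open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# Three characters and the flag on means the line was decoded as text.
printf "length=%d utf8_flag=%s\n",
    length($line), (utf8::is_utf8($line) ? 'on' : 'off');
```

A length of 3 with the flag on corresponds to the "correct" dump; a length of 4 with the flag off corresponds to the 5.8.8 one.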

      I think this only goes to prove your point that this is way too arcane for mere mortals... And, even though there is a "solution" to the issue, the current behavior of the :crlf layer is definitely a bug, IMHO. For one, it violates the principle of least surprise. Instead, the following straightforward approach (as anyone sane in his mind would glean from the existing documentation) should work:

      open my $fh, '<:encoding(UTF-16LE)', ...
        For older versions of Perl (<= 5.8.8), you'd need an additional :utf8 layer at the end, i.e. :raw:perlio:encoding(UTF-16LE):crlf:utf8 (although this isn't needed with newer versions, it doesn't do any harm either)

        So do the cognoscenti of the Perl community agree then? The canonical workaround to the Perl UTF-16-on-Windows defect is to use the following sequence of layers in the three-argument form of open for both input (<) and output (>).

        :raw:perlio:encoding(UTF-16LE):crlf:utf8

        Thus.

        open my $input_fh, '<:raw:perlio:encoding(UTF-16LE):crlf:utf8', $input_file
            or die "Can't open input file $input_file: $OS_ERROR\n";
        open my $output_fh, '>:raw:perlio:encoding(UTF-16LE):crlf:utf8', $output_file
            or die "Can't open output file $output_file: $OS_ERROR\n";

        I think this only goes to prove your point that this is way too arcane for mere mortals... And, even though there is a "solution" to the issue, the current behavior of the :crlf layer is definitely a bug, IMHO. For one, it violates the principle of least surprise. Instead, the following straightforward approach (as anyone sane in his mind would glean from the existing documentation) should work: open my $fh, '<:encoding(UTF-16LE)', ...

        Thank you! That's all I'm saying.

      Should one use the same layers in the same order for both input and output?

      They are processed from the file handle out when reading, and in the opposite direction when writing.
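
In practice that means the same layer string can be used for both directions. A small round-trip sketch, assuming a temporary file from File::Temp and a helper name (round_trip) that I made up for the example:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# The same layer string is used for writing and for reading.
my $LAYERS = ':raw:perlio:encoding(UTF-16LE):crlf';

# Hypothetical helper: write the lines, read them back, return them chomped.
sub round_trip {
    my @lines = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # Writing: :crlf turns "\n" into "\r\n", then the text is encoded.
    open my $out, ">$LAYERS", $path or die "Can't write $path: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Can't close $path: $!";

    # Reading: the bytes are decoded first, then "\r\n" becomes "\n".
    open my $in, "<$LAYERS", $path or die "Can't read $path: $!";
    chomp(my @got = <$in>);
    close $in;
    return @got;
}

my @got = round_trip('abc', 'def');
print scalar(@got), " lines: @got\n";   # 2 lines: abc def
```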

      Also, do you know why it doesn't work with the open pragma?

      Maybe it does the equivalent of binmode, and binmode doesn't remove the existing layers. (:raw simply ends up disabling the crlf layer, then :crlf reenables the existing layer rather than adding a new layer.)

Re^2: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by Anonymous Monk on Oct 31, 2010 at 17:16 UTC
    I see a problem with testing
    $ perl -le " binmode STDERR, q!:encoding(UTF-16le)!; print join q! !, PerlIO::get_layers( STDERR , details => 1)
    unix 18895360 crlf 13193728 encoding UTF-16LE 13144576
    $ perl -le " print join q! !, PerlIO::get_layers( STDERR , details => 1)
    unix 18895360 crlf 13193728
    $ perl -le " binmode STDERR; print join q! !, PerlIO::get_layers( STDERR , details => 1)
    unix 18895360 crlf 13177344
    $ perl -le " binmode STDERR, q!:encoding(UTF-16le)!; print join q! !, PerlIO::get_layers( STDERR , details => 1)
    unix 18895360 crlf 13193728 encoding UTF-16LE 13144576
    $ perl -le " binmode STDERR, q!:raw:perlio:encoding(UTF-16le):crlf!; print join q! !, PerlIO::get_layers( STDERR , details => 1)
    unix 18895360 crlf 13193728 perlio 13111808 encoding UTF-16LE 13144576
    $
    So there is a bug somewhere
        Gah! Jim is wrong, that's not Chicanery, that's brainfuck squared