PerlMonks  

Chicanery Needed to Handle Unicode Text on Microsoft Windows

by Jim (Curate)
on Oct 30, 2010 at 06:18 UTC (#868428=perlquestion)
Jim has asked for the wisdom of the Perl Monks concerning the following question:

It likely takes a new Perl programmer a long time to learn how to read and write a Unicode UTF-16 file. The following script would not be obvious to someone just beginning to learn Perl.

    #!perl
    use strict;
    use warnings;

    open my $input_fh, '<:encoding(UTF-16LE)', 'Input.txt';
    open my $output_fh, '>:encoding(UTF-16LE)', 'Output.txt';

    while (my $line = <$input_fh>) {
        chomp $line;
        print $output_fh "$line\n";
    }

Imagine how glorious the neophyte would feel once he had finally figured out the right way to handle Unicode text in Perl after having slogged through the nearly impenetrable Perl Unicode documentation for hours. Then imagine how frustrating it would be for him to run the script and realize it doesn't work. It creates a badly broken and unusable text file (on Microsoft Windows, at least).

But our neophyte is patient and persistent. He Googles for help, and after several more hours of painstaking research and experimentation, he determines the following script works.

    #!perl
    use strict;
    use warnings;

    open my $input_fh, '<:raw:perlio:encoding(UTF-16LE):crlf', 'Input.txt';
    open my $output_fh, '>:raw:perlio:encoding(UTF-16LE):crlf', 'Output.txt';

    while (my $line = <$input_fh>) {
        chomp $line;
        print $output_fh "$line\n";
    }

The chicanery needed just to read and write a Unicode file on Windows using Perl is absurd. It's much too arcane.

Can someone explain how this sequence of PerlIO layers works? Why must so many layers be used? Can these layers be specified using the open pragma? If so, how? If not, why not? And why has this ancient Perl bug still not been fixed in version 5.12.2?

In fact, the second, more elaborate version of the script is still wrong. The file named Input.txt has a byte order mark in it, so its encoding is actually UTF-16, not UTF-16LE. It seems there's no way to generate a UTF-16 file in little-endian byte order directly. To generate such a file, you have to specify the UTF-16LE CES (which is wrong) and add the byte order mark explicitly to make it UTF-16 instead of UTF-16LE.

    #!perl
    use strict;
    use warnings;
    use charnames qw( :full );

    open my $input_fh, '<:raw:perlio:encoding(UTF-16):crlf', 'Input.txt';
    open my $output_fh, '>:raw:perlio:encoding(UTF-16LE):crlf', 'Output.txt';

    print $output_fh "\N{BYTE ORDER MARK}";

    while (my $line = <$input_fh>) {
        chomp $line;
        print $output_fh "$line\n";
    }
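
The BOM behaviour described above can be checked in memory, without touching any files. This is only a sketch using the core Encode module; the sample text and variable names are illustrative and not part of the original scripts.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = 'A';    # U+0041

# "UTF-16" with no endianness: Encode prepends a BOM and writes big-endian.
my $utf16 = encode('UTF-16', $text);

# "UTF-16LE": little-endian, and no BOM is written.
my $utf16le = encode('UTF-16LE', $text);

# Little-endian with a BOM: put U+FEFF in the text, then encode as UTF-16LE.
my $utf16le_bom = encode('UTF-16LE', "\x{FEFF}$text");

printf "%-20s %s\n", 'UTF-16',         unpack('H*', $utf16);          # feff0041
printf "%-20s %s\n", 'UTF-16LE',       unpack('H*', $utf16le);        # 4100
printf "%-20s %s\n", 'UTF-16LE + BOM', unpack('H*', $utf16le_bom);    # fffe4100
```

The third form is the in-memory equivalent of printing "\N{BYTE ORDER MARK}" to a handle opened with :encoding(UTF-16LE).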

Re: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by Anonymous Monk on Oct 30, 2010 at 07:51 UTC
    He Googles for help

    utf16le site:perlmonks.org
    UTF-16 on WinXP written by Perl shows whitespaces.
    crlf mess in unicode utf-16le

    Can someone explain how this sequence of PerlIO layers works?

    See PerlIO

    Why must so many layers be used?

    Because of the defaults, see PerlIO

    Can these layers be specified using the open pragma? If so, how? If not, why not?

    This should work:

    use open qw' IO :raw:perlio:encoding(UTF-16LE):crlf ';

    but apparently the open pragma is broken and doesn't accept the same things as binmode/open.

    And why has this ancient Perl bug still not been fixed in 5.12.2?

    I'm not a perl5-porter so I'm not sure, but it doesn't look like a bug exactly, and nobody's come up with a better way, or reported a bug (that I could find).

    It seems there's no way to generate a UTF-16 file in little-endian byte order directly. To generate such a file, you have to specify the UTF-16LE CES (which is wrong) and add the byte order mark explicitly to make it UTF-16 instead of UTF-16LE.

    maybe :encoding(UTF-16LE):via(File::BOM)

      This thread is refreshing to read!!! As a Windows user that is somewhat new to Perl, I spent the past few hours trying to figure out why one of my supplied 193 xml files would keep outputting as a bunch of Chinese (?) characters. Jim described exactly what I kept trying.

      I finished my script. Everything else works - it does all my replaces beautifully. I have maybe spent 8 hours total on my script and it will save me about 3 days of work.

      But, for now, I have to go to that specific XML file, open it in Notepad, and save it as 'ANSI' instead of 'Unicode' before my script will work right.

      I have tried adding the use ' $string' supplied in this thread, but I get this error:

      Unknown PerlIO layer 'raw:perlio:encoding(UTF-16LE):crlf:utf8'

      I really would like to create re-usable code out of my script, but I have yet to find the answer.

        I have tried adding the use ' $string' supplied in this thread, but I get this error:

        Which perl version do you have?

        open it in Notepad, and save it as 'ANSI' instead of 'Unicode' before my script will work right.

        You probably shouldn't do that :) save as UTF-8 instead

        iconv -f UTF-16 -t UTF-8 < in > out

        piconv -f UTF-16LE -t UTF-8 < in > out

Re: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by kcott (Abbot) on Oct 30, 2010 at 09:15 UTC

    As far as I can see, the only reason you need :crlf is because you've specifically added the UNIX line ending (\n) to your output. It would be better to use the platform-independent $/. The :raw layer should preserve the line endings. So that reduces the chicanery somewhat.

    Except for ASCII files, binmode($file_handle) was required on MSWin32 systems. :raw performs the same function so, while perhaps appearing to add to the chicanery, it certainly reduces the amount of code.

    I don't have sufficient knowledge of UTF-16 to address that aspect of your post. What I would suggest is that, after removing :crlf and changing \n to $/, you try your test code without :perlio. You may still need it but it wouldn't hurt to check.

    I agree there's a lot of Unicode-related documentation; however, everything I've made reference to is available here: PerlIO.

    I ran a series of tests, click on Read more... to view.

    -- Ken

      Um, the purpose of crlf is so that you can use \n and it will do the appropriate thing for your platform -- \n is portable.

        See response to Jim (below).

        -- Ken

      As far as I can see, the only reason you need :crlf is because you've specifically added the UNIX line ending (\n) to your output.

      :crlf is needed here to get the same platform-independent line-ending handling of plain text files Perl has always supported. Without it, the line-ending handling is badly broken. Half of the line-ending character pair CRLF is missed.

          D:\>cat Demo.pl
          #!perl
          use strict;
          use warnings;

          open my $input_fh, '<:raw:perlio:encoding(UTF-16LE)', 'Input.txt';

          while (my $line = <$input_fh>) {
              chomp $line;
              print "There's an unexpected/unwanted CR at the end of the line\n"
                  if $line =~ m/\r$/;
          }

          D:\>file Input.txt
          Input.txt: Text file, Unicode little endian format

          D:\>cat Input.txt
          We the People of the United States, in Order to form a more perfect
          Union, establish Justice, insure domestic Tranquility, provide for the
          common defence, promote the general Welfare, and secure the Blessings of
          Liberty to ourselves and our Posterity, do ordain and establish this
          Constitution for the United States of America.

          D:\>perl Demo.pl Input.txt
          There's an unexpected/unwanted CR at the end of the line
          There's an unexpected/unwanted CR at the end of the line
          There's an unexpected/unwanted CR at the end of the line
          There's an unexpected/unwanted CR at the end of the line
          There's an unexpected/unwanted CR at the end of the line

          D:\>
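
      The same effect can be reproduced without a file or a Windows machine. The following is a minimal sketch that decodes a hand-written UTF-16LE byte string with the core Encode module; the sample octets are an assumption standing in for one line of Input.txt.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-16LE octets for "We\r\n", standing in for one line of a Windows text file.
my $octets = "W\0e\0\r\0\n\0";

# Decoding alone performs no CRLF translation.
my $line = decode('UTF-16LE', $octets);

chomp $line;    # removes only the trailing "\n"
my $has_cr = $line =~ /\r\z/ ? 1 : 0;

# Stripping the CR by hand is roughly the job of a :crlf layer placed after the decoder.
(my $clean = $line) =~ s/\r\z//;

print "CR left after chomp: $has_cr\n";                                     # 1
print "length before/after: ", length($line), "/", length($clean), "\n";    # 3/2
```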

      And as Anonymous Monk has already pointed out, \n is the express mechanism in Perl intended to make line-ending handling platform-independent. It is defined not to mean the LF-only Unix line-ending, but rather to mean whatever the line-ending character or character combination terminates lines of plain text files on the platform in use.

      It would be better to use the platform-independent $/.

      No it wouldn't. And even if it were better, how would someone new to Perl ever figure that out? I've been programming Perl for years and I've never once seen $/ used in place of the usual and ordinary \n. chomp()-ing and "...\n"-ing are the long-lived and ubiquitous standard idioms.

          #!perl
          print "Hello, world\n";

      Except for ASCII files, binmode($file_handle) was required on MSWin32 systems. :raw performs the same function so, while perhaps appearing to add to the chicanery, it certainly reduces the amount of code.

      But this is the whole point. The file named Input.txt is not a binary file; it's a plain text file. All the Unicode files I want to manipulate on Microsoft Windows using Perl, the text-processing scripting language, are plain text files. binmode() and :raw are lies. Chicanery.

      In my humble opinion, this should work on a Unicode UTF-16 file with a byte order mark.

      #!perl use strict; use warnings; open my $input_fh, '<', 'Input.txt'; open my $output_fh, '>', 'Output.txt'; while (my $line = <$input_fh>) { chomp $line; print $output_fh "$line\n"; }

      It seems perfectly reasonable to me to expect the scripting language to determine the character encoding of the file all by its little lonesome (it only has to read the first two bytes of the file) and just do the right thing.
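
      Reading the first two bytes and picking an encoding from them can be sketched as follows. The helper guess_utf16_encoding is hypothetical, written only for illustration; it is not part of Perl or of any module mentioned in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: map a leading UTF-16 byte order mark to an Encode name.
sub guess_utf16_encoding {
    my ($head) = @_;    # the first bytes of the file, read in binary mode
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return;             # no UTF-16 BOM found
}

for my $head ("\xFF\xFEW\0", "\xFE\xFF\0W", "We") {
    my $guess = guess_utf16_encoding($head);
    printf "%-10s => %s\n", unpack('H*', $head), defined $guess ? $guess : 'unknown';
}
```

      A check like this is only a heuristic: a UTF-32LE byte order mark begins with the same two bytes (FF FE), and a file without a BOM says nothing about its encoding this way.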

        The documentation on :raw says that CRLF conversion is turned off. It appears that \n in the print statement is represented as CRLF before the arguments to print enter the output stream so \n can be used as normal.

        Changing $/ to \n in my tests (not surprisingly) produces the same results.

        -- Ken

Re: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by Anonymous Monk on Oct 30, 2010 at 10:55 UTC
Re: Chicanery Needed to Handle Unicode Text on Microsoft Windows
by ikegami (Pope) on Oct 30, 2010 at 17:58 UTC

    Can someone explain how this sequence of PerlIO layers works?

    The default is

    :perlio:crlf

    If you add an encoding layer it becomes

    :perlio:crlf:encoding(UTF-16le)

    The order is backwards. CRLF processing is done before decoding on read and after encoding on write. Buggy! The following is desired:

    :perlio:encoding(UTF-16le):crlf

    :raw cleans the slate, allowing you to get the desired order.
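
    Why the order matters can be seen at the byte level. The sketch below does not use real PerlIO layers; it simulates the two orders on write with the core Encode module and plain substitutions, an assumption made so that it behaves the same on any platform.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $line = "a\n";

# Desired order on write: translate LF to CRLF on characters, then encode.
(my $crlf_text = $line) =~ s/\n/\r\n/g;
my $good = encode('UTF-16LE', $crlf_text);

# Default order on write: encode first, then translate LF bytes to CRLF.
my $bad = encode('UTF-16LE', $line);
$bad =~ s/\x0A/\x0D\x0A/g;

print "crlf, then encode: ", unpack('H*', $good), "\n";    # 61000d000a00
print "encode, then crlf: ", unpack('H*', $bad),  "\n";    # 61000d0a00
```

    In the second result a single 0d byte has been inserted into a stream of two-byte code units, so every unit after it is misaligned. That matches the broken output and the stray "Chinese" characters reported in this thread.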

      Thank you, ++ikegami. I understand your explanation just enough to trust that

      :raw:perlio:encoding(UTF-16LE):crlf

      is the best, right way to handle Unicode text in Perl on Windows.

      Should one use the same layers in the same order for both input and output? Also, do you know why it doesn't work with the open pragma?

      I think you and others understand the point I'm making. If your text file is 40 years old and not EBCDIC, then it's ASCII, and writing a Perl script to handle it is easy. You're not forced to think about the character encoding of the text at all. But if you created the text file just now using Microsoft Notepad, writing a Perl script to do anything useful with the text in the file is beyond the capabilities of a neophyte Perl programmer. No one new to the language could arrive at this exceedingly arcane solution to the problem of handling a simple Unicode text file by reading any of the Perl documentation, especially PerlIO, or any books about the language. (PerlIO is incomprehensible to anyone who doesn't already know everything it documents.)

      UPDATE: The expert Perl programmers addressing this same problem at Stack Overflow never arrived at the correct solution proffered here.

        ...to trust that
        :raw:perlio:encoding(UTF-16LE):crlf
        is the best, right way to handle Unicode text in Perl on Windows.
        For older versions of Perl (<= 5.8.8), you'd need an additional :utf8 layer at the end, i.e.

        :raw:perlio:encoding(UTF-16LE):crlf:utf8
        (although this isn't needed with newer versions, it doesn't do any harm either)

        Without it, the strings would end up without the utf8 flag set (upon reading), which means that Perl wouldn't treat them as text/unicode strings in regex comparisons, etc., as it should. Similarly for writing.

            $ hd Input.txt
            00000000  ff fe e4 00 62 00 63 00  0d 00 0a 00              |...b.c.....|

            #!/usr/bin/perl -w
            use strict;
            use Devel::Peek;

            open my $input_fh, '<:raw:perlio:encoding(UTF-16):crlf', 'Input.txt';
            my $line = <$input_fh>;
            chomp $line;
            Dump $line;

        5.8.8 output (wrong):

            SV = PV(0x69ae70) at 0x605000
              REFCNT = 1
              FLAGS = (PADBUSY,PADMY,POK,pPOK)
              PV = 0x6778e0 "\303\244bc"\0
              CUR = 4
              LEN = 80

        Output with newer versions (correct):

            SV = PV(0x750cb8) at 0x777cc8
              REFCNT = 1
              FLAGS = (PADMY,POK,pPOK,UTF8)
              PV = 0x86a070 "\303\244bc"\0 [UTF8 "\x{e4}bc"]
              CUR = 4
              LEN = 80

        This seems to be the only thing that's been fixed in the meantime.
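
        The difference between the two dumps can also be observed without Devel::Peek. This is a sketch only, using the core Encode module and the built-in utf8::is_utf8 on hand-written octets; it imitates the unflagged 5.8.8 result by encoding to UTF-8 rather than by running an old perl.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xE4\x00b\x00c\x00";    # UTF-16LE octets for "\x{E4}bc", no BOM

my $chars = decode('UTF-16LE', $octets);    # character string, UTF8 flag set
my $bytes = encode('UTF-8', $chars);        # same internal bytes, flag not set

printf "chars: length %d, utf8 flag %d\n", length($chars), utf8::is_utf8($chars) ? 1 : 0;    # 3, 1
printf "bytes: length %d, utf8 flag %d\n", length($bytes), utf8::is_utf8($bytes) ? 1 : 0;    # 4, 0
```

        Both strings hold the four bytes shown as "\303\244bc" in the dumps, but only the flagged one is treated as three characters.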

        I think this only goes to prove your point that this is way too arcane for mere mortals... And, even though there is a "solution" to the issue, the current behavior of the :crlf layer is definitely a bug, IMHO. For one, it violates the principle of least surprise. Instead, the following straightforward approach (as anyone in his right mind would glean from the existing documentation) should work:

        open my $fh, '<:encoding(UTF-16LE)', ...

        Should one use the same layers in the same order for both input and output?

        They are processed from the file handle out when reading, and in the opposite direction when writing.

        Also, do you know why it doesn't work with the open pragma?

        Maybe it does the equivalent of binmode, and binmode doesn't remove the existing layers. (:raw simply ends up disabling the crlf layer, then :crlf reenables the existing layer rather than adding a new layer.)

      I see a problem with testing
          $ perl -le " binmode STDERR, q!:encoding(UTF-16le)!; print join q! !, PerlIO::get_layers( STDERR , details => 1)
          unix 18895360 crlf 13193728 encoding UTF-16LE 13144576

          $ perl -le " print join q! !, PerlIO::get_layers( STDERR , details => 1)
          unix 18895360 crlf 13193728

          $ perl -le " binmode STDERR; print join q! !, PerlIO::get_layers( STDERR , details => 1)
          unix 18895360 crlf 13177344

          $ perl -le " binmode STDERR, q!:encoding(UTF-16le)!; print join q! !, PerlIO::get_layers( STDERR , details => 1)
          unix 18895360 crlf 13193728 encoding UTF-16LE 13144576

          $ perl -le " binmode STDERR, q!:raw:perlio:encoding(UTF-16le):crlf!; print join q! !, PerlIO::get_layers( STDERR , details => 1)
          unix 18895360 crlf 13193728 perlio 13111808 encoding UTF-16LE 13144576

          $
      So there is a bug somewhere

Node Type: perlquestion [id://868428]
Approved by Old_Gray_Bear