Hi everyone,

I need to read and write UCS-2 Unicode files on Windows. I thought that specifying the appropriate PerlIO layer with open or binmode would suffice. However, this naive approach doesn't seem to work.

For example, trying to write a UCS-2LE file like this (which is supposed to create two lines, each containing the Unicode code point U+8765, preceded by the byte order mark U+FEFF):

    my $filename = "test.ucs2le";
    open my $fh, ">:encoding(ucs-2le)", $filename
        or die "Cannot open $filename for writing: $!";
    print $fh "\x{feff}\x{8765}\n\x{8765}\n";
    close $fh;

produces an incorrectly encoded file on Windows (it works fine on Unix). The output file displays as garbage in Unicode-capable editors like Notepad, and produces "UCS-2LE:Partial character ..." warnings when you try to read the file back in from Perl.

Inspecting the hex dump of the file (e.g. with "od -tx1 -An test.ucs2le" on Unix/Cygwin),

ff fe 65 87 0d 0a 00 65 87 0d 0a 00

shows that each newline character \n (0a in hex) has been replaced by \r\n (0d 0a in hex). That is more or less as expected, except that with the 2-byte wide UCS-2 encoding, the 16-bit value 000a should have been turned into 000d 000a. IOW, the proper UCS-2LE encoding would have been:

ff fe 65 87 0d 00 0a 00 65 87 0d 00 0a 00

Looking at the PerlIO layer stack that is in effect when specifying :encoding(ucs-2le) reveals that the crlf layer (the Windows-specific default) is applied after the UCS-2LE layer has turned characters into 2-byte values:

    my $filename = "test.ucs2le";
    open my $fh, ">:encoding(ucs-2le)", $filename or die;
    my @layers = PerlIO::get_layers($fh);
    print "@layers\n";


unix crlf encoding(UCS-2LE) utf8

(Note that, when writing, layers are applied from right to left, while when reading, they are applied from left to right. IOW, the left-hand side of the layer stack as shown corresponds to the external side (the file), and the right-hand side is the Perl-internal data representation.)
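The effect of the layer order can be reproduced on any platform without file I/O. The following is only a sketch: it does not use the real crlf layer, but imitates it with a plain substitution, once on the already encoded bytes (the broken order) and once on the characters before encoding (the intended order). The helper name hex_dump is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_dump { return join " ", unpack("(H2)*", $_[0]) }

my $text = "\x{feff}\x{8765}\n\x{8765}\n";

# Broken order: encode first, then translate newlines byte by byte
# (an imitation of the crlf layer sitting below the encoding layer).
my $broken = encode("UCS-2LE", $text);
$broken =~ s/\x0a/\x0d\x0a/g;

# Intended order: translate newlines on characters, then encode.
(my $crlf_text = $text) =~ s/\n/\r\n/g;
my $correct = encode("UCS-2LE", $crlf_text);

print hex_dump($broken),  "\n";   # ff fe 65 87 0d 0a 00 65 87 0d 0a 00
print hex_dump($correct), "\n";   # ff fe 65 87 0d 00 0a 00 65 87 0d 00 0a 00
```

The first line matches the corrupted dump from the Windows file, the second the proper UCS-2LE encoding.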

Trying to find a workaround, I've been fiddling with this for quite a while. Finally, I came up with the following layer stack, which seems to do the trick:

    my $filename = "test.ucs2le";
    open my $fh, ">:raw:encoding(ucs-2le):crlf:utf8", $filename
        or die "Cannot open $filename for writing: $!";
    print $fh "\x{feff}\x{8765}\n\x{8765}\n";
    close $fh;

The :raw:encoding(ucs-2le):crlf:utf8 specification results in the following layers:

unix encoding(UCS-2LE) utf8 crlf utf8

:raw removes the initial default crlf layer, :encoding(ucs-2le) adds the desired UCS-2 layer plus an automatically appended utf8, :crlf puts the crlf layer in its proper position (so that it is applied before the conversion to 2-byte values happens), and the final :utf8 adds another utf8 layer. The latter is required because the crlf layer apparently removes the UTF8-ness, without which Unicode data would not be handled properly.

Although the duplicated utf8 layer doesn't seem to cause any problems, I'm not entirely sure whether it is always completely free of side effects. (I haven't found a way to get rid of the first utf8 ... trying :pop to remove it is futile, as this also pops encoding(ucs-2le).)

The same layers are needed for reading UCS-2 data, of course. In this case, the crlf conversion (i.e. \r\n --> \n) has to work on single-byte values, i.e. after the data has passed through the UCS-2 layer. Otherwise, the crlf layer would not detect the \r\n sequences, and we'd be left with an extraneous \r char at the end of every line (in which case chomp, with its default $/="\n", would no longer work as intended; and all kinds of other potential problems...).
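To check the read direction, a small round trip along the following lines might help. This is a sketch under assumptions, not a tested recipe: it reuses the layer stack from above for both directions and writes to a scratch file created with File::Temp (the scratch file and the variable names are only for illustration). The BOM is stripped by hand after reading, since the UCS-2LE layer passes it through as an ordinary character.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Layer stack from the post, used for both writing and reading.
my $layers = ":raw:encoding(ucs-2le):crlf:utf8";

# Scratch file for the experiment, removed at program exit.
my ($tmp_fh, $filename) = tempfile(SUFFIX => ".ucs2le", UNLINK => 1);
close $tmp_fh;

open my $out, ">$layers", $filename
    or die "Cannot open $filename for writing: $!";
print {$out} "\x{feff}\x{8765}\n\x{8765}\n";
close $out or die "Cannot close $filename: $!";

open my $in, "<$layers", $filename
    or die "Cannot open $filename for reading: $!";
my @lines = <$in>;
close $in;

chomp @lines;                  # only removes "\n"; a stray "\r" would survive
$lines[0] =~ s/^\x{feff}//;    # drop the byte order mark

# Show the code points of each line.
for my $line (@lines) {
    print join(" ", map { sprintf "U+%04X", ord } split //, $line), "\n";
}
```

If the crlf layer sits in the right position, each line should consist of the single code point U+8765 with no trailing U+000D.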

OK, so far so good. OTOH, is it only me thinking this is somewhat too involved for the average programmer looking for an easy, straightforward way to handle UCS-2 data? Is there a less cumbersome way to achieve the same effect? Or is this even a bug, or a known but as yet unresolved issue?

This isn't UCS-2 specific, BTW. Any encoding whose smallest unit is more than one byte (like UTF-16, UTF-32) should pose similar problems...
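As a quick illustration of why the other wide encodings are affected, the following sketch prints the bytes that a single \n turns into under a few encodings known to the Encode module. In each case the 0a byte is accompanied by one or more 00 bytes, so a crlf translation that works on the encoded bytes cannot insert a correctly padded \r. The helper name newline_bytes is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex bytes of "\n" in the given encoding.
sub newline_bytes {
    my ($encoding) = @_;
    return join " ", unpack("(H2)*", encode($encoding, "\n"));
}

for my $encoding (qw(UCS-2LE UTF-16LE UTF-16BE UTF-32LE)) {
    printf "%-9s %s\n", $encoding, newline_bytes($encoding);
}
```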


In reply to PerlIO: crlf layer on Windows interfering with UCS-2 unicode by almut
