I need to read/write UCS-2 Unicode files on Windows. I thought that specifying the appropriate PerlIO layer with open or binmode would suffice. However, this naive approach doesn't seem to work.
For example, writing a UCS-2LE file through a handle opened with nothing but the :encoding(ucs-2le) layer (the output is supposed to consist of two lines, each containing the Unicode character U+8765) produces an incorrectly encoded file on Windows, although it works fine on Unix.
The output file displays as garbage in Unicode-capable editors like Notepad, and produces "UCS-2LE:Partial character ..." warnings when you try to read the file back in from Perl.
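A minimal sketch of such a naive write, assuming the file name test.ucs2le (borrowed from the od command in this post) and a loop that prints the character twice. It is a reconstruction for illustration, not necessarily the exact code that produced the broken file:

```perl
use strict;
use warnings;

# File name taken from the od example in this post; purely illustrative.
my $file = 'test.ucs2le';

# Naive approach: push only the encoding layer on top of the platform defaults.
open my $out, '>:encoding(UCS-2LE)', $file
    or die "Cannot open $file for writing: $!";
print {$out} "\x{8765}\n" for 1 .. 2;
close $out or die "Cannot close $file: $!";

# A correct CRLF file would be 12 bytes long, a correct LF-only file 8 bytes.
# The broken Windows output described here is 10 bytes long.
printf "%s is %d bytes long\n", $file, -s $file;
```

The odd-looking size on Windows is the first hint that something other than whole 2-byte units ended up in the file.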
Inspecting the hex dump of the file (e.g. with "od -tx1 -An test.ucs2le" on Unix/Cygwin) shows that each newline character \n (0a in hex) has been replaced by \r\n (0d 0a in hex), so every line ends in the bytes 0d 0a 00. That is more or less what you'd expect, except that with the 2-byte wide UCS-2 encoding, the 000a should have been turned into 000d 000a. IOW, in a proper UCS-2LE encoding every line would end in the bytes 0d 00 0a 00.
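The difference between the two byte sequences can be reproduced on any platform without touching a file, by doing the two steps by hand with the Encode module. This is only an emulation of what the layers do; hex_dump is a helper name made up for this sketch:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex, similar to od -tx1.
sub hex_dump { join ' ', map { sprintf '%02x', $_ } unpack 'C*', shift }

my $text = "\x{8765}\n" x 2;

# Wrong order (crlf below the encoding layer on output):
# encode first, then expand LF to CRLF at the byte level.
(my $broken = encode('UCS-2LE', $text)) =~ s/\x0a/\x0d\x0a/g;

# Right order: expand LF to CRLF on characters, then encode.
(my $chars = $text) =~ s/\n/\r\n/g;
my $proper = encode('UCS-2LE', $chars);

print 'broken: ', hex_dump($broken), "\n";  # 65 87 0d 0a 00 65 87 0d 0a 00
print 'proper: ', hex_dump($proper), "\n";  # 65 87 0d 00 0a 00 65 87 0d 00 0a 00
```

The broken variant has an odd number of bytes, which is why a UCS-2 decoder complains about a partial character.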
Looking at the PerlIO layer stack that is in effect when specifying :encoding(ucs-2le) reveals the cause: the crlf layer (the Windows-specific default) sits below the encoding layer, so on output it is applied after the UCS-2LE layer has already turned characters into 2-byte values.
(Note that when writing, layers are applied from right to left, while when reading, they are applied from left to right. IOW, the left-hand side of a layer stack listed in the order used by open and binmode corresponds to the external side (the file), and the right-hand side is the Perl-internal data representation.)
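One way to look at the stack of a handle is PerlIO::get_layers, which returns the layer names in the order open and binmode would use them, file side first. A small sketch, assuming a scratch file named layers-demo.tmp (a made-up name); the exact list depends on the platform and Perl build, so no output is shown here:

```perl
use strict;
use warnings;

my $file = 'layers-demo.tmp';   # hypothetical scratch file
open my $fh, '>:encoding(UCS-2LE)', $file
    or die "Cannot open $file: $!";

# The list runs from the file side (first) to the Perl side (last).
my @layers = PerlIO::get_layers($fh, output => 1);
print "layers: @layers\n";

close $fh or die "Cannot close $file: $!";
unlink $file;
```

On Windows, the point to check is whether crlf appears to the left of the encoding layer.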
Trying to find a workaround, I fiddled with this for quite a while. Finally, I came up with the following layer specification, which seems to do the trick: :raw:encoding(ucs-2le):crlf:utf8
It builds the layer stack as follows:
:raw removes the initial default crlf layer, :encoding(ucs-2le) adds the desired UCS-2 layer plus an automatically appended utf8, :crlf puts the crlf layer in its proper position (such that it is being applied before conversion to 2-byte values happens), and the final :utf8 adds another utf8 layer. The latter is required because the crlf layer apparently is removing the UTF8-ness, without which unicode data would not be handled properly.
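A sketch of writing with this layer specification and then checking the result at the byte level. The file name is again taken from the od example; everything else is illustrative:

```perl
use strict;
use warnings;

my $file = 'test.ucs2le';   # file name borrowed from the od example

# :raw drops the default crlf layer, :crlf is re-added above the encoding,
# and the trailing :utf8 restores the flag that :crlf appears to clear.
open my $out, '>:raw:encoding(UCS-2LE):crlf:utf8', $file
    or die "Cannot open $file for writing: $!";
print {$out} "\x{8765}\n" for 1 .. 2;
close $out or die "Cannot close $file: $!";

# Read the file back as raw bytes and dump it the way od -tx1 would.
open my $in, '<:raw', $file or die "Cannot open $file for reading: $!";
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
print "$hex\n";
```

If the stack behaves as described, the dump should read 65 87 0d 00 0a 00 twice in a row. Because all layers are spelled out explicitly, the result should not depend on the platform default.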
Although the duplicated utf8 layer doesn't seem to cause any problems, I'm not entirely sure whether it'd always be completely free of side effects. (I haven't found a way to get rid of the first utf8: trying :pop to remove it is futile, as this also pops encoding(ucs-2le).)
The same layers are needed for reading UCS-2 data, of course. In this case, the crlf conversion (i.e. \r\n --> \n) has to work on single-byte values, i.e. after the data has passed the UCS-2 filter. Otherwise, the crlf layer would not detect the \r\n sequences, and we'd be left with an extraneous \r char at the end of every line (in which case chomp, with its default $/="\n", would no longer work as intended; and all kinds of other potential problems...).
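For the reading direction, a sketch along the same lines. It first creates a correctly encoded CRLF file byte by byte (so the test data does not depend on any layer), then reads it back through the workaround stack. The file name read-demo.ucs2le is made up:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $file = 'read-demo.ucs2le';   # hypothetical scratch file

# Create a correctly encoded CRLF file, bypassing all translation layers.
open my $raw, '>:raw', $file or die "Cannot open $file: $!";
print {$raw} encode('UCS-2LE', "\x{8765}\r\n" x 2);
close $raw or die "Cannot close $file: $!";

# Read it back: decode UCS-2LE first, then let :crlf fold \r\n into \n.
open my $in, '<:raw:encoding(UCS-2LE):crlf:utf8', $file
    or die "Cannot open $file: $!";
chomp(my @lines = <$in>);
close $in;

# With the \r removed by the crlf layer, chomp leaves a single character.
for my $i (0 .. $#lines) {
    printf "line %d: U+%04X, length %d\n",
        $i + 1, ord($lines[$i]), length($lines[$i]);
}
```

A length of 1 per line indicates that no stray \r survived; with the crlf layer in the wrong position, each line would be expected to keep a trailing \r after chomp.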
OK, so far so good. OTOH, is it only me thinking this is somewhat too involved for the average programmer looking for an easy, straightforward way to handle UCS-2 data? Is there a less cumbersome way to achieve the same effect? Or is this even a bug, or a known but as yet unresolved issue?
This isn't UCS-2 specific, BTW. Any encoding with a minimal character size of more than one byte (like UTF-16, UTF-32) should pose similar problems...
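To illustrate the generalisation, the byte-level emulation can be repeated for a few wider encodings. This sketch again uses Encode by hand rather than real layers; the choice of encodings and the hash names are arbitrary:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Code unit size in bytes for each encoding under test.
my %unit_size = ('UCS-2LE' => 2, 'UTF-16LE' => 2, 'UTF-32LE' => 4);
my %leftover;

for my $enc (sort keys %unit_size) {
    # Emulate crlf running below the encoding layer: encode, then patch bytes.
    (my $bytes = encode($enc, "\x{8765}\n")) =~ s/\x0a/\x0d\x0a/g;

    # A valid stream is a whole number of code units; a remainder means
    # a decoder will run into a partial character.
    $leftover{$enc} = length($bytes) % $unit_size{$enc};
    printf "%-8s %d bytes, %d left over\n",
        $enc, length($bytes), $leftover{$enc};
}
```

In each case a single stray 0d byte is inserted, so the stream is no longer a whole number of code units.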