almut has asked for the wisdom of the Perl Monks concerning the following question:
Hi all,
I would like to read UTF-16 files which occasionally contain
malformed data, in particular malformed surrogate pairs
(for example, the "high surrogate" DBF4 not followed by the expected
"low surrogate" in the range DC00 - DFFF).
The straightforward approach of opening the file with
open my $fh, "<:encoding(UTF-16)", "somefile.utf16le" or die "...: $!";
unfortunately croaks with the error
UTF-16:Malformed LO surrogate dbf4 at ...
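The pairing rule behind that error can be sketched in a few lines. This is only an illustration of what "well-formed" means here; the helper names (is_high_surrogate, is_low_surrogate, is_valid_pair) are hypothetical and not part of Encode or PerlIO.

```perl
use strict;
use warnings;

# Hypothetical helpers (illustrative names, not from any module).
sub is_high_surrogate { my ($u) = @_; return $u >= 0xD800 && $u <= 0xDBFF }
sub is_low_surrogate  { my ($u) = @_; return $u >= 0xDC00 && $u <= 0xDFFF }

# A pair is well-formed only if a high surrogate is immediately
# followed by a low surrogate.
sub is_valid_pair {
    my ( $first, $second ) = @_;
    return is_high_surrogate($first) && is_low_surrogate($second);
}

# 'v*' unpacks unsigned 16-bit little-endian values, i.e. UTF-16LE code units.
# Here: DBF4 followed by 'd' (0064), as in the problem data.
my ( $hi, $next ) = unpack 'v*', "\xF4\xDBd\x00";
printf "%04X %04X valid pair: %s\n", $hi, $next,
    is_valid_pair( $hi, $next ) ? 'yes' : 'no';
```

For the DBF4/0064 sequence this reports that the pair is not valid, which is the condition the decoder is complaining about.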
As I gleaned from the PerlIO::encoding docs, one solution might be to set $PerlIO::encoding::fallback = Encode::FB_DEFAULT, so that the PerlIO layer adopts the default behavior of Encode::decode(), which is to replace a malformed character with the (valid) code point U+FFFD. For example:
use Encode;
use PerlIO::encoding;

$PerlIO::encoding::fallback = Encode::FB_DEFAULT; # BTW, why is FB_DEFAULT not the default?

my $malformed = "a\x00b\x00c\x00\xF4\xDBd\x00e\x00f\x00\n\x00"
              . "g\x00h\x00i\x00\n\x00";
# = "abc<some junk>def\nghi\n" in UTF-16LE

open my $fh, "<:encoding(UTF-16LE)", \$malformed or die $!;
while ( my $u = <$fh> ) {
    print $u;
}
close $fh;
Although this does work to some degree (i.e. it no longer croaks and does in fact translate malformed characters to FFFD), there are some irritating behaviors.
Most importantly, the above while loop becomes an endless loop, i.e. it starts reading from the beginning of the file after having reached its end. In other words, the output produced is
abc?ef
ghi
abc?ef
ghi
... (repeated ad infinitum)
(This seems to be a bug, unless there's something specifically wrong with my perls. I tried it with 5.10.0 and 5.8.8, both x86_64-linux.)
Some other secondary issues are:
- The malformed character substitution swallows the subsequent character (the 'd' in the above example). In other words, I would like the first line to be abc?def ('?' being the replacement char). I figure this makes sense with UTF-16, as the 'd' is taken to be the low surrogate of the invalid surrogate pair, with the whole pair being rendered into one replacement char. So I also tried UCS-2 in place of UTF-16, in the hope that it, being a fixed two-byte encoding, would not exhibit this behavior... However, there is no difference from UTF-16 in this regard.
- If the last line doesn't end with a line feed, its content is somehow prepended to the previous line, producing the incorrect string "ghiabc?ef".
- With Perl 5.8.8, setting $PerlIO::encoding::fallback as shown above only works with UCS-2LE. With UTF-16LE and UTF-16 (correct BOM in place for the latter), this PerlIO setting doesn't seem to have any effect, i.e. it still complains about "malformed LO surrogate". With UCS-2, it doesn't produce any output at all, but just hangs with 100% CPU load, eating up more and more memory. With Perl 5.10.0, OTOH, setting $PerlIO::encoding::fallback does work with all four encodings. (Unfortunately, I would need to have it working with 5.8.8 and UTF-16... as well as preferably with a BOM, as the file in question is a Windows registry dump, which does have a BOM.)
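To pin down the desired result (abc?def, with the 'd' intact), here is a rough sketch that works on the raw bytes. It is the manual approach the question hopes to avoid, so it only serves to show the expected output and is not a PerlIO-layer fix. The helper name repair_utf16le is hypothetical, and the sketch assumes little-endian input without a BOM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace every unpaired surrogate code unit in a
# UTF-16LE byte string with U+FFFD, leaving its neighbours untouched.
sub repair_utf16le {
    my ($bytes) = @_;
    my @units = unpack 'v*', $bytes;    # 16-bit little-endian code units
    my @out;
    while (@units) {
        my $u = shift @units;
        if (   $u >= 0xD800 && $u <= 0xDBFF
            && @units
            && $units[0] >= 0xDC00 && $units[0] <= 0xDFFF )
        {
            push @out, $u, shift @units;    # valid high+low pair: keep both
        }
        elsif ( $u >= 0xD800 && $u <= 0xDFFF ) {
            push @out, 0xFFFD;              # lone surrogate: substitute
        }
        else {
            push @out, $u;                  # ordinary BMP code unit
        }
    }
    return pack 'v*', @out;
}

my $malformed = "a\x00b\x00c\x00\xF4\xDBd\x00e\x00f\x00\n\x00";
my $text = decode( 'UTF-16LE', repair_utf16le($malformed) );

binmode STDOUT, ':encoding(UTF-8)';
print $text;
```

Because only the offending 16-bit unit is replaced, the following 'd' survives; the decoder then sees well-formed input and has nothing to complain about.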
BTW, I'm using the fancy open-stringref form just for purposes of easy demo. In real life I'm opening a regular file. Same behavior.
Any ideas how to get this working, preferably without resorting to reading the file in binary and doing the decoding myself (which I'd like to avoid for various reasons)? Bonus points if the 'd' doesn't get swallowed... :) Thanks!
Replies are listed 'Best First'.
Re: Handling malformed UTF-16 data with PerlIO layer
by ikegami (Patriarch) on Oct 27, 2008 at 21:49 UTC
by almut (Canon) on Oct 27, 2008 at 22:38 UTC
by ikegami (Patriarch) on Oct 28, 2008 at 00:24 UTC
by almut (Canon) on Oct 28, 2008 at 01:02 UTC
by graff (Chancellor) on Oct 28, 2008 at 06:33 UTC
by ikegami (Patriarch) on Oct 28, 2008 at 02:34 UTC
Re: Handling malformed UTF-16 data with PerlIO layer
by graff (Chancellor) on Oct 28, 2008 at 08:19 UTC