Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Handling malformed UTF-16 data with PerlIO layer

by almut (Canon)
on Oct 27, 2008 at 20:56 UTC ( #719833=perlquestion: print w/ replies, xml ) Need Help??
almut has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
I would like to read UTF-16 files which occasionally contain malformed data, in particular malformed surrogate pairs (such as for example DBF4 (a "high surrogate") not followed by an expected "low surrogate" in the range DC00 - DFFF).

The straightforward approach of opening the file with

open my $fh, "<:encoding(UTF-16)", "somefile.utf16le" or die "...: $!" +;

unfortunately croaks with the error

UTF-16:Malformed LO surrogate dbf4 at ...

As I gleaned from the PerlIO::encoding docs, one solution might be to set $PerlIO::encoding::fallback = Encode::FB_DEFAULT in order to make the PerlIO layer assume the default behavior of the routine Encode::decode(), which is to replace the malformed character with the (valid) code point U+FFFD.  For example

use Encode; use PerlIO::encoding; $PerlIO::encoding::fallback = Encode::FB_DEFAULT; # BTW, why is FB_DEFAULT not the default +? my $malformed = "a\x00b\x00c\x00\xF4\xDBd\x00e\x00f\x00\n\x00" . "g\x00h\x00i\x00\n\x00"; # = "abc<some junk>def\nghi\n" in UTF-16LE open my $fh, "<:encoding(UTF-16LE)", \$malformed or die $!; while ( my $u = <$fh> ) { print $u; } close $fh;

Although this does work to some degree (i.e. it no longer croaks and does in fact translate malformed characters to FFFD), there are some irritating behaviors.

Most importantly, the above while loop becomes an endless loop, i.e. it starts reading from the beginning of the file after having reached its end. In other words, the output produced is

abc?ef ghi abc?ef ghi ... (repeated ad infinitum)

(This seems to be a bug, unless there's something specifically wrong with my perls — tried it with 5.10.0 and 5.8.8, both x86_64-linux.)

Some other secondary issues are:

  • The malformed character substitution swallows the subsequent character (the 'd' in the above example). In other words, I would like the first line to be abc?def ('?' being the replacement char).  I figure this makes sense with UTF-16, as the 'd' is taken to be the low-surrogate of the invalid surrogate pair, with the whole pair being rendered into one replacement char. So I also tried UCS-2 in place of UTF-16, in the hope that it - being a fixed two-byte encoding - would not exhibit this behavior... However, there is no difference to UTF-16 in this regard.

  • If the last line doesn't end with a line feed, its content is somehow prepended to the previous line, producing the incorrect string "ghiabc?ef".

  • With Perl 5.8.8, setting $PerlIO::encoding::fallback as shown above does only work with UCS-2LE. With UTF-16LE and UTF-16 (correct BOM in place for the latter), this PerlIO setting doesn't seem to have any effect, i.e. it still complains about "malformed LO surrogate". With UCS-2, it doesn't produce any output at all, but just hangs with 100% CPU load, eating up more and more memory. With Perl 5.10.0, OTOH, setting $PerlIO::encoding::fallback does work with all four encodings. (Unfortunately, I would need to have it working with 5.8.8 and UTF-16... as well as preferably with a BOM, as the file in question is a Windows registry dump, which does have a BOM.)

BTW, I'm using the fancy open-stringref form just for purposes of easy demo. In real life I'm opening a regular file. Same behavior.

Any ideas how to get this working, preferably without resorting to reading the file in binary and doing the decoding myself (which I'd like to avoid for various reasons)? Bonus points, if the 'd' doesn't get swallowed... :)   Thanks!

Comment on Handling malformed UTF-16 data with PerlIO layer
Select or Download Code
Re: Handling malformed UTF-16 data with PerlIO layer
by ikegami (Pope) on Oct 27, 2008 at 21:49 UTC
    Why don't you fix the bad files instead of having your program handle them?
    #!/usr/bin/perl # usage: # rem_surrogate.pl < infile > outfile use strict; use warnings; binmode STDIN; # Disable :crlf binmode STDOUT; # Disable :crlf my $read_size = 16*1024; my $buf = ''; for (;;) { my $rv = read(STDIN, $buf, $read_size, length($buf)); die("$!\n") if !defined($rv); last if !$rv; $_ = substr($buf, 0, int(length($buf)/2)*2, ''); s/\G(.)(?:[\xD8-\xDF]|(.))/ defined($2) ? $1.$2 : "\xFD\xFF" /esg; print; } print("\xFD\xFF") if length($buf);
      Why don't you fix the bad files instead of having your program handle them?

      ...mostly because I'd rather avoid having to get down to the encoding nitty-gritties, if there is some 'proper' way of doing it with Perl's built-in encoding support.  For example, the ad-hoc approach you've shown would also replace valid surrogate pairs, which I'd rather keep, if possible (just in case). Sure, the regex could presumably be fixed to handle this (using lookahead), but this would be kind of reinventing the wheel...  OTOH, it looks like the best workaround for the issue so far — So, thanks!

        Lookahead alone won't do because the pair might be cut into two reads. It does make things more complicated.

        I don't know anything about surrogates. I assumed the following:

        • hi followed by lo = ok
        • hi not followed by lo = bad
        • lo not preceeded by hi = bad
        #!/usr/bin/perl # usage: # fix_surrogates.pl < infile > outfile # Hi Surrogate: D800-DBFF # Lo Surrogate: DC00-DFFF use strict; use warnings; binmode STDIN; # Disable :crlf binmode STDOUT; # Disable :crlf my $read_size = 16*1024; my $valid_pat = qr/ .[^\xD8-\xDF] | .[\xD8-\xDB].[\xDC-\xDF] /xs; my $invalid_pat = qr/ .[\xDC-\xDF] | .[\xD8-\xDB](?=.[^\xDC-\xDF]) /xs; my $ibuf = ''; my $obuf = ''; for (;;) { my $rv = read(STDIN, $ibuf, $read_size, length($ibuf)); die("$!\n") if !defined($rv); last if !$rv; for ($ibuf) { /\G ($valid_pat+) /xgc && do { $obuf .= $1; }; /\G $invalid_pat /xgc && do { $obuf .= "\xFD\xFF"; redo }; } print($obuf); $ibuf = substr($ibuf, pos($ibuf)||0); $obuf = ''; } $ibuf =~ s/..?/\xFD\xFF/sg; print($ibuf);

        Update: Tested. Fixed character class that wasn't negated as it should have been.

        >type testdata.pl binmode STDOUT; my $hi = "\xF4\xDB"; my $lo = "\xE2\xDE"; print "a\0" . $hi . $lo . "b\0" . "\n\0", "c\0" . $hi . "c\0" . "d\0" . "\n\0", "e\0" . $lo . "f\0" . "g\0" . "\n\0"; >perl testdata.pl | perl fix_surrogates.pl | perl -0777 -pe"BEGIN { bi +nmode STDIN, ':encoding(UTF-16le)'; binmode STDOUT, ':encoding(US-ASC +II)' }" "\x{10d2e2}" does not map to ascii, <> chunk 1. "\x{fffd}" does not map to ascii, <> chunk 1. "\x{fffd}" does not map to ascii, <> chunk 1. a\x{10d2e2}b c\x{fffd}cd e\x{fffd}fg
Re: Handling malformed UTF-16 data with PerlIO layer
by graff (Chancellor) on Oct 28, 2008 at 08:19 UTC
    I was hoping that a test like this would point the way to a decent solution, but having tried it on 5.8.8, it doesn't show the results I would want:
    use strict; use Encode; binmode STDOUT, ":utf8"; my %test_sets = ( normal => [ 0x40 .. 0x7f ], # normal ascii range of 64 characte +rs # puta some surrogate data on a record boundary: oksplit => [ 0x40 .. 0x5e, 0xd801, 0xdc01, 0x61 .. 0x7f ], # goo +d surrog. pair danglehi => [ 0x40 .. 0x5e, 0xd801, 0x60 .. 0x7f ], # bad: missing + Lo surrog. danglelo => [ 0x40 .. 0x5e, 0xdc01, 0x60 .. 0x7f ], # bad: missing + Hi surrog. invsplit => [ 0x40 .. 0x5e, 0xdc01, 0xd801, 0x61 .. 0x7f ], # two + surrog. errors # same as above, but not on a record boundary: okmid => [ 0x40 .. 0x4e, 0xd801, 0xdc01, 0x51 .. 0x7f ], # goo +d surrog. pair strandhi => [ 0x40 .. 0x4e, 0xd801, 0x50 .. 0x7f ], # bad: missing + Lo surrog. strandlo => [ 0x40 .. 0x4e, 0xdc01, 0x50 .. 0x7f ], # bad: missing + Hi surrog. invmid => [ 0x40 .. 0x4e, 0xdc01, 0xd801, 0x51 .. 0x7f ], # two + surrog. errors ); for my $type ( qw/normal okmid oksplit strandhi strandlo invmid danglehi danglelo invsplit/ ) { warn "\nRunning test on $type;\n"; print "\nRunning test on $type:\n"; my $string = pack( 'v*', @{$test_sets{$type}} ); my $u = ''; { open my $fh, "<", \$string or die $!; my $pass = 1; $_ = ''; while ( read( $fh, $_, 64, length())) { eval { $u .= decode( "UTF-16LE", $_, Encode::FB_WARN ) }; if ( $@ ) { warn sprintf( "on pass %d: %s; leaving %d bytes: \n", $pass, $@, length(), join( " ", unpack( +"v*", $_ ))); } $pass++; } } print "\n$u\n"; }
    If you try that out (redirecting stdout to a file), you'll see that valid surrogate pairs are handled nicely, whether they are record-internal or split across consecutive records, But if there is an improper surrogate anywhere in a given string, "decode()" does not return anything at all, and the entire string is left unprocessed.

    It looks to me like you'll need to use ikegami's approach of fixing the data to remove bad surrogate values before you try to decode from utf-16 to utf-8. Or at least, you'll need to use an eval block like the one above, and fix the input string whenever $@ indicates a surrogate error.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://719833]
Approved by Narveson
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (7)
As of 2014-12-22 00:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (109 votes), past polls