http://www.perlmonks.org?node_id=719833

almut has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,
I would like to read UTF-16 files which occasionally contain malformed data, in particular malformed surrogate pairs (such as for example DBF4 (a "high surrogate") not followed by an expected "low surrogate" in the range DC00 - DFFF).

The straightforward approach of opening the file with

open my $fh, "<:encoding(UTF-16)", "somefile.utf16le" or die "...: $!" +;

unfortunately croaks with the error

UTF-16:Malformed LO surrogate dbf4 at ...

As I gleaned from the PerlIO::encoding docs, one solution might be to set $PerlIO::encoding::fallback = Encode::FB_DEFAULT in order to make the PerlIO layer assume the default behavior of the routine Encode::decode(), which is to replace the malformed character with the (valid) code point U+FFFD.  For example

use Encode; use PerlIO::encoding; $PerlIO::encoding::fallback = Encode::FB_DEFAULT; # BTW, why is FB_DEFAULT not the default +? my $malformed = "a\x00b\x00c\x00\xF4\xDBd\x00e\x00f\x00\n\x00" . "g\x00h\x00i\x00\n\x00"; # = "abc<some junk>def\nghi\n" in UTF-16LE open my $fh, "<:encoding(UTF-16LE)", \$malformed or die $!; while ( my $u = <$fh> ) { print $u; } close $fh;

Although this does work to some degree (i.e. it no longer croaks and does in fact translate malformed characters to FFFD), there are some irritating behaviors.

Most importantly, the above while loop becomes an endless loop, i.e. it starts reading from the beginning of the file after having reached its end. In other words, the output produced is

abc?ef ghi abc?ef ghi ... (repeated ad infinitum)

(This seems to be a bug, unless there's something specifically wrong with my perls — tried it with 5.10.0 and 5.8.8, both x86_64-linux.)

Some other secondary issues are:

BTW, I'm using the fancy open-stringref form just for purposes of easy demo. In real life I'm opening a regular file. Same behavior.

Any ideas how to get this working, preferably without resorting to reading the file in binary and doing the decoding myself (which I'd like to avoid for various reasons)? Bonus points, if the 'd' doesn't get swallowed... :)   Thanks!

Replies are listed 'Best First'.
Re: Handling malformed UTF-16 data with PerlIO layer
by ikegami (Patriarch) on Oct 27, 2008 at 21:49 UTC
    Why don't you fix the bad files instead of having your program handle them?
    #!/usr/bin/perl # usage: # rem_surrogate.pl < infile > outfile use strict; use warnings; binmode STDIN; # Disable :crlf binmode STDOUT; # Disable :crlf my $read_size = 16*1024; my $buf = ''; for (;;) { my $rv = read(STDIN, $buf, $read_size, length($buf)); die("$!\n") if !defined($rv); last if !$rv; $_ = substr($buf, 0, int(length($buf)/2)*2, ''); s/\G(.)(?:[\xD8-\xDF]|(.))/ defined($2) ? $1.$2 : "\xFD\xFF" /esg; print; } print("\xFD\xFF") if length($buf);
      Why don't you fix the bad files instead of having your program handle them?

      ...mostly because I'd rather avoid having to get down to the encoding nitty-gritties, if there is some 'proper' way of doing it with Perl's built-in encoding support.  For example, the ad-hoc approach you've shown would also replace valid surrogate pairs, which I'd rather keep, if possible (just in case). Sure, the regex could presumably be fixed to handle this (using lookahead), but this would be kind of reinventing the wheel...  OTOH, it looks like the best workaround for the issue so far — So, thanks!

        Lookahead alone won't do because the pair might be cut into two reads. It does make things more complicated.

        I don't know anything about surrogates. I assumed the following:

        • hi followed by lo = ok
        • hi not followed by lo = bad
        • lo not preceeded by hi = bad
        #!/usr/bin/perl # usage: # fix_surrogates.pl < infile > outfile # Hi Surrogate: D800-DBFF # Lo Surrogate: DC00-DFFF use strict; use warnings; binmode STDIN; # Disable :crlf binmode STDOUT; # Disable :crlf my $read_size = 16*1024; my $valid_pat = qr/ .[^\xD8-\xDF] | .[\xD8-\xDB].[\xDC-\xDF] /xs; my $invalid_pat = qr/ .[\xDC-\xDF] | .[\xD8-\xDB](?=.[^\xDC-\xDF]) /xs; my $ibuf = ''; my $obuf = ''; for (;;) { my $rv = read(STDIN, $ibuf, $read_size, length($ibuf)); die("$!\n") if !defined($rv); last if !$rv; for ($ibuf) { /\G ($valid_pat+) /xgc && do { $obuf .= $1; }; /\G $invalid_pat /xgc && do { $obuf .= "\xFD\xFF"; redo }; } print($obuf); $ibuf = substr($ibuf, pos($ibuf)||0); $obuf = ''; } $ibuf =~ s/..?/\xFD\xFF/sg; print($ibuf);

        Update: Tested. Fixed character class that wasn't negated as it should have been.

        >type testdata.pl binmode STDOUT; my $hi = "\xF4\xDB"; my $lo = "\xE2\xDE"; print "a\0" . $hi . $lo . "b\0" . "\n\0", "c\0" . $hi . "c\0" . "d\0" . "\n\0", "e\0" . $lo . "f\0" . "g\0" . "\n\0"; >perl testdata.pl | perl fix_surrogates.pl | perl -0777 -pe"BEGIN { bi +nmode STDIN, ':encoding(UTF-16le)'; binmode STDOUT, ':encoding(US-ASC +II)' }" "\x{10d2e2}" does not map to ascii, <> chunk 1. "\x{fffd}" does not map to ascii, <> chunk 1. "\x{fffd}" does not map to ascii, <> chunk 1. a\x{10d2e2}b c\x{fffd}cd e\x{fffd}fg
Re: Handling malformed UTF-16 data with PerlIO layer
by graff (Chancellor) on Oct 28, 2008 at 08:19 UTC
    I was hoping that a test like this would point the way to a decent solution, but having tried it on 5.8.8, it doesn't show the results I would want:
    use strict; use Encode; binmode STDOUT, ":utf8"; my %test_sets = ( normal => [ 0x40 .. 0x7f ], # normal ascii range of 64 characte +rs # puta some surrogate data on a record boundary: oksplit => [ 0x40 .. 0x5e, 0xd801, 0xdc01, 0x61 .. 0x7f ], # goo +d surrog. pair danglehi => [ 0x40 .. 0x5e, 0xd801, 0x60 .. 0x7f ], # bad: missing + Lo surrog. danglelo => [ 0x40 .. 0x5e, 0xdc01, 0x60 .. 0x7f ], # bad: missing + Hi surrog. invsplit => [ 0x40 .. 0x5e, 0xdc01, 0xd801, 0x61 .. 0x7f ], # two + surrog. errors # same as above, but not on a record boundary: okmid => [ 0x40 .. 0x4e, 0xd801, 0xdc01, 0x51 .. 0x7f ], # goo +d surrog. pair strandhi => [ 0x40 .. 0x4e, 0xd801, 0x50 .. 0x7f ], # bad: missing + Lo surrog. strandlo => [ 0x40 .. 0x4e, 0xdc01, 0x50 .. 0x7f ], # bad: missing + Hi surrog. invmid => [ 0x40 .. 0x4e, 0xdc01, 0xd801, 0x51 .. 0x7f ], # two + surrog. errors ); for my $type ( qw/normal okmid oksplit strandhi strandlo invmid danglehi danglelo invsplit/ ) { warn "\nRunning test on $type;\n"; print "\nRunning test on $type:\n"; my $string = pack( 'v*', @{$test_sets{$type}} ); my $u = ''; { open my $fh, "<", \$string or die $!; my $pass = 1; $_ = ''; while ( read( $fh, $_, 64, length())) { eval { $u .= decode( "UTF-16LE", $_, Encode::FB_WARN ) }; if ( $@ ) { warn sprintf( "on pass %d: %s; leaving %d bytes: \n", $pass, $@, length(), join( " ", unpack( +"v*", $_ ))); } $pass++; } } print "\n$u\n"; }
    If you try that out (redirecting stdout to a file), you'll see that valid surrogate pairs are handled nicely, whether they are record-internal or split across consecutive records, But if there is an improper surrogate anywhere in a given string, "decode()" does not return anything at all, and the entire string is left unprocessed.

    It looks to me like you'll need to use ikegami's approach of fixing the data to remove bad surrogate values before you try to decode from utf-16 to utf-8. Or at least, you'll need to use an eval block like the one above, and fix the input string whenever $@ indicates a surrogate error.