PerlMonks  

Re^2: Handling malformed UTF-16 data with PerlIO layer

by almut (Canon)
on Oct 27, 2008 at 22:38 UTC ( [id://719851] )


in reply to Re: Handling malformed UTF-16 data with PerlIO layer
in thread Handling malformed UTF-16 data with PerlIO layer

Why don't you fix the bad files instead of having your program handle them?

...mostly because I'd rather avoid having to get down to the encoding nitty-gritties, if there is some 'proper' way of doing it with Perl's built-in encoding support.  For example, the ad-hoc approach you've shown would also replace valid surrogate pairs, which I'd rather keep, if possible (just in case). Sure, the regex could presumably be fixed to handle this (using lookahead), but this would be kind of reinventing the wheel...  OTOH, it looks like the best workaround for the issue so far — So, thanks!


Replies are listed 'Best First'.
Re^3: Handling malformed UTF-16 data with PerlIO layer
by ikegami (Patriarch) on Oct 28, 2008 at 00:24 UTC
    Lookahead alone won't do because the pair might be cut into two reads. It does make things more complicated.

    I don't know anything about surrogates. I assumed the following:

    • hi followed by lo = ok
    • hi not followed by lo = bad
    • lo not preceded by hi = bad
    #!/usr/bin/perl
    # usage:
    # fix_surrogates.pl < infile > outfile

    # Hi Surrogate: D800-DBFF
    # Lo Surrogate: DC00-DFFF

    use strict;
    use warnings;

    binmode STDIN;   # Disable :crlf
    binmode STDOUT;  # Disable :crlf

    my $read_size = 16*1024;

    my $valid_pat = qr/
        .[^\xD8-\xDF]
      | .[\xD8-\xDB].[\xDC-\xDF]
    /xs;

    my $invalid_pat = qr/
        .[\xDC-\xDF]
      | .[\xD8-\xDB](?=.[^\xDC-\xDF])
    /xs;

    my $ibuf = '';
    my $obuf = '';
    for (;;) {
        my $rv = read(STDIN, $ibuf, $read_size, length($ibuf));
        die("$!\n") if !defined($rv);
        last if !$rv;

        for ($ibuf) {
            /\G ($valid_pat+) /xgc
                && do { $obuf .= $1; };
            /\G $invalid_pat /xgc
                && do { $obuf .= "\xFD\xFF"; redo };
        }

        print($obuf);
        $ibuf = substr($ibuf, pos($ibuf)||0);
        $obuf = '';
    }

    $ibuf =~ s/..?/\xFD\xFF/sg;
    print($ibuf);

    Update: Tested. Fixed character class that wasn't negated as it should have been.

    >type testdata.pl
    binmode STDOUT;
    my $hi = "\xF4\xDB";
    my $lo = "\xE2\xDE";
    print
        "a\0" . $hi . $lo . "b\0" . "\n\0",
        "c\0" . $hi . "c\0" . "d\0" . "\n\0",
        "e\0" . $lo . "f\0" . "g\0" . "\n\0";

    >perl testdata.pl | perl fix_surrogates.pl | perl -0777 -pe"BEGIN { binmode STDIN, ':encoding(UTF-16le)'; binmode STDOUT, ':encoding(US-ASCII)' }"
    "\x{10d2e2}" does not map to ascii, <> chunk 1.
    "\x{fffd}" does not map to ascii, <> chunk 1.
    "\x{fffd}" does not map to ascii, <> chunk 1.
    a\x{10d2e2}b
    c\x{fffd}cd
    e\x{fffd}fg

      Thank you very much, again, for actually working out the details. I think I'll go with that approach — unless someone has a better suggestion...

      That said, my gut feelings of unease still hold about reimplementing a parser for an encoding I possibly have not fully understood (e.g. what are private-use high-surrogates, really? ...and who knows what else there might be).

        (e.g. what are private-use high-surrogates, really? ...and who knows what else there might be).

        There is no such thing as "private-use high-surrogates". There is a region of the Unicode space reserved for "private use" (E000 through F8FF), and there is the region set aside for "surrogates" (D800 through DFFF). There's also a "supplementary private use" area running from F0000 through 10FFFF, which is not relevant here (note the extra digits).

        There is no "supplemental surrogates" area -- the surrogate region is "special" and unique, reserved specifically so that UTF-16 encodings have a way of representing code points above FFFF (in much the same way that byte-oriented utf8 handles code points above FF).

        In effect, UTF-16 is a "variable-width" encoding whenever code points above FFFF are in use -- such "higher-plane" code points must be expressed via two 16-bit values. The highest Unicode code point is 10FFFF (21 bits), but the 16 "upper planes" run only from 010000 through 10FFFF, so subtracting 10000 leaves exactly 20 "significant" bits, which are split over two 16-bit words. The high 6 bits of each word are rigidly fixed: the first word of a surrogate pair must begin with 110110 (D800-DBFF, carrying the "high" 10 bits), and the second with 110111 (DC00-DFFF, carrying the "low" 10 bits).
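The split into a surrogate pair can be sketched in a few lines of Perl (the helper name to_surrogates is made up for illustration, not part of any module):

```perl
use strict;
use warnings;

# Illustration of the bit arithmetic above; to_surrogates is a
# made-up helper name, not part of any module.
sub to_surrogates {
    my ($cp) = @_;                  # a code point above 0xFFFF
    my $v  = $cp - 0x10000;         # leaves the 20 significant bits
    my $hi = 0xD800 | ($v >> 10);   # high 10 bits -> D800-DBFF
    my $lo = 0xDC00 | ($v & 0x3FF); # low 10 bits  -> DC00-DFFF
    return ($hi, $lo);
}

# The code point from ikegami's test data above:
my ($hi, $lo) = to_surrogates(0x10D2E2);
printf "%04X %04X\n", $hi, $lo;     # prints "DBF4 DEE2"
```

Written little-endian, those two values are exactly the "\xF4\xDB" and "\xE2\xDE" byte pairs used in the testdata.pl script above.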

        This serves to explain why you cannot convert a 16-bit value in the surrogate range into a utf8 character -- no characters (no code points) can be defined within that range of 16-bit values. But when a code point above FFFF is correctly encoded into UTF-16, you get surrogates (a pair of 16-bit values, one each in the "High" and "Low" regions of the surrogate range).
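Going the other way, Encode reassembles a well-formed pair into the single high-plane code point (a quick sanity check, not from the thread):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The little-endian byte pairs DBF4 and DEE2 form a valid hi/lo
# surrogate pair, so decode() turns them into one code point.
my $bytes = "\xF4\xDB\xE2\xDE";
my $str   = decode('UTF-16LE', $bytes);
printf "U+%X\n", ord($str);   # prints "U+10D2E2"
```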

        Regarding ikegami's observation about FFFE and FFFF, I noticed that this is a difference between 5.8.8 and 5.10.0 -- Encode handles these code points in 5.8 but it spits out the error in 5.10. It's certainly true that Unicode explicitly reserves these values as "non-characters." I'm not sure whether 5.8 or 5.10 has the better approach, and I sort of expect that it might depend on the circumstances. I looked for something about this in perldelta, but didn't see anything explicit.

        In addition to those two "non-character" code points, the same result applies to the range FDD0 - FDEF. According to the Unicode reference page, "These codes are intended for process-internal uses, but are not permitted for interchange." I don't really know what "process-internal uses" means (but "not permitted for interchange" seems pretty clear).

        In any case, here's a test script for identifying all the unsavory (error-inducing) 16-bit values -- you can run this in both 5.8.8 and 5.10.0 to see how the two versions differ in their behavior.

        I think the "eval" technique here might be a decent approach for what you need to do with your data. I'm afraid you'll need to ditch the idea of using the PerlIO::encoding layer and should probably go with reading into a fixed-size buffer. Check out the description of FB_WARN in the Encode man page: it handles the case where a fixed-size buffer read leaves a partial character at the end of a given buffer.
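A minimal sketch of that buffer-reading approach, assuming Encode::FB_QUIET semantics (FB_WARN behaves the same but also emits a warning); the function name decode_utf16le_stream is made up, and it assumes the only breakage is a character split across chunk boundaries:

```perl
use strict;
use warnings;
use Encode ();

# Sketch only: decode a UTF-16LE stream in fixed-size chunks.
# Encode::FB_QUIET stops at the first problem spot and leaves the
# unprocessed bytes in $buf, so a character split across a chunk
# boundary is completed by the next read.
sub decode_utf16le_stream {
    my ($fh, $cb) = @_;        # $cb is called with each decoded chunk
    my $buf = '';
    while (read($fh, $buf, 16*1024, length $buf)) {
        $cb->(Encode::decode('UTF-16LE', $buf, Encode::FB_QUIET));
    }
    return $buf;               # bytes that never decoded (EOF leftovers)
}
```

Usage would be e.g. decode_utf16le_stream(\*STDIN, sub { print $_[0] }). Note that genuinely malformed data mid-stream makes the leftover buffer grow until EOF, which is exactly why something like the surrogate-fixing script above would need to run first.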

        use Encode;

        binmode STDOUT, ":utf8";
        binmode STDERR, ":utf8";

        for (0x0..0xffff) {
            $c = pack( "v", $_ );
            eval { $u = decode( "UTF-16LE", $c, Encode::FB_WARN ) };
            if ( $@ ) {
                warn $@;
                print "\x{FEFF}\n";
            }
            else {
                $u = '\\n' if ( $u eq "\n" );  # just so LF doesn't show up as two lines
                print "$u\n";
            }
        }

        and who knows what else there might be

        U+FFFE and U+FFFF are invalid.

        >perl -e"print qq{\xFE\xFF}" | perl -e"binmode STDIN, ':encoding(UTF-16le)'; <>"
        UTF-16LE:Unicode character fffe is illegal at -e line 1.
        >perl -e"print qq{\xFF\xFF}" | perl -e"binmode STDIN, ':encoding(UTF-16le)'; <>"
        UTF-16LE:Unicode character ffff is illegal at -e line 1.

        Same for UCS-2.

        There could be more.
