PerlMonks  

Re^2: Handling malformed UTF-16 data with PerlIO layer

by almut (Canon)
on Oct 27, 2008 at 22:38 UTC ( [id://719851] )


in reply to Re: Handling malformed UTF-16 data with PerlIO layer
in thread Handling malformed UTF-16 data with PerlIO layer

Why don't you fix the bad files instead of having your program handle them?

...mostly because I'd rather avoid having to get down to the encoding nitty-gritties, if there is some 'proper' way of doing it with Perl's built-in encoding support.  For example, the ad-hoc approach you've shown would also replace valid surrogate pairs, which I'd rather keep, if possible (just in case). Sure, the regex could presumably be fixed to handle this (using lookahead), but this would be kind of reinventing the wheel...  OTOH, it looks like the best workaround for the issue so far — So, thanks!


Replies are listed 'Best First'.
Re^3: Handling malformed UTF-16 data with PerlIO layer
by ikegami (Patriarch) on Oct 28, 2008 at 00:24 UTC
    Lookahead alone won't do because the pair might be cut into two reads. It does make things more complicated.

    I don't know anything about surrogates. I assumed the following:

    • hi followed by lo = ok
    • hi not followed by lo = bad
    • lo not preceded by hi = bad
    #!/usr/bin/perl
    # usage:
    # fix_surrogates.pl < infile > outfile

    # Hi Surrogate: D800-DBFF
    # Lo Surrogate: DC00-DFFF

    use strict;
    use warnings;

    binmode STDIN;   # Disable :crlf
    binmode STDOUT;  # Disable :crlf

    my $read_size = 16*1024;

    my $valid_pat = qr/
        .[^\xD8-\xDF]
      | .[\xD8-\xDB].[\xDC-\xDF]
    /xs;

    my $invalid_pat = qr/
        .[\xDC-\xDF]
      | .[\xD8-\xDB](?=.[^\xDC-\xDF])
    /xs;

    my $ibuf = '';
    my $obuf = '';
    for (;;) {
        my $rv = read(STDIN, $ibuf, $read_size, length($ibuf));
        die("$!\n") if !defined($rv);
        last if !$rv;

        for ($ibuf) {
            /\G ($valid_pat+) /xgc
                && do { $obuf .= $1; };
            /\G $invalid_pat /xgc
                && do { $obuf .= "\xFD\xFF"; redo };
        }

        print($obuf);
        $ibuf = substr($ibuf, pos($ibuf)||0);
        $obuf = '';
    }

    $ibuf =~ s/..?/\xFD\xFF/sg;
    print($ibuf);

    Update: Tested. Fixed character class that wasn't negated as it should have been.

    >type testdata.pl
    binmode STDOUT;
    my $hi = "\xF4\xDB";
    my $lo = "\xE2\xDE";
    print
        "a\0" . $hi . $lo . "b\0" . "\n\0",
        "c\0" . $hi . "c\0" . "d\0" . "\n\0",
        "e\0" . $lo . "f\0" . "g\0" . "\n\0";

    >perl testdata.pl | perl fix_surrogates.pl | perl -0777 -pe"BEGIN { binmode STDIN, ':encoding(UTF-16le)'; binmode STDOUT, ':encoding(US-ASCII)' }"
    "\x{10d2e2}" does not map to ascii, <> chunk 1.
    "\x{fffd}" does not map to ascii, <> chunk 1.
    "\x{fffd}" does not map to ascii, <> chunk 1.
    a\x{10d2e2}b
    c\x{fffd}cd
    e\x{fffd}fg

      Thank you very much, again, for actually working out the details. I think I'll go with that approach — unless someone has a better suggestion...

      That said, my gut feelings of unease still hold about reimplementing a parser for an encoding I possibly have not fully understood (e.g. what are private-use high-surrogates, really? ...and who knows what else there might be).

        (e.g. what are private-use high-surrogates, really? ...and who knows what else there might be).

        There is no such thing as "private-use high-surrogates". There is a region of the Unicode space reserved for "private use" (E000 through F8FF), and there is the region set aside for "surrogates" (D800 through DFFF). There's also a "supplementary private use" area running from F0000 through 10FFFF, which is not relevant here (note the extra digits).

        There is no "supplemental surrogates" area -- the surrogate region is "special" and unique, reserved specifically so that UTF-16 encodings have a way of representing code points above FFFF (in much the same way that byte-oriented utf8 handles code points above FF).

        In effect, UTF-16 is a "variable-width" encoding whenever code points above FFFF are in use -- such "higher-plane" code points must be expressed via two 16-bit values. The highest Unicode code point is 10FFFF (21 bits), but the 16 "upper planes" run only from 010000 through 10FFFF, so subtracting 10000 leaves exactly 20 "significant" bits, which are split over two 16-bit words. The high 6 bits of each word are rigidly fixed: the first word of a surrogate pair must begin with 110110 (D800-DBFF, carrying the "high" 10 bits), and the second with 110111 (DC00-DFFF, carrying the "low" 10 bits).
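The split into a surrogate pair can be sketched in a few lines of Perl (the helper name to_surrogates is made up for illustration, not part of any module):

```perl
use strict;
use warnings;

# Illustration of the bit arithmetic above; to_surrogates is a
# made-up helper name, not part of any module.
sub to_surrogates {
    my ($cp) = @_;                  # a code point above 0xFFFF
    my $v  = $cp - 0x10000;         # leaves the 20 significant bits
    my $hi = 0xD800 | ($v >> 10);   # high 10 bits -> D800-DBFF
    my $lo = 0xDC00 | ($v & 0x3FF); # low 10 bits  -> DC00-DFFF
    return ($hi, $lo);
}

# The code point from ikegami's test data above:
my ($hi, $lo) = to_surrogates(0x10D2E2);
printf "%04X %04X\n", $hi, $lo;     # prints "DBF4 DEE2"
```

Written little-endian, those two values are exactly the "\xF4\xDB" and "\xE2\xDE" byte pairs used in the testdata.pl script above.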

        This serves to explain why you cannot convert a 16-bit value in the surrogate range into a utf8 character -- no characters (no code points) can be defined within that range of 16-bit values. But when a code point above FFFF is correctly encoded into UTF-16, you get surrogates (a pair of 16-bit values, one each in the "High" and "Low" regions of the surrogate range).
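Going the other way, Encode reassembles a well-formed pair into the single high-plane code point (a quick sanity check, not from the thread):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The little-endian byte pairs DBF4 and DEE2 form a valid hi/lo
# surrogate pair, so decode() turns them into one code point.
my $bytes = "\xF4\xDB\xE2\xDE";
my $str   = decode('UTF-16LE', $bytes);
printf "U+%X\n", ord($str);   # prints "U+10D2E2"
```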

        Regarding ikegami's observation about FFFE and FFFF, I noticed that this is a difference between 5.8.8 and 5.10.0 -- Encode handles these code points in 5.8 but it spits out the error in 5.10. It's certainly true that Unicode explicitly reserves these values as "non-characters." I'm not sure whether 5.8 or 5.10 has the better approach, and I sort of expect that it might depend on the circumstances. I looked for something about this in perldelta, but didn't see anything explicit.

        In addition to those two "non-character" code points, the same result applies to the range FDD0 - FDEF. According to the Unicode reference page, "These codes are intended for process-internal uses, but are not permitted for interchange." I don't really know what "process-internal uses" means (but "not permitted for interchange" seems pretty clear).

        In any case, here's a test script for identifying all the unsavory (error-inducing) 16-bit values -- you can run this in both 5.8.8 and 5.10.0 to see how the two versions differ in their behavior.

        I think the "eval" technique here might be a decent approach for what you need to do with your data. I'm afraid you'll need to ditch the idea of using the PerlIO::encoding layer and should probably go with reading into a fixed-size buffer. Check out the description of FB_WARN in the Encode man page: it handles the case where a fixed-size buffer read leaves a partial character at the end of a given buffer.
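A minimal sketch of that buffer-reading approach, assuming Encode::FB_QUIET semantics (FB_WARN behaves the same but also emits a warning); the function name decode_utf16le_stream is made up, and it assumes the only breakage is a character split across chunk boundaries:

```perl
use strict;
use warnings;
use Encode ();

# Sketch only: decode a UTF-16LE stream in fixed-size chunks.
# Encode::FB_QUIET stops at the first problem spot and leaves the
# unprocessed bytes in $buf, so a character split across a chunk
# boundary is completed by the next read.
sub decode_utf16le_stream {
    my ($fh, $cb) = @_;        # $cb is called with each decoded chunk
    my $buf = '';
    while (read($fh, $buf, 16*1024, length $buf)) {
        $cb->(Encode::decode('UTF-16LE', $buf, Encode::FB_QUIET));
    }
    return $buf;               # bytes that never decoded (EOF leftovers)
}
```

Usage would be e.g. decode_utf16le_stream(\*STDIN, sub { print $_[0] }). Note that genuinely malformed data mid-stream makes the leftover buffer grow until EOF, which is exactly why something like the surrogate-fixing script above would need to run first.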

        use Encode;

        binmode STDOUT, ":utf8";
        binmode STDERR, ":utf8";

        for (0x0..0xffff) {
            $c = pack( "v", $_ );
            eval { $u = decode( "UTF-16LE", $c, Encode::FB_WARN ) };
            if ( $@ ) {
                warn $@;
                print "\x{FEFF}\n";
            }
            else {
                $u = '\\n' if ( $u eq "\n" );  # just so LF doesn't show up as two lines
                print "$u\n";
            }
        }

        and who knows what else there might be

        U+FFFE and U+FFFF are invalid.

        >perl -e"print qq{\xFE\xFF}" | perl -e"binmode STDIN, ':encoding(UTF-16le)'; <>"
        UTF-16LE:Unicode character fffe is illegal at -e line 1.
        >perl -e"print qq{\xFF\xFF}" | perl -e"binmode STDIN, ':encoding(UTF-16le)'; <>"
        UTF-16LE:Unicode character ffff is illegal at -e line 1.

        Same for UCS-2.

        There could be more.
