Re^5: Handling malformed UTF-16 data with PerlIO layer
by graff (Chancellor) on Oct 28, 2008 at 06:33 UTC
(e.g. what are private-use high-surrogates, really? ...and who knows what else there might be).
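There are no private-use characters inside the surrogate range. There is a region of the Unicode space reserved for "private use" (from E000 thru F8FF), and there is the region set aside for "surrogates" (from D800 thru DFFF). The block that Unicode labels "High Private Use Surrogates" (DB80 thru DBFF) is simply the subset of high surrogates that lead off pairs for the "supplementary private use" planes, which run from F0000 thru 10FFFF (note the extra digits). On their own, those 16-bit values are no more meaningful than any other surrogate.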
There is no "supplemental surrogates" area -- the surrogate region is "special" and unique, reserved specifically so that UTF-16 encodings have a way of representing code points above FFFF (in much the same way that byte-oriented utf8 uses multi-byte sequences for code points above 7F).
In effect, UTF-16 is a "variable-width" encoding whenever code points above FFFF are being used: each such "higher-plane" code point must be expressed via two UTF-16 values. The very highest Unicode code point is 10FFFF (21 bits), but the high 5 bits only ever select one of 16 distinct "upper planes" (01 thru 10 hex), so after subtracting 10000 the plane number fits in 4 bits and the whole value fits in 20. The surrogate region provides for those 20 "significant" bits to be split over two 16-bit words, where the high 6 bits of each word are rigidly fixed: the first word of a surrogate pair must start with 110110 (D800-DBFF, carrying the "High" 10 bits), and the second word must start with 110111 (DC00-DFFF, carrying the "Low" 10 bits).
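To make the bit-splitting concrete, here is a small sketch of the arithmetic in plain Perl. The function names to_surrogates and from_surrogates are made up for illustration; they are not part of Encode or any other module.

```perl
use strict;
use warnings;

# Split a code point above FFFF into its UTF-16 surrogate pair.
# (Hypothetical helper, for illustration only.)
sub to_surrogates {
    my ($cp) = @_;
    die sprintf("U+%X is not in the range 10000-10FFFF\n", $cp)
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $v  = $cp - 0x10000;            # leaves 20 significant bits
    my $hi = 0xD800 | ($v >> 10);      # 110110 followed by the top 10 bits
    my $lo = 0xDC00 | ($v & 0x3FF);    # 110111 followed by the bottom 10 bits
    return ($hi, $lo);
}

# Reassemble the code point from a (high, low) surrogate pair.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

printf "U+%X -> %04X %04X\n", 0x1D11E, to_surrogates(0x1D11E);
# prints "U+1D11E -> D834 DD1E"
```

Note that neither half of the pair means anything by itself; only the two words together identify a code point.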
This serves to explain why you cannot convert a 16-bit value in the surrogate range into a utf8 character -- no characters (no code points) can be defined within that range of 16-bit values. But when a code point above FFFF is correctly encoded into UTF-16, you get surrogates (a pair of 16-bit values, one each in the "High" and "Low" regions of the surrogate range).
Regarding ikegami's observation about FFFE and FFFF, I noticed that this is a difference between 5.8.8 and 5.10.0 -- Encode handles these code points in 5.8 but it spits out the error in 5.10. It's certainly true that Unicode explicitly reserves these values as "non-characters." I'm not sure whether 5.8 or 5.10 has the better approach, and I sort of expect that it might depend on the circumstances. I looked for something about this in perldelta, but didn't see anything explicit.
In addition to those two "non-character" code points, the same result applies to the range FDD0 - FDEF. According to the Unicode reference page, "These codes are intended for process-internal uses, but are not permitted for interchange." I don't really know what
In any case, here's a test script for identifying all the unsavory (error-inducing) 16-bit values -- you can run this in both 5.8.8 and 5.10.0 to see how the two versions differ in their behavior.
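The script itself did not survive in this copy of the post, so what follows is only a minimal sketch of that kind of test, not the original code. It assumes big-endian UTF-16 and feeds every 16-bit value to Encode one word at a time; the helper names bad_utf16_units and as_ranges are invented for this example. The list it prints depends on the perl and Encode versions, which is the point of running it under both.

```perl
use strict;
use warnings;
use Encode ();

# Return every 16-bit value that Encode refuses to decode when it is
# presented on its own as a single UTF-16BE word.
sub bad_utf16_units {
    my @bad;
    for my $n (0x0000 .. 0xFFFF) {
        my $bytes = pack 'n', $n;    # one big-endian 16-bit word
        # FB_CROAK makes decode() die on malformed input; eval traps that.
        my $ok = eval {
            Encode::decode('UTF-16BE', $bytes, Encode::FB_CROAK());
            1;
        };
        push @bad, $n unless $ok;
    }
    return @bad;
}

# Collapse a sorted list of integers into "XXXX-YYYY" range strings.
sub as_ranges {
    my @ranges;
    for my $n (@_) {
        if (@ranges && $n == $ranges[-1][1] + 1) { $ranges[-1][1] = $n }
        else                                     { push @ranges, [$n, $n] }
    }
    return map { sprintf '%04X-%04X', @$_ } @ranges;
}

printf "perl %s, Encode %s\n", $], $Encode::VERSION;
print "$_\n" for as_ranges(bad_utf16_units());
```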
I think the "eval" technique here might be a decent approach for what you need to do with your data. I'm afraid you'll need to ditch the idea of using the PerlIO::encoding layer, and should probably go with reading into a fixed-sized buffer. Check out the descriptions of FB_QUIET and FB_WARN in the Encode man page: FB_WARN behaves like FB_QUIET but also issues a warning, and that mode is meant for the case where you are doing fixed-size buffer reads and get a partial character at the end of a given buffer.
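As a rough sketch of how the eval approach could be combined with fixed-size buffer reads: the function below decodes as much of a buffer as it safely can, substitutes U+FFFD for anything Encode rejects, and hands back the undecoded tail so the caller can prepend it to the next read. The name decode_chunk, the choice of U+FFFD as the replacement, and the big-endian assumption are all mine; adjust them to your data. It sidesteps the FB_* return-on-error modes entirely and checks the surrogate ranges by hand.

```perl
use strict;
use warnings;
use Encode ();

# Decode one buffer of UTF-16BE bytes (hypothetical helper).
# Returns (decoded string, leftover bytes to prepend to the next buffer).
sub decode_chunk {
    my ($buf) = @_;
    my ($out, $pos, $len) = ('', 0, length $buf);
    while ($len - $pos >= 2) {
        my $u    = unpack 'n', substr($buf, $pos, 2);
        my $take = 2;
        if ($u >= 0xD800 && $u <= 0xDBFF) {
            # High surrogate: its partner may arrive with the next read.
            last if $len - $pos < 4;
            my $next = unpack 'n', substr($buf, $pos + 2, 2);
            $take = 4 if $next >= 0xDC00 && $next <= 0xDFFF;
        }
        my $piece = substr($buf, $pos, $take);
        # FB_CROAK dies on malformed input; eval turns that into undef.
        my $str = eval { Encode::decode('UTF-16BE', $piece, Encode::FB_CROAK()) };
        $out .= defined $str ? $str : "\x{FFFD}";
        $pos += $take;
    }
    return ($out, substr($buf, $pos));
}

# Demo with an in-memory buffer: "A", a stray low surrogate, "B",
# then a high surrogate whose partner has not been read yet.
my ($text, $rest) = decode_chunk("\x00\x41\xDC\x00\x00\x42\xD8\x3D");
print join(' ', map { sprintf 'U+%04X', ord } split //, $text), "\n";
printf "%d byte(s) carried over\n", length $rest;
```

In a real read loop you would call read() with your buffer size, pass the carried-over bytes plus the new data to decode_chunk each time, and treat anything still left over at end-of-file as malformed.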