PerlMonks

Unicode perlio error (when multibyte UTF-8 characters are split across block boundaries-- is it a perl bug or an I'm stupid bug?
by DrWhy (Chaplain)
on Dec 11, 2013 at 17:33 UTC ( [id://1066669]=perlquestion )
DrWhy has asked for the wisdom of the Perl Monks concerning the following question:

Hello Fellow Monks,

We've been having a difficult time with some of our Perl code that processes UTF-8 data. (Who hasn't?) Our code appears to be complaining about malformed UTF-8 data that is not malformed. We are down to thinking that either this is a perl bug (my Googling hasn't turned up any references to this behavior, though) or we are somehow subtly mishandling the data, maybe accidentally double-encoding it. I've included below a simple perl script that demonstrates this behavior.
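A minimal sketch of such a test, reconstructed from the description and output that follow. It is not the original perlbug.pl, so the line numbers quoted in the warnings will not match it. The helper name `describe`, the leading space before each word (inferred from the octet dump, which ends on the last letter of the word), and the omission of the layer listing are all assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # this file contains literal Cyrillic text
use Encode qw(encode);

# Hypothetical reconstruction of the test, NOT the original script.
binmode(STDOUT, ':encoding(UTF-8)');

# Return debug lines: UTF-8 flag, character length, and the encoded
# octets around the 1024-byte mark (indices are zero-based).
sub describe {
    my ($str) = @_;
    my @bytes = unpack('C*', encode('UTF-8', $str));
    my $last  = $#bytes < 1027 ? $#bytes : 1027;   # do not read past the end
    return (
        'encoded: ' . (utf8::is_utf8($str) ? 'yes' : 'no'),
        'length: ' . length($str),
        '# utf-8 encoded octet dump (total=' . scalar(@bytes) . '):',
        join(' ', map { sprintf 'enc%d=0x%02X', $_, $bytes[$_] } 1010 .. $last),
    );
}

# Assumed test data: 38 copies of a space plus a 13-letter word,
# i.e. 532 characters and 1026 octets once encoded.
my $text = ' предоставлена' x 38;
print "$_\n" for describe($text);
print $text, "\n";

# Prepending one single-byte character shifts every octet by one position.
$text = 'x' . $text;
print "$_\n" for describe($text);
print $text, "\n";
```

Unlike the original, this sketch stops the dump at the last real octet, so it does not trigger the "uninitialized value" warnings seen in the output.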
This creates a UTF-8 string ( some cyrillic characters which are all two bytes long in UTF-8), prints it out, then prepends a single byte UTF-8 character and prints it out again. That's the core test, and there's a number of debugging prints that show the length and utf-8-ness of the string. The key bad behavior is that when the first version of the string is printed, everything's fine, but the second version of the string produces a UTF-8 warning about an non-UTF byte, D0. Here's the output on our system: $ perl perlbug.pl stdout layer 0: (unix,,0xC89200) stdout layer 1: (perlio,,0xC89200) stdout layer 2: (encoding,utf-8-strict,0xC89200) encoded: yes length: 532 # utf-8 encoded octet dump (total=1026): enc1010=0xD1 enc1011=0x81 enc1012=0xD1 enc1013=0x82 enc1014=0xD0 enc1015=0xB0 enc1016=0xD0 enc1017=0xB2 enc1018=0xD0 enc1019=0xBB enc1020=0xD0 enc1021=0xB5 enc1022=0xD0 enc1023=0xBD enc1024=0xD0 enc1025=0xB0Use of uninitialized value in printf at perlbug.pl line 31. enc1026=0x00Use of uninitialized value in printf at perlbug.pl line 31. enc1027=0x00 предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена encoded: yes length: 533 # utf-8 encoded octet dump (total=1027): enc1010=0xBE enc1011=0xD1 enc1012=0x81 enc1013=0xD1 enc1014=0x82 enc1015=0xD0 enc1016=0xB0 enc1017=0xD0 enc1018=0xB2 enc1019=0xD0 enc1020=0xBB enc1021=0xD0 enc1022=0xB5 enc1023=0xD0 enc1024=0xBD enc1025=0xD0 enc1026=0xB0Use of uninitialized value in printf at perlbug.pl line 31. 
enc1027=0x00 "\x{00d0}" does not map to utf8 at perlbug.pl line 59. x предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена It looks like the key difference between the two is that in the second string the offending D0 byte, which is the first byte of a multibyte UTF-8 character encoding, is at exactly the 1024th position in the string. This implies that the issue is with output buffering not doing the right thing when a multibyte character crosses a buffer boundary. Does this look familiar to anyone, a known bug? Is there a workaround? Or are we handling the UTF-8 string improperly somehow? The operating environment is perl 5.10.0 (and we aren't in a position to upgrade at this time, so if this is a bug we'll need to work around it somehow if it's a bug in Perl) on Linux FC 12. Thanks for any insight the monastery can provide --DrWhy "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."
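The buffer-boundary hypothesis can be checked by hand on the encoded octets, independent of any I/O layer. The sketch below rebuilds the second test string (the leading space before each word is an assumption inferred from the octet dump), cuts its UTF-8 encoding at an assumed block size of 1024 bytes, and looks at what sits on either side of the cut. It only illustrates the arithmetic; it does not show what PerlIO actually does internally on perl 5.10.0.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # this file contains literal Cyrillic text
use Encode qw(encode decode);

# Assumed reconstruction of the second test string: 'x' plus 38 copies
# of a space and a 13-letter Cyrillic word (533 characters, 1027 octets).
my $octets = encode('UTF-8', 'x' . (' предоставлена' x 38));

# Assumption: the output buffer is flushed in 1024-byte blocks.
my $block_size = 1024;
my $first = substr($octets, 0, $block_size);
my $rest  = substr($octets, $block_size);

# The cut falls between the lead byte and the continuation byte of one character.
printf "total octets: %d\n", length($octets);
printf "last octet of first block:   0x%02X\n", ord(substr($first, -1));
printf "first octet of second block: 0x%02X\n", ord(substr($rest, 0, 1));

# A strict decode of the first block on its own should reject the
# dangling lead byte; LEAVE_SRC keeps $first unmodified.
my $ok = eval {
    decode('UTF-8', $first, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "first block is valid UTF-8 on its own: ", ($ok ? "yes" : "no"), "\n";
```

If the block size assumption holds, any code that treats each 1024-byte block as a complete unit of UTF-8 would see a lone D0 at the end of the first block, which is consistent with the byte named in the warning.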