Unicode perlio error (when multibyte UTF-8 characters are split across block boundaries) -- is it a perl bug or an I'm stupid bug?

DrWhy has asked for the wisdom of the Perl Monks concerning the following question:

Hello Fellow Monks,

We've been having a difficult time with some of our Perl code that processes UTF-8 data. (Who hasn't?) Our code appears to be complaining about malformed UTF-8 data that is not malformed. We are down to thinking that either this is a perl bug (my Googling hasn't turned up any references to this behavior, though) or we are somehow subtly mishandling the data, perhaps accidentally double-encoding it. I've included below a simple perl script that demonstrates this behavior.

use utf8;
use feature ':5.10';

use Encode qw/:fallback_all/;
use PerlIO::encoding;
BEGIN{ $PerlIO::encoding::fallback = FB_WARN; }
use open IO => ':encoding(UTF-8)', ':std';

# show current PerlIO layers of stdout on stderr
my @layers = PerlIO::get_layers(STDOUT, details => 1);
for (my $i=0; $i<@layers; $i+=3) {
    printf STDERR "stdout layer %d: (%s,%s,0x%X)\n", $i/3,
    $layers[$i], $layers[$i+1]||'',$layers[$i+2];
}
print STDERR "\n";

# show [a] that Encode works, and [b] the bytes around the bad area.
sub examine($) {
    my $s = shift;

    my $encoded = Encode::encode( "UTF-8", $s, FB_CROAK );
    warn "# utf-8 encoded octet dump (total=", length($encoded), "):\n";

    my @x = unpack("C*", $encoded);
    for ( my $i=1010; $i<1028; ++$i ) {
        printf STDERR " enc[%d]=0x%02X", $i, $x[$i];
    }
    print STDERR "\n";
}

# create sample test string
my $test = '';
while ( length($test) <= 520 ) { # typically 2-byte codes
    $test .= ' предоставлена';
}
print "encoded: " . (utf8::is_utf8($test) ? 'yes' : 'no') . "\n";
my $len = length $test;
print "length: $len\n";

# this one works
examine($test);
print "$test\n\n";

# add a 1-byte UTF-8 code point (x) in front, and it fails
# in the presence of "use PerlIO::encoding" or "use open IO => .."
$test = 'x' . $test;

print "encoded: " . (utf8::is_utf8($test) ? 'yes' : 'no') . "\n";
$len = length $test;

print "length: $len\n";

examine($test);
print "$test\n\n";

This creates a UTF-8 string (some Cyrillic characters, all of which are two bytes long in UTF-8), prints it out, then prepends a single-byte UTF-8 character and prints it out again. That's the core test, and there are a number of debugging prints that show the length and UTF-8-ness of the string. The key bad behavior is that when the first version of the string is printed, everything's fine, but printing the second version produces a warning that a byte, D0, does not map to UTF-8. Here's the output on our system:

$ perl perlbug.pl
stdout layer 0: (unix,,0xC89200)
stdout layer 1: (perlio,,0xC89200)
stdout layer 2: (encoding,utf-8-strict,0xC89200)

encoded: yes
length: 532
# utf-8 encoded octet dump (total=1026):
 enc1010=0xD1 enc1011=0x81 enc1012=0xD1 enc1013=0x82 enc1014=0xD0 enc1015=0xB0 enc1016=0xD0 enc1017=0xB2 enc1018=0xD0 enc1019=0xBB enc1020=0xD0 enc1021=0xB5 enc1022=0xD0 enc1023=0xBD enc1024=0xD0 enc1025=0xB0Use of uninitialized value in printf at perlbug.pl line 31.
 enc1026=0x00Use of uninitialized value in printf at perlbug.pl line 31.
 enc1027=0x00
 предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена

encoded: yes
length: 533
# utf-8 encoded octet dump (total=1027):
 enc1010=0xBE enc1011=0xD1 enc1012=0x81 enc1013=0xD1 enc1014=0x82 enc1015=0xD0 enc1016=0xB0 enc1017=0xD0 enc1018=0xB2 enc1019=0xD0 enc1020=0xBB enc1021=0xD0 enc1022=0xB5 enc1023=0xD0 enc1024=0xBD enc1025=0xD0 enc1026=0xB0Use of uninitialized value in printf at perlbug.pl line 31.
 enc1027=0x00
"\x{00d0}" does not map to utf8 at perlbug.pl line 59.
x предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена предоставлена

It looks like the key difference between the two is that in the second string the offending D0 byte, which is the first byte of a multibyte UTF-8 character encoding, is exactly the 1024th byte of the encoded string (offset 1023 in the dump), so its continuation byte lands on the other side of a 1024-byte boundary. This implies that the issue is with output buffering not doing the right thing when a multibyte character crosses a buffer boundary.
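The boundary arithmetic can be checked without any I/O layers at all. The following is only a sketch: the 1024-byte block size is an assumption inferred from the dump, not something confirmed from the PerlIO source, and the `%where` hash is just an illustrative name.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $word  = ' предоставлена';   # 1 space + 13 two-byte Cyrillic letters
my $block = 1024;               # ASSUMED buffer size, inferred from the dump
my %where;                      # prefix => where the block boundary falls

for my $prefix ('', 'x') {
    my $bytes = encode('UTF-8', $prefix . ($word x 38));

    # Last byte of the first block and first byte of the next one.
    my ($last, $next) = map { ord substr($bytes, $_, 1) } $block - 1, $block;

    # A byte of the form 10xxxxxx is a UTF-8 continuation byte, so if the
    # next block starts with one, the boundary cuts a character in half.
    $where{$prefix} = ($next & 0xC0) == 0x80 ? 'inside a character'
                                             : 'between characters';

    printf "prefix='%s' total=%d byte[%d]=0x%02X byte[%d]=0x%02X: boundary falls %s\n",
        $prefix, length($bytes), $block - 1, $last, $block, $next, $where{$prefix};
}
```

Without the prefix the boundary sits between two characters; with the 'x' prefix every byte shifts by one and the boundary separates a lead byte from its continuation byte.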

Does this look familiar to anyone, a known bug? Is there a workaround? Or are we handling the UTF-8 string improperly somehow?

The operating environment is perl 5.10.0 on Linux FC 12. We aren't in a position to upgrade at this time, so if this is a bug in Perl we'll need to work around it somehow.
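One workaround that might be worth trying, since examine() above shows that Encode::encode handles the whole string without complaint, is to encode explicitly and write the octets to a handle that has no :encoding layer. This is a sketch under that assumption, not a verified fix for 5.10.0; `print_utf8` is a hypothetical helper, and the in-memory handle only stands in for STDOUT.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $test = 'x' . (' предоставлена' x 38);

# Hypothetical helper: encode the whole string in one call, so no character
# can be split, and croak if something really cannot be encoded.
sub print_utf8 {
    my ($fh, $string) = @_;     # $string is a copy; encode() may modify it
    print {$fh} encode('UTF-8', $string, Encode::FB_CROAK());
}

# In-memory handle standing in for STDOUT. Note: raw bytes, no :encoding layer.
open my $out, '>:raw', \my $buffer
    or die "cannot open in-memory handle: $!";
print_utf8($out, "$test\n");
close $out;

printf "wrote %d bytes\n", length $buffer;
```

For real output the same idea would be binmode(STDOUT, ':raw') followed by print_utf8(\*STDOUT, ...), at the cost of having to remember to encode every string by hand.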

Thanks for any insight the monastery can provide.

--DrWhy

"If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."


Re: Unicode perlio error (when multibyte UTF-8 characters are split across block boundaries) -- is it a perl bug or an I'm stupid bug? by choroba (Cardinal) on Dec 11, 2013 at 19:25 UTC
If you comment out PerlIO::encoding, everything seems to work. Why do you use it?

لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: Unicode perlio error (when multibyte UTF-8 characters are split across block boundaries) -- is it a perl bug or an I'm stupid bug? by DrWhy (Chaplain) on Dec 11, 2013 at 20:56 UTC
I believe all that does is disable the error detection and reporting. It does nothing to fix the underlying issue; it just hides it.

--DrWhy

"If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."
Re^3: Unicode perlio error (when multibyte UTF-8 characters are split across block boundaries) -- is it a perl bug or an I'm stupid bug? (source) by tye (Sage) on Dec 11, 2013 at 22:03 UTC
It indeed appears to be a bug in PerlIO::encoding. See http://cpansearch.perl.org/src/RJBS/perl-5.18.1/ext/PerlIO-encoding/encoding.xs to note that "encode" and "decode" methods are simply called with a buffer full of bytes with no attempts to handle incomplete multi-byte characters across buffer boundaries.

To fix this efficiently, you'd want Encode's encode() and decode() methods (or similar) to support "tell me how many bytes on the end to save for later as they could be incomplete multi-byte characters in the desired encoding". Ah, I see FB_QUIET is already there for just that purpose. Unfortunately, using that completely defeats the purpose of allowing options like FB_WARN and FB_CROAK. Plus I don't see how that code makes it reasonable to detect invalid characters instead of just ending up in an endless loop of converting 0 bytes over and over.

It would be helpful for something similar to FB_QUIET to be defined as a bit that can be combined with FB_WARN or FB_CROAK such that failing at the first byte or (better) too far before the end of the buffer triggers the warn/croak but failing with a single, incomplete fragment of a multi-byte character on the end of the buffer acts like FB_QUIET would.

But surely that's already plenty of information for you to file the bug report, eh?

- tye
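For reference, the FB_QUIET behavior mentioned above can be seen in isolation on the decode side. This sketch follows the pattern given in the Encode documentation for fixed-width buffers; the 1024-byte block size and the variable names are assumptions for illustration, and it does not touch PerlIO::encoding itself.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Same shape as the failing string: 'x' plus 38 copies of the Cyrillic word.
my $text  = 'x' . (' предоставлена' x 38);
my $bytes = encode('UTF-8', $text);

my $block   = 1024;   # ASSUMED block size, chosen to match the dump
my $pending = '';     # undecoded tail carried over from the previous block
my $decoded = '';

for (my $pos = 0; $pos < length $bytes; $pos += $block) {
    $pending .= substr($bytes, $pos, $block);

    # With FB_QUIET, decode() returns what it could convert and overwrites
    # $pending with the unprocessed remainder (here, a lone lead byte).
    $decoded .= decode('UTF-8', $pending, Encode::FB_QUIET());
}

printf "round trip ok: %s, leftover bytes: %d\n",
    ($decoded eq $text ? 'yes' : 'no'), length $pending;
```

As noted in the reply, this stays silent on genuinely invalid input too: bad bytes would simply pile up in $pending, so a real implementation would need some limit on how much leftover it tolerates.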
Re^3: Unicode perlio error (when multibyte UTF-8 characters are split across block boundaries) -- is it a perl bug or an I'm stupid bug? by choroba (Cardinal) on Dec 11, 2013 at 21:38 UTC
Please demonstrate the error. The output of the two versions of the script (with PerlIO::encoding and without it) is exactly the same.

لսႽ� ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^4: Unicode perlio error (when multibyte UTF-8 characters are split across block boundaries) -- is it a perl bug or an I'm stupid bug? by DrWhy (Chaplain) on Dec 31, 2013 at 20:03 UTC

