Re^3: A UTF8 round trip with MySQL

Re^4: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 20:52 UTC
> I'd be interested to know the risks involved.

The most obvious risk involved is that your program can halt if you have malformed internal data. The error "Malformed UTF-8 character" is fatal.

Less obvious risks include security bugs because things may be interpreted differently at different levels: something may pass an untainting regex, but still be unsafe in a library call. This is because there is no single standard way of dealing with malformed byte sequences. With naive (yet common) C code it can even lead to data corruption.

The following change is in current blead:

    --- perl-current/pod/perldiag.pod 2007-01-02 19:17:01.000000000 +0100
    +++ mijn/pod/perldiag.pod 2007-03-03 18:12:23.000000000 +0100
    @@ -2263,12 +2263,19 @@
     
     =item Malformed UTF-8 character (%s)
     
    -(S utf8) (F) Perl detected something that didn't comply with UTF-8
    -encoding rules.
    +(S utf8) (F) Perl detected a string that didn't comply with UTF-8
    +encoding rules, even though it had the UTF8 flag on.
     
    -One possible cause is that you read in data that you thought to be in
    -UTF-8 but it wasn't (it was for example legacy 8-bit data). Another
    -possibility is careless use of utf8::upgrade().
    +One possible cause is that you set the UTF8 flag yourself for data that
    +you thought to be in UTF-8 but it wasn't (it was for example legacy
    +8-bit data). To guard against this, you can use Encode::decode_utf8.
    +
    +If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
    +sequences are handled gracefully, but if you use C<:utf8>, the flag is
    +set without validating the data, possibly resulting in this error
    +message.
    +
    +See also L<Encode/"Handling Malformed Data">.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
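For readers who want to see the guard the patched text recommends, here is a minimal sketch of decoding untrusted bytes instead of switching the flag on by hand. The helper name `decode_strict` is made up for this illustration; `Encode::decode` and `Encode::FB_CROAK` are the real Encode interface. It assumes you would rather reject malformed input than have it silently substituted.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): returns the decoded text,
# or undef when the bytes are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;    # a private copy, so in-place changes by decode() are harmless
    # FB_CROAK makes decode() die on malformed input; eval turns that into undef.
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $text;
}

my $good = decode_strict("caf\xC3\xA9");     # well-formed: c, a, f, U+00E9
my $bad  = decode_strict("\x81\x81\x82");    # lone continuation bytes

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

Whether to reject, substitute, or log is a policy decision; the point is only that validation happens before the string is ever marked as UTF-8.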
Re^5: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 21:01 UTC
I now get the 'unchecked input' part. And I can sort of understand issues with tainting.

About the C code: you're talking about C code that tries to interpret the invalid utf-8, right? Because C's basic string operations don't look at the encoding, so they are just as (un)safe when you send them a non-utf8 marked string with miscellaneous binary data in it.

Update: about the (removed) line: "Another possibility is careless use of utf8::upgrade()." That's removed because utf8::upgrade() is always safe (if you start out with valid utf-8 flags), right?

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^6: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 21:12 UTC
> About the C code: you're talking about C code that tries to interpret the invalid utf-8, right?

Yes. I was specifically (but implicitly) referring to XS code, C code catered for Perl interaction. The UTF8 flag is interpreted as a promise that the buffer will be valid UTF8. Of course, it would be better to use Perl's macros for UTF8 handling, but that doesn't work if you're calling a library function that doesn't do SVs but does require valid UTF-8.

> about the (removed) line: "Another possibility is careless use of utf8::upgrade()." That's removed because utf8::upgrade() is always safe (if you start out with valid utf-8 flags), right?

Exactly. The original author probably confused utf8::upgrade with Encode::_utf8_on.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
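The difference between the two calls can be checked with a small sketch. The variable names are illustrative only; `utf8::upgrade`, `utf8::is_utf8`, `utf8::valid` and `Encode::_utf8_on` are the real functions. It assumes the starting strings are plain byte strings without the flag.

```perl
use strict;
use warnings;
use Encode ();

# utf8::upgrade changes only the internal representation; the string
# still holds the same single character U+00E9.
my $latin1 = "\xE9";
my $copy   = $latin1;
utf8::upgrade($copy);

print "same characters after upgrade\n" if $copy eq $latin1;
print "upgraded copy is flagged and well-formed\n"
    if utf8::is_utf8($copy) && utf8::valid($copy);

# Encode::_utf8_on only flips the flag and trusts the buffer as it is.
# A lone 0x81 byte is not valid UTF-8, so the promise is broken.
my $forced = "\x81";
Encode::_utf8_on($forced);

print "forced string is flagged but malformed\n"
    if utf8::is_utf8($forced) && !utf8::valid($forced);
```

A string in the second state is the kind that XS code or a C library expecting valid UTF-8 may then misread.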
Re^5: A UTF8 round trip with MySQL by mje (Curate) on Mar 31, 2009 at 10:24 UTC
I realise the quoted text is from blead, but are you saying the :utf8 IO layer in earlier perls (say 5.8.8 for example) just sets the utf-8 flag without checking the encoding? If so, then I don't understand the following in 5.8.8:

`od -x x.data 0000000 8181 8282 8383 000a`

`use strict; use warnings; my $fh; open ($fh, "<:utf8", "x.data"); my $img = ''; while (<$fh>) {$img .= $_;}`

produces

1 utf8 "\x81" does not map to Unicode at invalid_utf8.pl line 8, <$fh> line 1.

but changing the IO layer to :encoding(UTF8) seems to make no difference other than reporting that same error 6 times, once for each byte.
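One way to see what a given perl actually does with such a file is to read it through each layer and inspect the result. The following is only a sketch under assumptions: the file contents are a guess reconstructed from the od dump (six invalid bytes followed by a newline), the helper `probe_layer` is hypothetical, and the numbers it prints can differ between perl versions, so no expected output is shown.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumed reconstruction of x.data from the od dump above.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\x81\x81\x82\x82\x83\x83\n";
close $tmp or die "close: $!";

# Hypothetical helper: read the first line through $layer and report
# whether the result is flagged, whether it is well-formed, and how
# many warnings were raised along the way.
sub probe_layer {
    my ($layer) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, "<$layer", $path or die "open $layer: $!";
    my $line = <$fh>;
    close $fh;
    return {
        flag     => utf8::is_utf8($line) ? 1 : 0,
        valid    => utf8::valid($line)   ? 1 : 0,
        warnings => scalar @warnings,
    };
}

for my $layer (':raw', ':utf8', ':encoding(UTF-8)') {
    my $r = probe_layer($layer);
    printf "%-18s flag=%d valid=%d warnings=%d\n",
        $layer, @{$r}{qw(flag valid warnings)};
}
```

A line that comes back with flag=1 and valid=0 would mean the layer set the flag without producing well-formed data, which is the situation the perldiag patch describes.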

