in reply to A UTF8 round trip with MySQL

That is a nice summary, although the only MySQL specific thing in it is {mysql_enable_utf8 => 1} :-).

You have one misleading bit of information, though:

For Perl to know whether the data it receives from an external source (which could be a string, or binary data such as an image) as a string of bytes or as a UTF-8 string, it uses the internal UTF8 flag.

This is a very dangerous assumption! The UTF8 flag is an internal flag that has nothing to do with anything that is external. If it is set, Perl assumes that it wrote the UTF8 buffer itself, and does no further checks. Blindly setting the UTF8 flag is dangerous because it can lead to internally corrupted scalars: malformed UTF8 data.

The :utf8 layer should not be used on input filehandles. Use :encoding(UTF-8) instead. The _utf8_on function should not be used on external input. Use decode("UTF-8", ...), or possibly decode("UTF8", ...) or decode_utf8(...) instead. You do this correctly.
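
To illustrate the difference, here is a minimal sketch. The sample bytes are a made-up Latin-1 input, and the variable names are hypothetical. It compares strict decoding, which validates the input, with `_utf8_on`, which only flips the flag:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical external input: Latin-1 bytes, which are not valid UTF-8.
my $bytes = "caf\xE9 au lait";

# Strict decoding validates the input; FB_CROAK makes invalid data die.
# LEAVE_SRC keeps decode() from modifying $bytes in place.
my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
my $decode_failed = defined $text ? 0 : 1;
print "decode rejected the input: $@" if $decode_failed;

# Blindly setting the flag performs no validation at all.
my $flagged = $bytes;
Encode::_utf8_on($flagged);
my $flag_set   = utf8::is_utf8($flagged) ? 1 : 0;
my $wellformed = utf8::valid($flagged)   ? 1 : 0;
print "flag set: $flag_set, well-formed: $wellformed\n";
```

The decode call refuses the data up front, while the flagged copy is now a scalar that claims to be UTF-8 but is internally malformed.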

The UTF8 flag indicates only that Perl's internal buffer for the string is UTF-8 encoded; it says nothing about the source or history of the string.
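
A small sketch of that point, using nothing but the core `utf8::` functions (the variable names are just for illustration): the same one-character string can be stored with or without the flag, and Perl treats both copies as equal.

```perl
use strict;
use warnings;

# U+00E9 fits in one byte, so Perl may store it either way internally.
my $plain    = "\xE9";
my $upgraded = $plain;
utf8::upgrade($upgraded);   # switch internal storage to UTF-8; the value is unchanged

my $same_value   = ($plain eq $upgraded)     ? 1 : 0;
my $plain_flag   = utf8::is_utf8($plain)     ? 1 : 0;
my $upgrade_flag = utf8::is_utf8($upgraded)  ? 1 : 0;
print "same value: $same_value, flags: $plain_flag vs $upgrade_flag\n";
# prints "same value: 1, flags: 0 vs 1"
```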

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^2: A UTF8 round trip with MySQL
by clinton (Priest) on Jun 13, 2007 at 20:13 UTC
    Thanks Juerd

    The :utf8 layer should not be used on input filehandles. Use :encoding(UTF-8) instead.

    Why do you say this? It seems at odds with the docs for the open function and perlopentut, both of which give examples using it:

    open(my $fh, "<:utf8", $fn);

    thanks

    Clint

      It seems at odds with the docs for the open function and perlopentut, both of which give examples using it

      Ah, more documentation needs updates then! I'll look into it; thanks for the pointers.

      binmode in perlfunc, in the current development tree, already has the following change:

      -To mark FILEHANDLE as UTF-8, use C<:utf8>.
      +To mark FILEHANDLE as UTF-8, use C<:utf8>.  This will fail on invalid
      +UTF-8 sequences; C<:encoding(UTF-8)> is a safer (but slightly less
      +efficient) choice.
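
As a rough sketch of the safer choice, the following reads a file through `:encoding(UTF-8)`. The temporary file and its contents are made up for the example, and the exact replacement text and warning may differ between Perl and Encode versions, so the sketch only checks that the resulting string is internally consistent:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a hypothetical input file containing a byte that is not valid UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "caf\xE9 au lait\n";
close $out or die "close failed: $!";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# The encoding layer validates the bytes while decoding them.
open my $in, '<:encoding(UTF-8)', $path or die "open failed: $!";
my $line = <$in>;
close $in;

my $consistent = utf8::valid($line) ? 1 : 0;
printf "well-formed: %d, warnings: %d\n", $consistent, scalar @warnings;
```

With the default settings of the encoding layer, the invalid byte should be reported with a warning and replaced by an escaped form rather than ending up as a malformed character in `$line`.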

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        I am not sure what could be safer than failing on invalid data - if invalid data is encountered, failing would be better than e.g. guessing and silently corrupting data.

        I'd be interested to know the risks involved.

        The most obvious risk involved is that your program can halt if you have malformed internal data. The error "Malformed UTF-8 character" is fatal. Less obvious risks include security bugs because things may be interpreted differently at different levels: something may pass an untainting regex, but still be unsafe in a library call. This is because there is no single standard way of dealing with malformed byte sequences. With naive (yet common) C code it can even lead to data corruption.

        The following change is in current blead:

        --- perl-current/pod/perldiag.pod 2007-01-02 19:17:01.000000000 +0100
        +++ mijn/pod/perldiag.pod 2007-03-03 18:12:23.000000000 +0100
        @@ -2263,12 +2263,19 @@

         =item Malformed UTF-8 character (%s)

        -(S utf8) (F) Perl detected something that didn't comply with UTF-8
        -encoding rules.
        +(S utf8) (F) Perl detected a string that didn't comply with UTF-8
        +encoding rules, even though it had the UTF8 flag on.

        -One possible cause is that you read in data that you thought to be in
        -UTF-8 but it wasn't (it was for example legacy 8-bit data). Another
        -possibility is careless use of utf8::upgrade().
        +One possible cause is that you set the UTF8 flag yourself for data that
        +you thought to be in UTF-8 but it wasn't (it was for example legacy
        +8-bit data). To guard against this, you can use Encode::decode_utf8.
        +
        +If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
        +sequences are handled gracefully, but if you use C<:utf8>, the flag is
        +set without validating the data, possibly resulting in this error
        +message.
        +
        +See also L<Encode/"Handling Malformed Data">.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }