PerlMonks  

Re^3: A UTF8 round trip with MySQL

by Joost (Canon)
on Jun 13, 2007 at 20:34 UTC ( [id://621086]=note )


in reply to Re^2: A UTF8 round trip with MySQL
in thread A UTF8 round trip with MySQL

Using "<:utf8" has worked fine for me so far. However, Juerd does know about this stuff, so I'd be interested to know the risks involved.

Replies are listed 'Best First'.
Re^4: A UTF8 round trip with MySQL
by Juerd (Abbot) on Jun 13, 2007 at 20:52 UTC

    I'd be interested to know the risks involved.

    The most obvious risk involved is that your program can halt if you have malformed internal data. The error "Malformed UTF-8 character" is fatal. Less obvious risks include security bugs because things may be interpreted differently at different levels: something may pass an untainting regex, but still be unsafe in a library call. This is because there is no single standard way of dealing with malformed byte sequences. With naive (yet common) C code it can even lead to data corruption.

    The following change is in current blead:

    --- perl-current/pod/perldiag.pod 2007-01-02 19:17:01.000000000 +0100
    +++ mijn/pod/perldiag.pod 2007-03-03 18:12:23.000000000 +0100
    @@ -2263,12 +2263,19 @@
     =item Malformed UTF-8 character (%s)
    -(S utf8) (F) Perl detected something that didn't comply with UTF-8
    -encoding rules.
    +(S utf8) (F) Perl detected a string that didn't comply with UTF-8
    +encoding rules, even though it had the UTF8 flag on.
    -One possible cause is that you read in data that you thought to be in
    -UTF-8 but it wasn't (it was for example legacy 8-bit data). Another
    -possibility is careless use of utf8::upgrade().
    +One possible cause is that you set the UTF8 flag yourself for data that
    +you thought to be in UTF-8 but it wasn't (it was for example legacy
    +8-bit data). To guard against this, you can use Encode::decode_utf8.
    +
    +If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
    +sequences are handled gracefully, but if you use C<:utf8>, the flag is
    +set without validating the data, possibly resulting in this error
    +message.
    +
    +See also L<Encode/"Handling Malformed Data">.

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
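As a rough illustration of the guard the patch text mentions, here is a minimal sketch that decodes a byte string strictly and rejects malformed input instead of trusting it. It uses `Encode::decode` with `Encode::FB_CROAK` rather than `Encode::decode_utf8`, which is a substitution made here for strictness; the helper name `decode_strict` and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string,
# or undef when the bytes are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $text = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text;         # undef if decode() croaked
}

my $good = decode_strict("caf\xC3\xA9");    # valid UTF-8 bytes
my $bad  = decode_strict("\x81\x81");       # stray continuation bytes

print defined $good ? "good: decoded\n"  : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"   : "bad: rejected\n";
```

The point of the design is that validation happens once, at the boundary, so the UTF8 flag is only ever set on data that has been checked.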

      I now get the 'unchecked input' part. And I can sort of understand issues with tainting. About the C code: you're talking about C code that tries to interpret the invalid utf-8, right? Because C's basic string operations don't look at the encoding, so they are just as (un)safe when you send them a non-utf8 marked string with miscellaneous binary data in it.

      update: about the (removed) line: "Another possibility is careless use of utf8::upgrade()."

      That's removed because utf8::upgrade() is always safe (if you start out with valid utf-8 flags), right?

        About the C code: you're talking about C code that tries to interpret the invalid utf-8, right?

        Yes. I was specifically (but implicitly) referring to XS code: C code written to interact with Perl. The UTF8 flag is interpreted as a promise that the buffer will be valid UTF-8. Of course, it would be better to use Perl's macros for UTF-8 handling, but that doesn't work if you're calling a library function that doesn't do SVs but does require valid UTF-8.

        about the (removed) line: "Another possibility is careless use of utf8::upgrade()." That's removed because utf8::upgrade() is always safe (if you start out with valid utf-8 flags), right?

        Exactly. The original author probably confused utf8::upgrade with Encode::_utf8_on.

        Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
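The difference between the two functions can be seen in a small sketch. It assumes a one-byte Latin-1 string as the starting point; the variable names are illustrative only. `utf8::upgrade` re-encodes the internal buffer, while `Encode::_utf8_on` only flips the flag and leaves the bytes alone.

```perl
use strict;
use warnings;
use Encode ();

my $latin1 = "\xE9";    # one character, U+00E9, held internally as one byte

# utf8::upgrade converts the internal buffer to UTF-8 and sets the flag.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
my $up_ok = utf8::is_utf8($upgraded)
         && utf8::valid($upgraded)
         && $upgraded eq $latin1;    # still the same character

# Encode::_utf8_on sets the flag without touching the buffer, so the
# lone byte 0xE9 is now claimed to be UTF-8 although it is not.
my $forced = $latin1;
Encode::_utf8_on($forced);
my $forced_ok = utf8::valid($forced);

print "utf8::upgrade:    ", ($up_ok     ? "consistent" : "inconsistent"), "\n";
print "Encode::_utf8_on: ", ($forced_ok ? "consistent" : "inconsistent"), "\n";
```

The forced string is deliberately never printed or matched against, since operating on it is exactly what can trigger the "Malformed UTF-8 character" error.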

      I realise the quoted text is from blead, but are you saying the :utf8 IO layer in earlier perls (say 5.8.8, for example) just sets the UTF-8 flag without checking the encoding? If so, then I don't understand the following in 5.8.8:

      od -x x.data
      0000000 8181 8282 8383 000a

      use strict;
      use warnings;
      my $fh;
      open ($fh, "<:utf8", "x.data");
      my $img = '';
      while (<$fh>) {$img .= $_;}

      produces

      utf8 "\x81" does not map to Unicode at invalid_utf8.pl line 8, <$fh> line 1.

      but changing the IO layer to :encoding(UTF8) seems to make no difference, other than reporting that same error 6 times, once for each byte.
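For anyone wanting to repeat this comparison on their own perl, the following is a self-contained sketch. It recreates the seven bytes from the od dump in a temporary file and counts the warnings each layer emits while reading. The helper `warnings_for_layer` is hypothetical, and the counts are not stated here because they may differ between Perl versions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Recreate the bytes shown by od: 81 81 82 82 83 83 0a
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\x81\x81\x82\x82\x83\x83\x0a";
close $out or die "close failed: $!";

# Hypothetical helper: read the file through the given layer and
# return how many warnings were emitted while doing so.
sub warnings_for_layer {
    my ($layer) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    open my $fh, "<$layer", $path or die "open failed: $!";
    my $data = '';
    while (my $line = <$fh>) { $data .= $line }
    close $fh;
    return scalar @warnings;
}

my %count = map { $_ => warnings_for_layer($_) } ':utf8', ':encoding(UTF-8)';
printf "%-18s %d warning(s)\n", $_, $count{$_} for sort keys %count;
```

Counting warnings only shows what each layer reports; it does not show whether the resulting string was validated, which is the distinction the perldiag change draws.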
