PerlMonks  

Re: Mugged by UTF8, this CANNOT be right

by ikegami (Patriarch)
on Jan 26, 2011 at 18:13 UTC ( [id://884395]=note )


in reply to Mugged by UTF8, this CANNOT be right

Encode inputs? Inputs should be decoded (automatically or otherwise). Speaking of being new to "accents"...

Are accents a relatively new thing for PERL programmers?

(The language is "Perl".)

They are relatively new to Perl itself. Perl was around long before Unicode. Lots of code predates Unicode support, so there are backwards compatibility issues.

I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?

Replies are listed 'Best First'.
Re^2: Mugged by UTF8, this CANNOT be right
by FalseVinylShrub (Chaplain) on Jan 26, 2011 at 19:30 UTC

    Hi

    ikegami said:

    I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?

    I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database.

    There have been proposals to add encoding support at the DBI level, but I've not heard about them being released yet.

    As for why it's not doing what is desired for the O.P... Perhaps a full, self-contained test program would help?

    Regards

    FalseVinylShrub

    Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.

      The problem is that it's a very large application so to break out anything self-contained is not possible.

      What I did notice is that everything was working just fine: my Template Toolkit templates have BOMs, my DB is all UTF8 encoded, and my charsets were perfect.

      Everything worked great, probably because PERL was doing the right thing, but don't forget there's SIX places for UTF8 to get messed up:
      1) Template encoding
      2) HTTP headers
      3) HTML headers
      4) DB encoding
      5) DB handle
      6) The language itself

      That's suddenly a lot of room for forgetting one detail that throws everything else off.

      With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database. But not only does it have to be Encoded, but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.

      And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time, but the problem remains that it seems that the only way to be certain is to encode/decode all input and output, and that's just not the way things should work. Worrying about this issue should not take up 10% of my programming.

      Tosh

        And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time

        Perl doesn't guess at encodings, so I don't know to what you are referring.

        Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.

        Same goes for output. For example, file handles expect bytes unless configured otherwise. It's the writer's responsibility to convert the output into bytes or to configure the file handle to do the conversion for it.
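The byte/character split described above can be sketched with the core Encode module. This is only a minimal illustration: the sample bytes and the variable names are made up for the example, and UTF-8 is assumed as the encoding on both sides.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as a data source might hand them over: "é" encoded as UTF-8.
my $bytes = "\xC3\xA9";

# Reader's job: decoding turns the two bytes into one character.
my $text = decode('UTF-8', $bytes);

# Writer's job: encoding turns the character back into bytes for output.
my $out = encode('UTF-8', $text);

printf "bytes in: %d, characters: %d, bytes out: %d\n",
    length($bytes), length($text), length($out);
```

Everything between the decode and the encode then works on characters, which is what length, substr and regular expressions expect.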

        but don't forget there's SIX places for UTF8 to get messed up:

        At least everywhere data is serialised or deserialised. I think you missed a couple. A more complete list:

        Inputs:

        • Source code: Decoding.
        • HTML form data(?): Decoding.
        • Database: Decoding.
        • Template: Decoding.

        Outputs:

        • Database (queries and parameters(?)): Encoding.
        • HTML response: Encoding and inclusion of Content-Type header.
        • HTTP response: Inclusion of Content-Type in header.
        • Error log: Encoding.

        The nice thing is that they are all independent. Fixing a problem with one doesn't require the cooperation of others.
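As a rough sketch of how one input and one output boundary can be handled independently, the example below lets PerlIO `:encoding` layers do the conversion. In-memory file handles stand in for a real template file and a real response stream; the sample data and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Stand-in for an input source: the UTF-8 bytes of "¢5\n" held in memory.
my $input_bytes = "\xC2\xA2" . "5\n";

# Input boundary: the :encoding layer decodes while reading.
open(my $in, '<:encoding(UTF-8)', \$input_bytes) or die "open: $!";
my $line = <$in>;
close($in);
chomp $line;

# Output boundary: the :encoding layer encodes while writing.
my $output_bytes = '';
open(my $out, '>:encoding(UTF-8)', \$output_bytes) or die "open: $!";
print {$out} $line;
close($out);

printf "characters read: %d, bytes written: %d\n",
    length($line), length($output_bytes);
```

Neither handle needs to know how the other is configured, so changing the layer on one boundary does not affect the other.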

        With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database.

        No, that wasn't the only way to fix it. Two wrongs made a right, but introduced many other problems. Specifically, it broke length, substr, regular expressions and much more.

        $ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 0
        1
        1
        $ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 1
        2
        0
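The one-liners above cover length and the regex match. The same effect on substr can be sketched as follows; the sample string and variable names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text  = "\x{A2}5";               # two characters: CENT SIGN and "5"
my $bytes = encode('UTF-8', $text);  # three bytes: C2 A2 35

my $first_char = substr($text, 0, 1);   # the whole cent sign
my $first_byte = substr($bytes, 0, 1);  # only the lead byte 0xC2

printf "text:  length %d, first unit U+%04X\n", length($text),  ord($first_char);
printf "bytes: length %d, first unit 0x%02X\n", length($bytes), ord($first_byte);
```

On the encoded string, substr cuts the cent sign in half, which is the kind of breakage that shows up once encoded data is treated as text.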

        but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.

        Good. You're checking for undef, which isn't a string. Encoding something that isn't a string is most definitely an error. I don't know why you mention this.
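One way to keep that undef check in a single place is a small wrapper around the conversion. The sketch below is an assumption about how it might look, not code from the application under discussion: `decode_or_undef` is a hypothetical helper, the row is a stand-in for fetched data, and UTF-8 is assumed as the database encoding. It decodes rather than encodes, since the values are inputs.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a fetched value, passing undef (SQL NULL) through.
sub decode_or_undef {
    my ($bytes) = @_;
    return defined $bytes ? decode('UTF-8', $bytes) : undef;
}

# Stand-in for a fetched row: UTF-8 bytes for "café" and a NULL column.
my @row     = ("caf\xC3\xA9", undef);
my @decoded = map { decode_or_undef($_) } @row;

printf "first column has %d characters; second column is %s\n",
    length($decoded[0]), defined $decoded[1] ? 'defined' : 'undef';
```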

        Hi

        What version of Perl are you using?

        FalseVinylShrub


      I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database.

      Yes, sorry, I got it backwards. I was thinking the enable_utf8 affected data sent to the DB, but it affects data obtained from the DB.

      Either way, it's a very incomplete system. Only UTF-8 is supported (right?), and it's broken when it comes to data sent to the DB.

Re^2: Mugged by UTF8, this CANNOT be right
by mje (Curate) on Jan 27, 2011 at 10:16 UTC
    I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?

    A number of DBDs decode the data returned from the database, including DBD::ODBC (in a Unicode build of it, or when instructed to with handle flags), DBD::Oracle, DBD::Pg and DBD::mysql. There may be others.

      I corrected myself already, but it's not nearly as functional as you make it sound. See the linked post.

        Could you point me specifically at the "linked post"? I'm not sure which one you are referring to, so I cannot comment on how functional it is in that respect. Thanks. However, I am using DBD::ODBC and DBD::Oracle to insert and retrieve Unicode data to and from many databases without any of the encoding issues anyone has mentioned here. As far as I am concerned, it does just work, so long as you get your database set up correctly.
