Re: Mugged by UTF8, this CANNOT be right

Re^2: Mugged by UTF8, this CANNOT be right by FalseVinylShrub (Chaplain) on Jan 26, 2011 at 19:30 UTC
Hi

ikegami said:

> I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?

I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database. There have been proposals to add encoding support at the DBI level, but I've not heard about them being released yet.

As for why it's not doing what is desired for the O.P... Perhaps a full, self-contained test program would help?

Regards
FalseVinylShrub

Disclaimer: Please review and test code, and use at your own risk... If I answer a question, I would like to hear if and how you solved your problem.
Re^3: Mugged by UTF8, this CANNOT be right by tosh (Scribe) on Jan 26, 2011 at 19:52 UTC
The problem is that it's a very large application, so breaking out anything self-contained is not possible.

What I did notice is that everything was working just fine: my Template Toolkit templates have BOMs, my DB is all UTF8 encoded, my charsets were perfect. Everything worked great, probably because Perl was doing the right thing, but don't forget there's SIX places for UTF8 to get messed up:

1) Template encoding
2) HTTP headers
3) HTML headers
4) DB encoding
5) DB handle
6) The language itself

That's suddenly a lot of room for forgetting one detail that throws everything else off.

With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database. But not only does it have to be Encoded, it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.

And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time, but the problem remains that the only way to be certain seems to be to encode/decode all input and output, and that's just not the way things should work. 10% of my programming should not have to be worrying about this issue.

Tosh
Re^4: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 26, 2011 at 21:01 UTC
> And what I believe is happening is that for 90% of the people out there working with UTF8 the "guessing" that Perl does works most of the time

Perl doesn't guess at encodings, so I don't know what you are referring to.

Data sources typically return bytes since the data source has no idea what the bytes represent. It's the data reader's responsibility to convert those bytes into numbers, text or whatever.

Same goes for output. For example, file handles expect bytes unless configured otherwise. It's the writer's responsibility to convert the output into bytes or to configure the file handle to do the conversion for it.

> but don't forget there's SIX places for UTF8 to get messed up:

At least everywhere data is serialised or deserialised. I think you missed a couple. A more complete list:

Inputs:
- Source code: Decoding.
- HTML form data(?): Decoding.
- Database: Decoding.
- Template: Decoding.

Outputs:
- Database (queries and parameters(?)): Encoding.
- HTML response: Encoding and inclusion of Content-Type header.
- HTTP response: Inclusion of Content-Type in header.
- Error log: Encoding.

The nice thing is that they are all independent. Fixing a problem with one doesn't require the cooperation of others.

> With a small change to the application the internal "guessing" of Perl was suddenly wrong 50% of the time, and the only way to fix it was to Encode EVERY piece of data coming from the database.

No, that wasn't the only way to fix it. Two wrongs made a right, but introduced many other problems. Specifically, it broke `length`, `substr`, regular expressions and much more.

    $ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 0
    1
    1
    $ perl -wE'use utf8; $_=chr(0xA2); utf8::encode($_) if $ARGV[0]; say length; say /^¢\z/ ?1:0' 1
    2
    0

> but it has to be checked FIRST, because if you don't then Encode.pm spews warnings like an 18 year old after a bottle of Jack Daniels.

Good. You're checking for undef, which isn't a string. Encoding something that isn't a string is most definitely an error. I don't know why you mention this.
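To make the "decode at every input, encode at every output" rule concrete, here is a minimal sketch using the core Encode module. The byte string only stands in for data read from a database or socket, the variable names are illustrative, and UTF-8 as the external encoding is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a data source: U+00A2 (cent sign) in UTF-8.
my $bytes = "\xC2\xA2";

# Input boundary: turn bytes into a string of characters.
my $text = decode('UTF-8', $bytes);

# String operations now work on characters, not bytes.
my $chars   = length $text;                   # 1
my $matches = $text =~ /^\x{A2}\z/ ? 1 : 0;   # 1

# Output boundary, option 1: encode explicitly.
my $out = encode('UTF-8', $text);

# Output boundary, option 2: let the file handle do the conversion.
# An in-memory handle is used here so the written bytes can be inspected.
my $buf = '';
open my $fh, '>:encoding(UTF-8)', \$buf or die "open failed: $!";
print {$fh} $text;
close $fh;

# Skipping the decode leaves a two-byte string that no longer matches.
my $raw_len     = length $bytes;                   # 2
my $raw_matches = $bytes =~ /^\x{A2}\z/ ? 1 : 0;   # 0

print "$chars $matches $raw_len $raw_matches\n";
```

Either output option produces the same two bytes; which one to use mostly depends on whether the handle is under your control.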
Re^5: Mugged by UTF8, this CANNOT be right by tosh (Scribe) on Jan 26, 2011 at 21:26 UTC
Re^6: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 00:38 UTC
Re^5: Mugged by UTF8, this CANNOT be right by Jim (Curate) on Jan 27, 2011 at 00:37 UTC
Re^6: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 02:17 UTC
Re^6: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 01:35 UTC
Re^5: Mugged by UTF8, this CANNOT be right by Jim (Curate) on Jan 27, 2011 at 01:10 UTC
Re^6: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 03:12 UTC
Re^4: Mugged by UTF8, this CANNOT be right by FalseVinylShrub (Chaplain) on Jan 26, 2011 at 20:19 UTC
Hi

What version of Perl are you using?

FalseVinylShrub
Re^5: Mugged by UTF8, this CANNOT be right by tosh (Scribe) on Jan 26, 2011 at 20:50 UTC
Re^3: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 26, 2011 at 21:09 UTC
> I suspect the mysql_enable_utf8 is derived from pg_enable_utf8, which simply sets the UTF8 flag on everything that comes back from the database.

Yes, sorry, I got it backwards. I was thinking that `enable_utf8` affected data sent to the DB, but it affects data obtained from the DB.

Either way, it's a very incomplete system. Only UTF-8 is supported (right?), and it's broken when it comes to data sent to the DB.
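When a driver hands back raw bytes, the decoding can also be done by hand right after the fetch. A rough sketch, assuming the database stores UTF-8: `decode_row` is a hypothetical helper, not part of DBI or any DBD, and the array below only simulates a fetched row. NULL columns come back from DBI as undef, so they are passed through rather than decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every defined field of a fetched row.
# undef (a NULL column) is not a string, so it is returned unchanged.
sub decode_row {
    my @row = @_;
    return map { defined $_ ? decode('UTF-8', $_) : undef } @row;
}

# Simulated row of raw bytes: "caf\xC3\xA9" is "café" encoded as UTF-8.
my @raw  = ("caf\xC3\xA9", undef, "plain");
my @text = decode_row(@raw);

print length($text[0]), "\n";   # 4 characters, where the raw field had 5 bytes
```

This must not be combined with a driver option that already decodes, or the text ends up decoded twice.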
Re^2: Mugged by UTF8, this CANNOT be right by mje (Curate) on Jan 27, 2011 at 10:16 UTC
> I was under the impression that none of the DBDs (give the option to) decode text fetched from the database. This is unfortunate, because it means we need to know the encoding the DB uses. Are you saying that some DBDs do decode text?

A number of DBDs decode the data returned from the database, including DBD::ODBC (in a Unicode build of it or when instructed to with handle flags), DBD::Oracle, DBD::Pg and DBD::mysql. There may be others.
Re^3: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 16:54 UTC
I corrected myself already, but it's not nearly as functional as you make it sound. See the linked post.
Re^4: Mugged by UTF8, this CANNOT be right by mje (Curate) on Jan 27, 2011 at 17:34 UTC
Could you point me specifically at the "linked post"? I'm not sure which one you are referring to, so I cannot comment on how functional it is in that respect. Thanks.

However, I am using DBD::ODBC and DBD::Oracle to insert and retrieve Unicode data to/from many databases with none of the encoding issues anyone has mentioned here. As far as I am concerned it does just work so long as you get your database set up correctly.
Re^5: Mugged by UTF8, this CANNOT be right by mje (Curate) on Jan 27, 2011 at 17:37 UTC
Re^6: Mugged by UTF8, this CANNOT be right by ikegami (Patriarch) on Jan 27, 2011 at 17:57 UTC

