Getting mad with CGI::Application and utf8

isync has asked for the wisdom of the Perl Monks concerning the following question:

Monks,
I am in deep trouble here and am pulling my hair out.

I am trying to implement a web application that cleanly operates in utf8 with CGI::Application. Is it possible?

That's where I am:

* My code is in utf8: `use utf8;` at the beginning of application.cgi and all modules.
* A side note: I use
```
use CGI::Fast();
```
to enable FastCGI in my application.cgi.
* In an act of helplessness I added
```
use open ':utf8';
use open ':std';
```
to application.cgi, without really knowing what it does (and actually it seems to do nothing).
* In Application.pm I add `$self->header_add(-charset => 'utf-8');` in `setup { }`.
* After serious trouble with my MySQL data (where I set everything to utf8: db, connection etc.; see this thread) I found out that my Perl script only outputs this data so that it displays correctly in the browser with
```
    my $flagOn = $html->output;
    Encode::_utf8_on($flagOn);
    return $flagOn;
```
Actually, I think that this only works in my Linux browser environment; I saw a hint of malfunction on Windows... esoteric.
* Finally, file uploads came into play. That's when I dropped my conversion layer module for all param stuff and began to manually add `my $param_f = decode("utf8", $q->param("f"));` to all my subroutines. Now, occasionally (this also happens under plain CGI, not only FastCGI, and seems to be connected to overall load), my upload crashes, with CGI.pm (version 3.33) throwing things like "Malformed UTF-8 character (unexpected non-continuation byte 0xd9, 1 byte after start byte 0xee, expected 3 bytes) in index at (eval 139) line 15." in Apache's error.log, and to the browser via Carp: "CGI.pm: Server closed socket during multipart read (client aborted?)." Last idea: is it Apache or my script?

update: I spawned another thread, as the file upload problem seems to be only loosely connected to this problem.

Re: Getting mad with CGI::Application and utf8 by moritz (Cardinal) on Feb 26, 2008 at 10:49 UTC
Did you check if CGI::Fast returns text strings (i.e. with UTF-8 flag set)?

`use open ':utf8'` only affects open (iirc), so it's useless unless you open files. Do you use `binmode STDOUT, ':utf8';`?

Don't use `Encode::_utf8_on($flagOn);`: it's an internal method of the Encode module, and shouldn't be called from the outside. Use `$string = decode_utf8 $string;` instead.

"Now - occasionally! (which it also does under simple cgi not fast cgi operation and seems to be connected to overall load) - my upload crashes with CGI.pm (version 3.33) throwing "Malformed utf8" in apache's error.log"

What does "occasionally" mean? Does it always die for the same set of data? And finally: does the data contain malformed UTF-8?
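To illustrate the `decode_utf8` suggestion, here is a minimal sketch of the usual pattern: decode bytes once where they enter the program and encode once where they leave. The sample string stands in for a form parameter and is an assumption made for illustration; nothing here is specific to CGI::Application or CGI::Fast.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical raw input: "Grüße" as UTF-8 octets
# (u-umlaut is 0xC3 0xBC, sharp s is 0xC3 0x9F).
my $bytes    = "Gr\xC3\xBC\xC3\x9Fe";
my $byte_len = length $bytes;          # length counts octets here

# Input boundary: octets in, characters out.
my $text     = decode_utf8($bytes);
my $char_len = length $text;           # length counts characters here

# Output boundary: characters in, octets out.
my $out = encode_utf8($text);

printf "%d bytes in, %d characters, %d bytes out\n",
    $byte_len, $char_len, length $out;
```

Between the two boundaries, string functions such as `length`, `substr` and regular expressions operate on characters rather than on bytes.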
Re^2: Getting mad with CGI::Application and utf8 by isync (Hermit) on Feb 26, 2008 at 11:51 UTC
First of all, thanks for your reply!

"Did you check if CGI::Fast returns text strings (i.e. with UTF-8 flag set)?"
No, as I don't know offhand how.

"use open ':utf8' only affects open (iirc), so it's useless unless you open files."
OK. But what about caches, like File::Cache? Should I turn on :utf8 for open when I use these, or will the module handle it internally?

"Do you use binmode STDOUT, ':utf8';?"
I did; actually I used `binmode STDOUT, ":encoding(utf8)";`. But then it seems to break CGI-Application-Plugin-CompressGzip! (more hair pulling..) Should I set it again? And should I also set `binmode STDIN, ":encoding(utf8)"` again (or is this redundant with my `my $param_f = decode("utf8", $q->param("f") )` procedure)?

"Don't use Encode::_utf8_on($flagOn); - it's an internal method of the Encode module, and shouldn't be called from the outside. Use $string = decode_utf8 $string; instead."
I know. But I inspected the returned strings (`local $Data::Dumper::Useqq = 1;`) and found out they were properly formed utf8, until they passed through the final stages of CGI::Application, which broke them again. So I tried various solutions and found out that yes, the string was proper utf8 but without the flag on. When I switched it on manually, C::A left the strings alone and seemed to pass them through to the browser stage. (You see, I am deeply woven in trouble...)

update: I now use decode_utf8 and it works just as well.

Regarding test data: I tested with a 30K tar.gz, a 1K Perl script and a 500K mp3 file. Each shows the same problem: they only sometimes come through.
Re^3: Getting mad with CGI::Application and utf8 by moritz (Cardinal) on Feb 26, 2008 at 12:28 UTC
You can check if the UTF-8 flag is set with Devel::Peek, which can dump all the internal flags of a scalar variable. There is also `utf8::is_utf8`, but I somehow suspect that the results might be subtly different (not sure yet, haven't really tried. Perhaps the difference that I thought I had noticed came from somewhere else.)
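As a debugging aid only, the two checks mentioned here can be tried side by side. This is a small sketch with a made-up sample string; `Dump` writes to STDERR, and the flag shows up as `UTF8` in the `FLAGS` line of its output.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Devel::Peek qw(Dump);

my $bytes = "caf\xC3\xA9";         # UTF-8 encoded octets, not yet decoded
my $text  = decode_utf8($bytes);   # decoded character string

# utf8::is_utf8 reports the internal flag; it is available without "use utf8".
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $text_flag  = utf8::is_utf8($text)  ? 1 : 0;
print "bytes flag=$bytes_flag, text flag=$text_flag\n";

# Full internal dump of the decoded scalar (goes to STDERR).
Dump($text);
```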
Re^4: Getting mad with CGI::Application and utf8 by isync (Hermit) on Feb 26, 2008 at 12:49 UTC
Re^4: Getting mad with CGI::Application and utf8 by Juerd (Abbot) on Feb 26, 2008 at 20:47 UTC
Re^5: Getting mad with CGI::Application and utf8 by moritz (Cardinal) on Feb 27, 2008 at 08:42 UTC
Some notes below your chosen depth have not been shown here
Re^2: Getting mad with CGI::Application and utf8 by Juerd (Abbot) on Feb 26, 2008 at 20:41 UTC
"Did you check if CGI::Fast returns text strings (i.e. with UTF-8 flag set)?"

Please note that the absence of the UTF8 flag does not mean that the string is not a text string. Checking if the UTF8 flag is set should be done only by people who know about Perl's Unicode internals. To all other people, it will only add to the confusion. Properly decode and encode, and all will be fine. (Though you may need an occasional utf8::upgrade, see also Unicode::Semantics.)
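The point about the flag can be seen in a short sketch: a string holding a Latin-1 range character is a perfectly good text string with the flag off, and `utf8::upgrade` changes only how it is stored, not what it contains. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;

my $native   = "caf\xE9";    # four characters, stored without the UTF8 flag
my $upgraded = $native;
utf8::upgrade($upgraded);    # same characters, different internal storage

my $same_text   = $native eq $upgraded ? 1 : 0;
my $same_length = length($native) == length($upgraded) ? 1 : 0;
my @flags       = map { utf8::is_utf8($_) ? 1 : 0 } $native, $upgraded;

print "equal=$same_text same_length=$same_length flags=@flags\n";
```

Both scalars compare equal and have the same length, so code that branches on the flag is treating two identical strings differently.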
Re: Getting mad with CGI::Application and utf8 by isync (Hermit) on Feb 26, 2008 at 11:58 UTC
Is there someone out there who has a complete script/recipe which gets CGI::Application working with utf8? I found this excellent post by Andrew which addresses the problem, but I do not have the module writing/namespace mapping skills to get his modules to work with CGI::Application. I tried... But it breaks too many things. C::A::Plugin::DBH uses DBI in an abstraction layer, so I would have to edit/write my own C::A::Plugin::DBH. Then, I use CGI::Fast, which is an abstraction on top of CGI.pm, so I also need a new CGI::Fast, which I tried but had no luck... So many things...
Re: Getting mad with CGI::Application and utf8 by Juerd (Abbot) on Feb 26, 2008 at 20:43 UTC
moritz already pointed out that you shouldn't do `_utf8_on`. The `:utf8` IO layer does the same thing and should also not be used. Using `_utf8_on` or `:utf8` may result in serious malfunction and security holes.
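One checked alternative, sketched here rather than prescribed by the thread, is to decode with `Encode::decode('UTF-8', ...)` and `FB_CROAK`, and to use the `:encoding(UTF-8)` layer on output handles. The helper name `strict_decode` is hypothetical; the malformed sample reuses the byte pair from the error message in the question.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns decoded text, or undef if the octets
# are not valid UTF-8. FB_CROAK makes decode() die on malformed input.
sub strict_decode {
    my ($octets) = @_;                 # a copy, so in-place changes are harmless
    return eval { decode('UTF-8', $octets, FB_CROAK) };
}

my $good = strict_decode("caf\xC3\xA9");
my $bad  = strict_decode("\xEE\xD9");  # start byte followed by a non-continuation byte

# Encoding layer for output instead of the unchecked :utf8 layer.
binmode STDOUT, ':encoding(UTF-8)';

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

Malformed input is rejected at the boundary instead of being flagged as text and failing later in unrelated code.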
Re^2: Getting mad with CGI::Application and utf8 by Anonymous Monk on May 15, 2008 at 18:37 UTC
There is another approach, one I use from time to time. Don't use perl's borken utf-8 support. Pass the argument -C0 to perl at the head of your script, and perl will treat all strings as raw binary. As long as you only break your strings on well-defined boundaries, and don't attempt to naively count characters, it will work fine. If you need to truncate a string at a certain arbitrary point, such as "100 bytes" or so, you will have to make sure it stops at a well-defined UTF-8 endpoint.
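For the byte-oriented approach, truncating at a safe endpoint can be sketched as follows. The helper `truncate_utf8_bytes` is hypothetical and assumes its input is a byte string holding valid UTF-8; it relies on continuation bytes having the bit pattern 10xxxxxx.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a UTF-8 byte string to at most $max bytes
# without splitting a multi-byte sequence.
sub truncate_utf8_bytes {
    my ($bytes, $max) = @_;
    return $bytes if length($bytes) <= $max;
    my $cut = $max;
    # If the first byte after the cut is a continuation byte (0x80..0xBF),
    # the cut falls inside a sequence, so step back to its start byte.
    $cut-- while $cut > 0 && (ord(substr($bytes, $cut, 1)) & 0xC0) == 0x80;
    return substr($bytes, 0, $cut);
}

my $bytes = "ab\xE2\x82\xACcd";    # "ab", euro sign (3 bytes), "cd"
my $cut4  = truncate_utf8_bytes($bytes, 4);   # limit falls inside the euro sign
my $cut5  = truncate_utf8_bytes($bytes, 5);   # limit falls right after it

printf "limit 4 keeps %d bytes, limit 5 keeps %d bytes\n",
    length $cut4, length $cut5;
```

The result may be shorter than the requested limit, since the whole partial sequence is dropped.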
