http://www.perlmonks.org?node_id=670272

isync has asked for the wisdom of the Perl Monks concerning the following question:

Monks,
I am in deep trouble here and am pulling my hair out.

I am trying to implement a web application that cleanly operates in utf8 with CGI::Application. Is it possible?

That's where I am:


update: I spawned another thread, as the file upload problem seems to be only loosely connected to this problem.

Replies are listed 'Best First'.
Re: Getting mad with CGI::Application and utf8
by moritz (Cardinal) on Feb 26, 2008 at 10:49 UTC
    • Did you check if CGI::Fast returns text strings (i.e. with UTF-8 flag set)?
    • use open ':utf8' only affects open (iirc), so it's useless unless you open files.
    • Do you use binmode STDOUT, ':utf8';?
    • Don't use Encode::_utf8_on($flagOn); - it's an internal method of the Encode module, and shouldn't be called from the outside. Use $string = decode_utf8 $string; instead.
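The last point can be sketched as follows. This is a minimal illustration, not code from the thread: the byte string below is a made-up stand-in for what a form field might deliver, and the idea is simply to decode once on the way in and encode once on the way out instead of touching the flag.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical input: the word "café" as raw UTF-8 bytes,
# the way it might arrive from a request parameter.
my $bytes = "caf\xC3\xA9";

# Decode at the input boundary: bytes become a character string.
my $text = decode_utf8($bytes);

print length($bytes), "\n";   # 5 (bytes)
print length($text),  "\n";   # 4 (characters)

# Encode at the output boundary: characters become bytes again.
my $out = encode_utf8($text);
print $out eq $bytes ? "round trip ok\n" : "mismatch\n";
```

Between the two boundaries the program only ever sees characters, which is what functions like length, substr, and regexes expect.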

    Now - occasionally! (which it also does under simple cgi not fast cgi operation and seems to be connected to overall load) - my upload crashes with CGI.pm (version 3.33) throwing "Malformed utf8" in apache's error.log

    What does "occasionally" mean? Does it always die for the same set of data?

    And finally: does the data contain malformed UTF-8?

      At first, thanks for your reply!

      Did you check if CGI::Fast returns text strings (i.e. with UTF-8 flag set)?
      No, as I don't know offhand how to.

      use open ':utf8' only affects open (iirc), so it's useless unless you open files.
      Ok. But, hm, what about caches, like File::Cache? Should I turn on :utf8 for open when I use these, or will the module handle it internally?

      Do you use binmode STDOUT, ':utf8';?
      I did; actually I used binmode STDOUT, ":encoding(utf8)";. But then it seems to break CGI-Application-Plugin-CompressGzip! (more hair pulling...) Should I set it again? And should I also set binmode STDIN, ":encoding(utf8)" again, or is this redundant with my my $param_f = decode("utf8", $q->param("f")) procedure?

      Don't use Encode::_utf8_on($flagOn); - it's an internal method of the Encode module, and shouldn't be called from the outside. Use $string = decode_utf8 $string; instead.
      I know. But I inspected the returned strings (local $Data::Dumper::Useqq = 1;) and found out they were properly formed utf8, until they passed through the final stages of CGI::Application, which broke them again. So I tried various solutions and found out that yes, the string was proper utf8 but without the flag on. When I switched it on manually, C::A left the strings alone and seemed to pass them through to the browser stage. (You see, I am deeply entangled in trouble...)

      update: I now use decode_utf8 and it works just as well.

      Regarding testdata
      I tested with a 30K tar.gz, a 1K perl script and a 500K mp3 file - each the same problem, they sometimes come through..
        You can check if the UTF-8 flag is set with Devel::Peek, which can dump all the internal flags of a scalar variable.

        There is also utf8::is_utf8, but I somehow suspect that the results might be subtly different (not sure yet, haven't really tried. Perhaps the difference that I thought I had noticed came from somewhere else.)
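For debugging only, the two inspection tools mentioned above could be used roughly like this. It is a sketch with a made-up sample string; Dump writes to STDERR, and the line to look for in its output is FLAGS.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use Devel::Peek qw(Dump);

# Hypothetical sample: "café" as raw UTF-8 bytes, then decoded.
my $bytes = "caf\xC3\xA9";
my $text  = decode_utf8($bytes);

# utf8::is_utf8 reports the internal flag; it is always available,
# no "use utf8" required.
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $text_flag  = utf8::is_utf8($text)  ? 1 : 0;
print "bytes flagged: $bytes_flag\n";   # 0
print "text flagged:  $text_flag\n";    # 1

# Full internal dump of the scalar, printed to STDERR.
Dump($text);
```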

      Did you check if CGI::Fast returns text strings (i.e. with UTF-8 flag set)?

      Please note that the absence of the UTF8 flag does not mean that the string is not a text string.

      Checking whether the UTF8 flag is set should be done only by people who know about Perl's Unicode internals. For everyone else, it will only add to the confusion.

      Properly decode and encode, and all will be fine. (Though you may need an occasional utf8::upgrade, see also Unicode::Semantics.)
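A small sketch of why the flag is an implementation detail rather than a marker of "textness": the string below is a one-character text string without the flag, and utf8::upgrade changes only how it is stored, not what it contains. The variable names are illustrative.

```perl
use strict;
use warnings;

# One character, U+00E9, stored internally as a single byte (no UTF8 flag).
my $plain = "\xE9";

# Same character, but force the internal UTF-8 representation.
my $upgraded = $plain;
utf8::upgrade($upgraded);

my $plain_flag    = utf8::is_utf8($plain)    ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;
print "flags: $plain_flag / $upgraded_flag\n";          # flags: 0 / 1

# The two strings still hold the same single character.
print $plain eq $upgraded ? "equal\n" : "different\n";  # equal
print length($upgraded), "\n";                          # 1
```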

Re: Getting mad with CGI::Application and utf8
by isync (Hermit) on Feb 26, 2008 at 11:58 UTC
    Is there someone out there who has a complete script/recipe which gets CGI::Application working with utf8?

    I found this excellent post by Andrew which addresses the problem, but I don't have the module writing/namespace mapping skills to get his modules to work with CGI::Application.
    I tried... But it breaks too many things. C::A::Plugin::DBH uses DBI in an abstraction layer, so I would have to edit/write my own C::A::Plugin::DBH. Then, I use CGI::Fast which is an abstraction on top of CGI.pm, so I also need a new CGI::Fast, which I tried but had no luck... So many things...
Re: Getting mad with CGI::Application and utf8
by Juerd (Abbot) on Feb 26, 2008 at 20:43 UTC

    moritz already pointed out that you shouldn't do _utf8_on. The :utf8 I/O layer does the same thing and should also not be used. Using _utf8_on or :utf8 may result in serious malfunction and security holes.
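One way to avoid trusting unchecked input is to decode strictly and reject anything malformed. The sketch below is an assumption about what a safer alternative could look like, not something spelled out in this post: decode_strict is a hypothetical helper built on Encode's decode with the FB_CROAK check. For filehandles, the validating counterpart of :utf8 would be a layer such as ':encoding(UTF-8)'.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded text, or undef if the
# input is not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text;        # undef when decode() died
}

my $good = decode_strict("caf\xC3\xA9");   # valid UTF-8
my $bad  = decode_strict("caf\xFFe");      # 0xFF never occurs in UTF-8

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

Rejecting at the boundary means malformed data never reaches the rest of the application, which is the failure mode that merely flipping the flag cannot prevent.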

      There is another approach, one I use from time to time: don't use Perl's broken UTF-8 support. Pass the argument -C0 to perl at the head of your script, and perl will treat all strings as raw binary. As long as you only break your strings on well-defined boundaries and don't attempt to naively count characters, it will work fine. If you need to truncate a string at an arbitrary point, such as "100 bytes" or so, you will have to make sure it stops at a well-defined UTF-8 boundary.
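The truncation caveat could be handled along these lines. This is a sketch under two assumptions: the input is a byte string that already holds valid UTF-8, and truncate_utf8_bytes is a made-up helper name. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a UTF-8 byte string to at most $max bytes
# without splitting a multi-byte sequence.
sub truncate_utf8_bytes {
    my ($bytes, $max) = @_;
    return $bytes if length($bytes) <= $max;

    my $cut = $max;
    # If the first byte after the cut is a continuation byte (0x80..0xBF),
    # the cut falls inside a sequence, so move it back to the lead byte.
    $cut-- while $cut > 0
        && (ord(substr($bytes, $cut, 1)) & 0xC0) == 0x80;

    return substr($bytes, 0, $cut);
}

my $raw = "caf\xC3\xA9!";                  # "café!" as 6 raw bytes
my $short = truncate_utf8_bytes($raw, 4);  # a cut at 4 would split the "é"
print length($short), "\n";                # 3
```

Cutting at 5 bytes instead keeps the whole "é", because the byte after the cut is the ASCII "!".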