http://www.perlmonks.org?node_id=814132


in reply to mod_perl2 and utf8

It all comes down to:

You can't output characters. You can only output bytes. If you want to output characters, you'll need to encode them somehow.

You didn't do that.
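To make the character/byte distinction concrete, here is a minimal sketch using an illustrative string (not the original $strA or $strB). The same text has one length as characters and another once encoded as UTF-8 bytes:

```perl
use strict;
use warnings;
use Encode ();

# Illustrative string: "caf", e-acute (U+00E9), a space, and U+263A.
my $chars = "caf\x{E9} \x{263A}";

# encode_utf8 turns a string of characters into a string of UTF-8 bytes.
my $bytes = Encode::encode_utf8($chars);

printf "characters: %d\n", length $chars;
printf "bytes:      %d\n", length $bytes;
```

Only $bytes is something an output function that deals in bytes can send unambiguously.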

If I have a mod_perl2 handler which just sends text/plain utf8 content and send those two strings via $r->print then I see different results in the browser (strA doesn't render correctly). Note that $r->binmode seems to do nothing.

You've shown that $r->print expects a string of bytes, just like the builtin print. If you want to output characters, you need to encode them first, either manually or by telling the handle to do it for you (such as by using the PerlIO layer :utf8 or :encoding on a file handle).

The only reason $strB works is that $r->print does the best it can with an invalid input. You should get "Wide character" warnings alerting you to that fact.
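The builtin print shows the same "best it can" behaviour and can be used to see the warning outside Apache. This is a sketch that uses an in-memory file handle with no encoding layer as a stand-in for a byte-only sink; it says nothing about how $r->print is implemented:

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# In-memory handle with no :utf8 or :encoding layer.
open my $fh, '>', \my $buf or die "open: $!";

# U+263A cannot be represented as a single byte, so print has to guess.
print {$fh} "\x{263A}";
close $fh or die "close: $!";

print "warning seen: $warnings[0]" if @warnings;
```

The warning is the signal that characters were handed to something that wanted bytes.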

You said binmode doesn't work on $r, so that leaves you with the option of doing it manually.

Fix:

$r->print($strA); # XXX
$r->print($strB); # XXX
should be
$r->print(Encode::encode_utf8($strA));
$r->print(Encode::encode_utf8($strB));
or
utf8::encode( my $strA_utf8 = $strA );
utf8::encode( my $strB_utf8 = $strB );
$r->print($strA_utf8);
$r->print($strB_utf8);

Update: Adjusted phrasing

Re^2: mod_perl2 and utf8
by jbert (Priest) on Dec 23, 2009 at 18:18 UTC
    I would expect $r->print to accept a string of bytes just like every other print.

    The posted code shows that printing to STDOUT in a command-line script gives the same result for both strings, because non-mod_perl2 STDOUT has an associated encoding (latin1 by default, changeable to utf8); either will work if it matches your terminal.

    Printing to STDOUT (or using $r->print) under apache does not have this property - you get different behaviour for the two approaches.

    i.e. the problem as I see it is that STDOUT under mod_perl2 lacks the utf8-awareness built through the rest of the perl I/O layer, with no way of enabling it.

    Yes, you can manually encode, but that can necessitate additional copying. You can pass $r as an output destination to TT, which will do the wrong thing if you're working with unicode strings, since it won't call encode. Yes, you can build to a scalar, encode that and print that, but it's a shame the extra copy is needed when perl has a mechanism for this which isn't being used.
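One possible way around building the whole document in a scalar, sketched under assumptions: Template Toolkit accepts any object with a print() method as its output destination, so a small wrapper could encode each chunk before forwarding it. EncodingPrinter and FakeRequest are hypothetical names invented for this sketch; FakeRequest only stands in for $r so the code runs outside Apache.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: encodes character strings to UTF-8 bytes and
# forwards them to any object that has a print() method (such as $r).
package EncodingPrinter;

sub new {
    my ($class, $target) = @_;
    return bless { target => $target }, $class;
}

sub print {
    my ($self, @chunks) = @_;
    return $self->{target}->print(map { Encode::encode_utf8($_) } @chunks);
}

# Stand-in for the request object; it just accumulates the bytes it is given.
package FakeRequest;

sub new   { return bless { bytes => '' }, shift }
sub print { my $self = shift; $self->{bytes} .= join '', @_; return 1 }

package main;

my $r   = FakeRequest->new;
my $out = EncodingPrinter->new($r);

# Characters go in; UTF-8 bytes reach the wrapped object.
$out->print("caf\x{E9} ", "\x{263A}");

printf "bytes received: %d\n", length $r->{bytes};
```

Under mod_perl2 the wrapper would be constructed around the real $r and passed where $r was passed before. Each chunk is still copied when it is encoded, so this reduces the size of the copies rather than eliminating them.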

      I would expect $r->print to accept a string of bytes just like every other print.

      The posted code shows that printing to STDOUT in a cmdline script gives the same result for both strings

      I followed up by saying you can instruct print to accept characters by telling it how to handle them. This is done on a per-handle basis, and that's what you did for STDOUT with

      binmode STDOUT, ':utf8';

      You need to do something equivalent with mod_perl's object.
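For comparison, this is what the per-handle instruction looks like on an ordinary file handle. It is a sketch using an in-memory handle and the :encoding(UTF-8) layer; whether mod_perl's request object offers anything equivalent is exactly the open question here.

```perl
use strict;
use warnings;

# The layer tells print how to turn characters into bytes for this handle.
open my $fh, '>:encoding(UTF-8)', \my $buf or die "open: $!";

# Characters are passed to print; the layer encodes them on the way out.
print {$fh} "caf\x{E9} \x{263A}";
close $fh or die "close: $!";

printf "bytes written: %d\n", length $buf;
```

No manual encode call is needed, and no "Wide character" warning is produced, because the handle has been told what to do with characters.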

      the problem as I see it is that STDOUT under mod_perl2 lacks the utf8-awareness built through the rest of the perl I/O layer, with no way of enabling it.

      Not knowing anything about the class except what you've told me, I agree. File a bug report.