jbrugger has asked for the wisdom of the Perl Monks concerning the following question:

OK, now I actually might have found a bug :)

Some time ago, we changed our entire system to use UTF-8 encoding, but now we are running into the following problem:

The first time a call is made to the server, a string is displayed correctly, but the second time, you get weird UTF-8 characters on the screen that only display properly if you switch your browser's encoding back to latin1. It seems to be related to an issue with APR::Table, which mod_perl might use to store its data: see this link for possible details.

To work around the problem, we now set Apache to use:
PerlResponseHandler ModPerl::PerlRun
instead of the preferred
PerlResponseHandler ModPerl::Registry
The bug no longer occurs, but most of the advantage of using mod_perl is lost.

Has anyone seen the same issue and found an acceptable solution?


"We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise." - Larry Wall.

Replies are listed 'Best First'.
Re: utf8 problems
by clinton (Priest) on Jun 01, 2006 at 13:07 UTC
    The problem is indeed related to APR::Table, and is as follows:

    • APR::Table is a Perl interface to the underlying C code in Apache.
    • The UTF8 flag (see perlunicode), which marks a string as being UTF-8 encoded, is internal to Perl.
    • There is no space in the C code to store the flag.
    So you generate the string with the UTF8 flag set correctly, put it into APR::Table, and it loses its flag. When you retrieve it, you receive the individual bytes rather than the multibyte characters.

    The solution to this is:

    use Encode;
    my $bytes      = get_data_from_APR_TABLE();
    my $characters = decode('utf8', $bytes);

    This way, you reset the flag and all will be happy again.

    The catch is that you need to know which strings are UTF-8 before you decode them. I find it easier to make sure that everything that comes into my app gets converted to UTF-8, so I'm only ever dealing with one character set internally.
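
A minimal sketch of the round trip described above, using only Encode. It does not touch APR::Table; instead it assumes that encode('UTF-8', ...) is a fair stand-in for what a C-level store hands back (plain octets with no flag). The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string: "n", two e-acute characters (U+00E9), "d".
my $characters = "n\x{e9}\x{e9}d";

# Assumption: this simulates what comes back from a C-level store,
# i.e. the UTF-8 octets without Perl's internal UTF8 flag.
my $bytes = encode('UTF-8', $characters);

# Decoding turns the octets back into a character string.
my $decoded = decode('UTF-8', $bytes);

printf "bytes:   length %d, UTF8 flag %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';
printf "decoded: length %d, UTF8 flag %s\n",
    length($decoded), utf8::is_utf8($decoded) ? 'on' : 'off';
```

The byte string reports six elements and the decoded string four, which is the difference between "individual bytes" and "multibyte characters" in the explanation above.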

Re: utf8 problems
by bpphillips (Friar) on Jun 01, 2006 at 12:40 UTC
    The description of your problem is a little vague. Naturally, a little of the actual code that demonstrates the problem would be very helpful. Specifically, I'd be interested in knowing:
    - What happens in your code that qualifies an HTTP request as "the first time a call is made" vs. "the second time"?
    - Where does said "string" come from, and are you explicitly storing it somewhere between the "first" and "second" calls?
    - What character encoding are you sending to the browser as part of the Content-type HTTP header?
    - What do you mean when you say that "utf-8 chars on the screen ... only show properly if you put your browser decoding back to latin1"? (More specifically, how do you know they're UTF-8 characters if they display correctly when you tell your browser they're latin1?)
    - Have you tried using Encode::_utf8_on() or Encode::decode_utf8() to convert the data from a bytestring to the UTF-8 data that bytestring represents?
    - Have you compared Devel::Peek::Dump() output on the string from the "first" and "second" calls to see what is different about the underlying data?
    - UPDATE: Are you explicitly storing/retrieving something in an APR::Table, or just assuming that mod_perl is using it in some way behind the scenes?
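
As a rough illustration of the Devel::Peek comparison suggested in the list above, here is a small sketch. The sample data is made up (it is not the poster's string), and the exact Dump() layout varies between Perl versions, so treat the comments as a guide to what to look for rather than exact output.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

# Hypothetical sample data: the UTF-8 octets for "n", two e-acute, "d".
my $bytes      = "n\xC3\xA9\xC3\xA9d";
my $characters = decode('UTF-8', $bytes);

# Dump() writes to STDERR. Compare the FLAGS line of the two dumps:
# the decoded string should list UTF8 there, the raw octets should not.
Dump($bytes);
Dump($characters);

# The same check without reading a dump.
print "bytes flagged:      ", (utf8::is_utf8($bytes)      ? "yes" : "no"), "\n";
print "characters flagged: ", (utf8::is_utf8($characters) ? "yes" : "no"), "\n";
```

Running the same two Dump() calls on the real string during the first and second request would show whether the flag or the underlying bytes changed between calls.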

    Hopefully those items will help us all figure out what's causing your problem!

    -- Brian
Re: utf8 problems
by jbrugger (Parson) on Jun 02, 2006 at 12:44 UTC
    Thanks for both answers. Yes, the problem is vague, and hard to describe...
    We have a string and switch its internal UTF8 flag on:
    # My example may be any string from a database, user input, etc.
    my $example = "just a string that i nd to be utf8 encoded, but can't see it in the chars, so guess::encode won't work";
    Encode::_utf8_on($example);
    # ...
    use open ':utf8';
    use open ':std';
    # ...
    my $cgi = CGI->new;
    print $cgi->header(
        -type    => 'text/html',
        -expires => '-1d',
        -cookie  => [$cookie],
        -charset => 'UTF-8',
    );
    print $example;
    Now, the first time it's called under mod_perl with ModPerl::Registry, it prints:
    just a string that i nd to be utf8 encoded, but can't see it in the chars, so guess::encode won't work
    but the second time:
    just a string that i nééd to be utf8 encoded, but can't see it in the chars, so guess::encode won't work
    This is weird, and I have no control over how mod_perl internally stores its values.

    "We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise." - Larry Wall.
      Two things you should check to make this example work the way you intend:

      - Is your file UTF-8 encoded? (I usually use the *NIX file command or check VI's :set fileencoding to verify this -- although there may be other ways to do this.)
      - Do you have a use utf8 at the beginning of your script?

      Whenever you're using UTF-8 content within the body of your script (as you're doing in your example, at least), you need to tell Perl to use character semantics rather than byte semantics on that data. This is accomplished by placing a use utf8 within the lexical scope in which you're using UTF-8 data. This also makes it unnecessary to perform the Encode::_utf8_on() operation.

      However, as noted in bold in the utf8 docs: "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8". If you're retrieving data from a GET/POST parameter or from a database, it's a different story.
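
To make the effect of use utf8 on literals concrete, here is a small sketch. It assumes the script file itself is saved as UTF-8; the sample word and variable names are made up for illustration.

```perl
use strict;
use warnings;

my ($len_without, $len_with);

{
    # Without the pragma, the literal is read as the raw bytes in the file.
    no utf8;
    $len_without = length("nééd");
}

{
    # With the pragma, the same literal is read as characters.
    use utf8;
    $len_with = length("nééd");
}

# Assuming this file is saved as UTF-8, each e-acute is two bytes,
# so the first count is 6 and the second is 4.
print "without use utf8: $len_without\n";
print "with use utf8:    $len_with\n";
```

This only covers literals in the source file. As the quoted documentation says, data arriving from a request parameter or a database still has to be decoded explicitly.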