
CGI::Application - Which is the proper way of handling and outputting utf8

by isync (Hermit)
on Nov 17, 2007 at 10:43 UTC (#651403, perlquestion)

isync has asked for the wisdom of the Perl Monks concerning the following question:

Currently I use
use Encode; use utf8;
at the very start of my module, and a bit later I add $self->header_add(-charset => 'utf-8'); in the setup() subroutine. And it works...

But is this the correct way of doing it? So far I haven't handled form input data, etc. (which should not arrive in any other encoding, as I serve utf-8 pages anyway!?).

I have seen different hacks to output utf8 with and utf8 (here and here).

Now, any insight? What is the correct way of declaring the object's content as utf8 and outputting it as such? Is there a CGI::Application switch to set, so that it handles the contents as utf8 and sets the header correctly?

Replies are listed 'Best First'.
Re: CGI::Application - Which is the proper way of handling and outputting utf8
by Juerd (Abbot) on Nov 17, 2007 at 11:18 UTC

    CGI works with STDIN and STDOUT, so that is easy:

       binmode STDIN,  ":encoding(utf8)";
       binmode STDOUT, ":encoding(utf8)";

    However, $ENV{QUERY_STRING} might also need this treatment, but environment variables can't. This means you have to hack around it a little:

       my $cgi = $ENV{REQUEST_METHOD} eq 'GET'
           ? CGI->new(decode_utf8 $ENV{QUERY_STRING})
           : CGI->new;

    (all code untested)

    Juerd # { site => '', do_not_use => 'spamtrap', perl6_server => 'feather' }

      I don't think this will work because QUERY_STRING is %-encoded. If C::A does not perform character decoding itself, you'll have to post-process each parameter that C::A creates with the appropriate decode (e.g. decode_utf8).

        Good catch. Indeed passing a utf8_decoded query string to CGI->new makes no sense, and you need to decode each thing individually. Time for an encoding aware!
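The ordering problem discussed above can be seen with core modules alone. The following is a minimal sketch, not CGI.pm's actual code: the query string is a made-up example, and the hand-written %-unescaping only stands in for the step CGI.pm normally performs itself.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical request: the form value "É" sent as UTF-8 and %-encoded.
my $query_string = 'name=%C3%89';

# Decoding the raw query string is a no-op: it contains only ASCII.
my $too_early = decode_utf8($query_string);

# Undo the %-encoding first (CGI.pm normally does this step itself) ...
my ($key, $value) = split /=/, $query_string, 2;
$value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # now two raw bytes

# ... and only then turn the UTF-8 bytes into one Perl character.
my $chars = decode_utf8($value);

printf "bytes: %d, characters: %d\n", length($value), length($chars);
```

Decoding has to happen per parameter value, after unescaping, which is why the later replies in this thread hook into param() rather than into the query string.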

      Maybe better :raw:encoding(utf8)


Re: CGI::Application - Which is the proper way of handling and outputting utf8
by davebaker (Pilgrim) on Oct 13, 2020 at 02:24 UTC

    This is just a follow-up to this decade-old thread -- I was able to get my scripts to work well with UTF-8 input by using the current version of CGI.pm, which can automatically decode the incoming "param" data (assuming it was encoded as UTF-8 when it was sent to the script) when it is loaded with the '-utf8' pragma. So, instead of

       use CGI;

    one uses:

       use CGI ('-utf8');

    That pragma seems to have eliminated the need for the "as_utf8" modification discussed in this thread long ago, for scripts using CGI::Application. I have quite a few CGI::Application scripts still running, so I needed a way to pass the -utf8 pragma to CGI.pm (used internally by CGI::Application) without changing the CGI::Application module's code. The solution was to add the following subroutine to each application module. It overrides the cgiapp_get_query method in the CGI::Application parent.

       sub cgiapp_get_query {
           my $self = shift;
           use CGI ('-utf8');
           my $q = CGI->new;
           $q->charset('UTF-8');
           return $q;
       }

    The "$q->charset('UTF-8')" line is another matter. It isn't part of the automatic decoding of the param data. It's affecting the output. It causes CGI::Application to modify the header that's automatically generated at the end of a runmode just before the content is displayed, i.e., it becomes Content-type: text/html; charset=UTF-8

    I don't believe the charset method was part of CGI.pm back in the day, so maybe the "$q->charset('UTF-8')" doesn't do anything that isn't alternatively done by the technique in the code submitted by the OP, namely including this line in the "sub setup { ... }" in the application module:

       $self->header_add( -charset => 'utf-8' );

    To print a UTF-8 encoded web page requires that I actually encode the native Perl character format into UTF-8 by having this line be somewhere towards the top of the application module (i.e., not inside a subroutine):

       binmode STDOUT, ':encoding(UTF-8)';

    Or, instead of a binmode statement I can add

       use Encode;

    to the application module and then say, at the end of each runmode:

       return Encode::encode( 'UTF-8', $template->output );

    rather than merely saying

       return $template->output;
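The two output options above produce the same bytes, and should be treated as alternatives. A small sketch, assuming a made-up page string and using an in-memory filehandle as a stand-in for STDOUT, also shows what happens if both are applied at once:

```perl
use strict;
use warnings;
use Encode ();

my $page = "caf\x{E9}";   # hypothetical template output: 4 characters

# Option 1: encode explicitly, as the runmode's return value would be.
my $bytes = Encode::encode('UTF-8', $page);

# Option 2: let an :encoding layer do it on the way out.
my $captured = '';
open my $out, '>:encoding(UTF-8)', \$captured or die "open: $!";
print {$out} $page;
close $out;

# Mistake: applying both encodes the text twice.
my $doubled = '';
open my $out2, '>:encoding(UTF-8)', \$doubled or die "open: $!";
print {$out2} $bytes;
close $out2;

printf "explicit: %d bytes, layer: %d bytes, both: %d bytes\n",
    length($bytes), length($captured), length($doubled);
```

Whichever option is chosen, it is safest to pick exactly one place where encoding happens.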

    The "use utf8;" included in the code submitted by the OP isn't needed for the automatic decoding of the incoming param data or the encoding of the output.
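What "use utf8;" does affect is how string literals in the source file are read. A short illustration, assuming this snippet itself is saved as UTF-8:

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;            # literals in this block are decoded from the UTF-8 source
    $with = length("É");
}
{
    no utf8;             # literals in this block stay as raw source bytes
    $without = length("É");
}

printf "with use utf8: %d, without: %d\n", $with, $without;
```

So the pragma matters only when the module's own source contains non-ASCII literals; it plays no part in decoding request parameters or encoding the response.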

Re: CGI::Application - Which is the proper way of handling and outputting utf8
by rhesa (Vicar) on Nov 19, 2007 at 03:35 UTC
      Hi rhesa, Thank you for posting this. I'm using your code in my app with two efficiency tweaks/changes:
      1. check for the 'setting a param' case first
      2. use utf8::decode() instead of decode_utf8(), since due to a bug in Encode, decode_utf8() always sets the UTF8 flag, even for ASCII-only text. utf8::decode() doesn't set the UTF8 flag for this case, so the faster ASCII semantics can be used where possible. (Based on ikegami's comment below, maybe I should say "where safe" instead of "where possible"). See Behaviour of Encode::decode_utf8 on ASCII
      package CGI::as_utf8;  # add UTF-8 decode capability to CGI.pm

      BEGIN {
          use strict;
          use warnings;
          use CGI 3.47;  # earlier versions have a UTF-8 double-decoding bug
          {
              no warnings 'redefine';
              my $param_org = \&CGI::param;

              my $might_decode = sub {
                  my $p = shift;
                  # make sure upload() filehandles are not modified
                  return $p if !$p || ( ref $p && fileno($p) );
                  utf8::decode($p);  # may fail, but only logs an error
                  $p
              };

              *CGI::param = sub {
                  # setting a param goes through the original interface
                  goto &$param_org if scalar @_ != 2;
                  my $q = $_[0];  # assume object calls always
                  my $p = $_[1];
                  return wantarray
                      ? map { $might_decode->($_) } $q->$param_org($p)
                      : $might_decode->( $q->$param_org($p) );
              }
          }
      }
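For reference, here is how the core function utf8::decode() behaves on the three kinds of input a form field might hold. This is a standalone sketch with made-up values, independent of CGI.pm:

```perl
use strict;
use warnings;

# ASCII-only input: decoding succeeds and the UTF8 flag stays off.
my $ascii    = "abc";
my $ok_ascii = utf8::decode($ascii);

# Valid UTF-8 bytes for "É": decoded in place to one character, flag on.
my $e_acute = "\xC3\x89";
my $ok_utf8 = utf8::decode($e_acute);

# A lone Latin-1 byte is not valid UTF-8: utf8::decode() returns false
# and leaves the string as it was.
my $broken    = "\xC9";
my $ok_broken = utf8::decode($broken);

printf "ascii flagged: %s, utf8 flagged: %s, broken decoded: %s\n",
    (utf8::is_utf8($ascii)   ? 'yes' : 'no'),
    (utf8::is_utf8($e_acute) ? 'yes' : 'no'),
    ($ok_broken              ? 'yes' : 'no');
```

Because failure is reported through the return value, a caller that cares about malformed input can check it instead of ignoring it.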

        so the faster ASCII semantics can be used where possible.

        Almost. The UTF8=1 format is still unnecessarily used if the string is "É", for example. You'd have to include the following after decoding if you wanted to always use the UTF8=0 format when possible.

        utf8::downgrade($p, 1);

        It's safer not to do that, though, as it affects \w*, uc()*, buggy XS, etc.

        * — \w and uc() are unaffected when using use 5.012; or use feature qw( unicode_strings );.
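The trade-off described above can be reproduced in a few lines. This sketch deliberately runs without "use 5.012" or the unicode_strings feature, so that the legacy semantics are visible; the sample strings are made up:

```perl
use strict;
use warnings;

my $p = "\xC3\x89";                    # UTF-8 bytes for "É"
utf8::decode($p);                      # one character, UTF8 flag on
my $flagged_copy = $p;                 # keeps the flag

utf8::downgrade($p, 1);                # second argument: do not die on failure
my $still_equal  = $p eq $flagged_copy;        # same character either way
my $flag_after   = utf8::is_utf8($p);          # flag is now off

# Without unicode_strings, \w depends on the internal format:
my $word_flagged    = $flagged_copy =~ /\w/ ? 1 : 0;
my $word_downgraded = $p            =~ /\w/ ? 1 : 0;

# Characters above 0xFF cannot be downgraded; the call just returns false.
my $snowman = "\x{2603}";
my $ok      = utf8::downgrade($snowman, 1);

printf "equal: %d, \\w flagged: %d, \\w downgraded: %d\n",
    ($still_equal ? 1 : 0), $word_flagged, $word_downgraded;
```

The two strings compare equal yet match \w differently, which is the kind of surprise that makes downgrading the less safe choice.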

Re: CGI::Application - Which is the proper way of handling and outputting utf8
by isync (Hermit) on Nov 19, 2007 at 10:56 UTC
    OK, now I have modified my application to start with:

    use utf8; # to state that the script itself is in utf8
    binmode STDIN, ":encoding(utf8)";
    binmode STDOUT, ":encoding(utf8)";
    use as_utf8; # which is the hack posted at

    in setup():
    $self->header_add(-charset => 'utf-8');

    Any final comments/thoughts?
      With the 'as_utf8' hack, you do not need binmode STDIN, ":encoding(utf8)";. In fact, it will cause problems if you have a binary/file upload field in a web form, since it will try to UTF-8 decode the incoming binary file/data. The as_utf8 hack correctly only decodes text form field data.

      Instead of binmode STDOUT, ":encoding(utf8)";, you can put the following in cgiapp_postrun():

         sub cgiapp_postrun {  # overrides
             my ($self, $output_ref) = @_;
             utf8::encode( $$output_ref ) if utf8::is_utf8($$output_ref)
         }

      This will only UTF-8 encode the output if it needs encoding -- i.e., only if the string referenced by $output_ref carries Perl's UTF8 flag, which it does once it contains decoded non-ASCII characters.
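The conditional encode can be tried outside CGI::Application. In this sketch, encode_if_flagged is a hypothetical helper that mirrors the body of the postrun hook, and the sample strings are made up. The third case is a caveat worth knowing: a Latin-1 string that never received the UTF8 flag is passed through untouched.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the body of the postrun hook.
sub encode_if_flagged {
    my ($output_ref) = @_;
    utf8::encode($$output_ref) if utf8::is_utf8($$output_ref);
    return;
}

my $wide   = "snowman \x{2603}";   # flagged: contains a character above 0xFF
my $ascii  = "plain";              # unflagged ASCII: already valid UTF-8
my $latin1 = "caf\x{E9}";          # unflagged, but the byte 0xE9 alone is not UTF-8

encode_if_flagged(\$_) for $wide, $ascii, $latin1;

printf "wide: %d bytes, ascii: %d bytes, latin1: %d bytes\n",
    length($wide), length($ascii), length($latin1);
```

With input decoded by utf8::decode(), as in the as_utf8 hack, non-ASCII text is always flagged, so the third case arises only for strings that come from elsewhere, such as non-ASCII literals written with \x escapes.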
