PerlMonks  

CGI::Application - Which is the proper way of handling and outputting utf8

by isync (Hermit)
on Nov 17, 2007 at 10:43 UTC (#651403=perlquestion)
isync has asked for the wisdom of the Perl Monks concerning the following question:

Currently I use
use Encode; use utf8;
at the very start of my MyWebApp.pm module, and a bit later I add $self->header_add(-charset => 'utf-8'); in the setup() subroutine. And it works...

But is this the correct way of doing it? So far I haven't handled form input data, etc. (which should not arrive in any other encoding, as I serve utf-8 pages anyway!?).

I have seen different hacks to output utf8 with CGI.pm and utf8 (here and here).

Now, any insight? What is the correct way of declaring the CGI.pm object's content as utf8 and outputting it as such? Is there a CGI::Application switch to set, so that it handles the contents as utf8 and sets the header correctly?

Re: CGI::Application - Which is the proper way of handling and outputting utf8
by Juerd (Abbot) on Nov 17, 2007 at 11:18 UTC

    CGI works with STDIN and STDOUT, so that is easy:

    binmode STDIN, ":encoding(utf8)";
    binmode STDOUT, ":encoding(utf8)";

    However, $ENV{QUERY_STRING} might also need this treatment, but binmode cannot be applied to environment variables. This means you have to hack around it a little:

    my $cgi = $ENV{REQUEST_METHOD} eq 'GET'
        ? CGI->new(decode_utf8 $ENV{QUERY_STRING})
        : CGI->new;

    (all code untested)
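What an encoding layer does can be checked without a web server. The following is only a minimal sketch: it attaches the layer to an in-memory filehandle instead of STDOUT, uses the strict ":encoding(UTF-8)" spelling rather than the ":encoding(utf8)" used above, and the helper name bytes_written is made up for the illustration.

```perl
use strict;
use warnings;

# Write one character string through an :encoding(UTF-8) layer into an
# in-memory buffer and return the raw bytes that came out the other side.
sub bytes_written {
    my ($text) = @_;
    my $buffer = '';
    open my $out, '>:encoding(UTF-8)', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$out} $text;
    close $out or die "close failed: $!";
    return $buffer;
}

my $text  = "caf\x{e9}";    # 4 characters, the last one is e-acute
my $bytes = bytes_written($text);

# The layer turns the single character U+00E9 into the two bytes C3 A9.
printf "characters in: %d, bytes out: %d\n", length $text, length $bytes;
```

With binmode STDOUT the same conversion would be applied to everything the application prints.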

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      I don't think this will work because QUERY_STRING is %-encoded. If C::A does not perform character decoding itself, you'll have to post-process each parameter that C::A creates with the appropriate decode (e.g. decode_utf8).

        Good catch. Indeed passing a utf8_decoded query string to CGI->new makes no sense, and you need to decode each thing individually. Time for an encoding aware CGI.pm!
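The ordering problem discussed here can be sketched in a few lines. This is an illustration only, not what CGI.pm does internally: the helper decode_param is hypothetical, and the %-unescaping is done by hand so that the example depends on nothing but the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo the %-encoding first, then decode the
# resulting UTF-8 bytes into a Perl character string.
sub decode_param {
    my ($raw) = @_;
    (my $bytes = $raw) =~ tr/+/ /;                  # '+' means space in form data
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # %C3%A9 becomes two bytes
    return decode('UTF-8', $bytes);
}

my $raw = 'caf%C3%A9';

# Decoding before unescaping changes nothing: the raw value is plain ASCII.
my $too_early = decode('UTF-8', $raw);

# Unescaping first and decoding second yields the intended 4 characters.
my $value = decode_param($raw);

printf "raw: %d chars, decoded: %d chars\n", length $raw, length $value;
```

This is why the decode step has to happen per parameter, after the query string has been split and unescaped.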

      Maybe better :raw:encoding(utf8)

        Why?

Re: CGI::Application - Which is the proper way of handling and outputting utf8
by rhesa (Vicar) on Nov 19, 2007 at 03:35 UTC
      Hi rhesa, Thank you for posting this. I'm using your code in my app with two efficiency tweaks/changes:
      1. check for the 'setting a param' case first
      2. use utf8::decode() instead of decode_utf8(), since due to a bug in Encode, decode_utf8() always sets the UTF8 flag, even for ASCII-only text. utf8::decode() doesn't set the UTF8 flag for this case, so the faster ASCII semantics can be used where possible. (Based on ikegami's comment below, maybe I should say "where safe" instead of "where possible"). See Behaviour of Encode::decode_utf8 on ASCII
      package CGI::as_utf8;    # add UTF-8 decode capability to CGI.pm

      BEGIN {
          use strict;
          use warnings;
          use CGI 3.47;    # earlier versions have a UTF-8 double-decoding bug
          {
              no warnings 'redefine';
              my $param_org = \&CGI::param;

              my $might_decode = sub {
                  my $p = shift;
                  # make sure upload() filehandles are not modified
                  return $p if !$p || ( ref $p && fileno($p) );
                  utf8::decode($p);    # may fail, but only logs an error
                  $p
              };

              *CGI::param = sub {
                  # setting a param goes through the original interface
                  goto &$param_org if scalar @_ != 2;
                  my $q = $_[0];    # assume object calls always
                  my $p = $_[1];
                  return wantarray
                      ? map { $might_decode->($_) } $q->$param_org($p)
                      : $might_decode->( $q->$param_org($p) );
              }
          }
      }
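How utf8::decode() treats the UTF8 flag can be observed directly. The sketch below covers only the utf8::decode() side of the comparison; the helper try_decode and the sample inputs are invented for the illustration, and nothing is claimed here about what a given Encode version does.

```perl
use strict;
use warnings;

# Decode a copy of the given bytes in place and report what happened.
sub try_decode {
    my ($bytes) = @_;
    my $ok = utf8::decode($bytes);    # true on success, false on malformed input
    return {
        ok      => $ok ? 1 : 0,
        flagged => utf8::is_utf8($bytes) ? 1 : 0,
        length  => length $bytes,
    };
}

my $ascii   = try_decode("abc");            # ASCII only
my $accent  = try_decode("caf\xC3\xA9");    # valid UTF-8, ends in e-acute
my $invalid = try_decode("\xFF\xFE");       # not valid UTF-8

for my $case ( [ ascii => $ascii ], [ accent => $accent ], [ invalid => $invalid ] ) {
    my ( $name, $r ) = @$case;
    printf "%-8s ok=%d flagged=%d length=%d\n",
        $name, @{$r}{qw(ok flagged length)};
}
```

The ASCII-only input comes back unflagged, the accented input comes back flagged with one character fewer than it had bytes, and the malformed input is reported through the return value while the string stays as it was.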

        so the faster ASCII semantics can be used where possible.

        Almost. The UTF8=1 format is still unnecessarily used if the string is "É", for example. You'd have to include the following after decoding if you wanted to always use the UTF8=0 format when possible.

        utf8::downgrade($p, 1);

        It's safer not to do that, though, as it affects \w*, uc()*, buggy XS, etc.

        * — \w and uc() are unaffected when using use 5.012; or use feature qw( unicode_strings );.
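The effect of utf8::downgrade() and the \w caveat can be sketched as follows. The variable names are made up, and the block switches the unicode_strings feature off explicitly so that the legacy behaviour described above is what gets shown.

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # make the legacy semantics explicit

my $e_acute = "\xC3\x89";        # UTF-8 bytes for U+00C9
utf8::decode($e_acute);          # now one character, UTF8 flag on
my $flag_before = utf8::is_utf8($e_acute) ? 1 : 0;

utf8::downgrade($e_acute, 1);    # second argument: do not die on failure
my $flag_after = utf8::is_utf8($e_acute) ? 1 : 0;

# A character above U+00FF cannot be stored in the UTF8=0 format,
# so downgrade() returns false and leaves the string alone.
my $smiley     = "\x{263A}";
my $downgraded = utf8::downgrade($smiley, 1) ? 1 : 0;

# Same character, two internal formats: \w gives different answers.
my $plain    = "\xC9";
my $upgraded = "\xC9";
utf8::upgrade($upgraded);
my $w_plain    = $plain    =~ /\w/ ? 1 : 0;
my $w_upgraded = $upgraded =~ /\w/ ? 1 : 0;

print "flag before=$flag_before after=$flag_after\n";
print "smiley downgraded=$downgraded\n";
print "\\w matches: UTF8=0 -> $w_plain, UTF8=1 -> $w_upgraded\n";
```

The last line is the reason downgrading is described as less safe: the string still holds the same character, but code that relies on \w or uc() may behave differently afterwards.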

Re: CGI::Application - Which is the proper way of handling and outputting utf8
by isync (Hermit) on Nov 19, 2007 at 10:56 UTC
    OK, now I have modified my application to start with:

    use utf8; # to state that the script itself is in utf8
    binmode STDIN, ":encoding(utf8)";
    binmode STDOUT, ":encoding(utf8)";
    use as_utf8; # which is the hack posted at http://www.perlmonks.org/?node_id=651574

    in setup():
    $self->header_add(-charset => 'utf-8');

    Any final comments/thoughts?
      With the 'as_utf8' hack, you do not need binmode STDIN, ":encoding(utf8)";. In fact, it will cause problems if you have a binary/file upload field in a web form, since it will try to UTF-8 decode the incoming binary file/data. The as_utf8 hack correctly only decodes text form field data.

      Instead of binmode STDOUT, ":encoding(utf8)";, you can put the following in cgiapp_postrun():

      sub cgiapp_postrun {    # overrides
          my ($self, $output_ref) = @_;
          utf8::encode( $$output_ref ) if utf8::is_utf8($$output_ref)
      }

      This UTF-8 encodes the output only if it needs encoding, i.e., only if the string that $output_ref points to has the UTF8 flag set (in this setup, that means it contains non-ASCII characters).
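The same check can be tried outside CGI::Application. This is a stand-alone sketch under that assumption: the function name encode_if_flagged and the sample page are invented, and only the two utf8:: calls are taken from the snippet above.

```perl
use strict;
use warnings;

# Encode in place only when the string is flagged as characters.
sub encode_if_flagged {
    my ($output_ref) = @_;
    utf8::encode($$output_ref) if utf8::is_utf8($$output_ref);
    return;
}

my $page = "<p>caf\xC3\xA9</p>";    # UTF-8 bytes as they might arrive
utf8::decode($page);                # characters, UTF8 flag on
my $chars = length $page;

encode_if_flagged(\$page);          # back to bytes, flag off
my $bytes_once = length $page;

encode_if_flagged(\$page);          # flag is off now, so nothing happens
my $bytes_twice = length $page;

print "chars=$chars bytes=$bytes_once after second call=$bytes_twice\n";
```

The second call shows that the flag check guards against double encoding. One limitation to keep in mind: a string that holds non-ASCII characters in the UTF8=0 format is not flagged, so this check would send it out unencoded.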
