http://www.perlmonks.org?node_id=337508

linux454 has asked for the wisdom of the Perl Monks concerning the following question:

I'm having fits with some utf-8 stuff. This is probably due to the fact that I still haven't fully wrapped my head around the whole character encoding stuff. However here is my problem:

I have a CGI script generating an HTML page with the following meta tag: <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">. This causes my browser (Mozilla 1.6) to set its character encoding to UTF-8. When I submit a form (via method 'POST') with a text box containing an 'ö' (o with umlaut), it is encoded as %C3%B6, and the CGI has trouble decoding this to the proper character. When I change the character encoding in my browser to ISO-8859-1 and send it, the character is sent as %F6. What is the easiest way to handle these Unicode characters that have been URL encoded? I've tried looking at CGI.pm's decoding and it doesn't seem to work properly, and I'm at a loss as to where to go next. Even a shove in the right direction would be greatly appreciated.

Re: UTF-8 and URL encoding
by saintmike (Vicar) on Mar 18, 2004 at 00:43 UTC
    If the browser submits the form data in UTF-8 and tells the CGI script so, is it really CGI.pm's job to decode it transparently?

    I'd say, it's up to the application to take care of that. You could use

    use Text::Iconv;
    my $converter = Text::Iconv->new("utf-8", "iso-8859-1");
    my $decoded = $converter->convert(param('test'));

    to decode it if you know it's UTF-8 and it needs to be iso-8859.
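For readers without Text::Iconv installed, roughly the same conversion can be sketched with the core Encode module. This is only an illustration, not saintmike's code: the literal byte string stands in for whatever param('test') would return (an assumption), and from_to() rewrites its first argument in place.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Assumption: these are the raw UTF-8 bytes the form parameter holds for 'ö'.
my $value = "\xC3\xB6";

# Convert in place from UTF-8 octets to ISO-8859-1 octets.
from_to($value, "utf-8", "iso-8859-1");

# Show the resulting bytes in hex.
print unpack("H*", $value), "\n";    # prints "f6"
```

The single byte 0xF6 is the same value the browser sent as %F6 when it was switched to ISO-8859-1.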

Re: UTF-8 and URL encoding
by iguanodon (Priest) on Mar 18, 2004 at 02:37 UTC
    I have a CGI script generating an HTML page with the following meta tag: <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">. This causes my browser (Mozilla 1.6) to set its character encoding to UTF-8
    Does it? Have you checked in Mozilla to see that it really is using utf8 encoding? If your HTTP header is not sending utf8, then I don't believe the meta tag is doing what you expect. I'd set the encoding in the HTTP header by calling header(-charset=>'utf-8').

      Yes, it does. The rule is: GET and POST parameters are encoded in the charset of the page. Try this html page with your browser, alternatively with utf-8 and iso-8859-1:
      <head>
      <meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
      <!-- meta http-equiv="content-type" content="text/html; charset=utf-8" -->
      </head>
      <body>
      <script>
      document.write(location.search + "<br>");
      </script>
      <form>
      <input type="hidden" name="test" value="&auml;">
      <input type="submit">
      </form>
      </body>
        Right, but that's static HTML. Try this and look at what your browser thinks the encoding is:

        #!/usr/local/bin/perl
        use strict;
        use warnings;
        use CGI qw(:standard);

        print header();
        print <<HTML;
        <html>
        <head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8">
        </head>
        <body>
        <script>
        document.write(location.search + "<br>");
        </script>
        <form>
        <input type="hidden" name="test" value="&auml;">
        <input type="submit">
        </form>
        HTML
        print end_html();

        Unless you tell it not to, CGI.pm will set the encoding to iso-8859-1 in the HTTP header. In this case the meta tag has no effect, at least in recent versions of IE and Mozilla.
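A minimal sketch of the fix suggested earlier in this sub-thread, passing -charset to header() so that the HTTP header and the meta tag agree. It assumes CGI.pm is installed; the variable name $http_header is only there for illustration.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use CGI qw(:standard);

# Ask CGI.pm to declare UTF-8 in the HTTP Content-Type header itself,
# instead of relying on the meta tag inside the page.
my $http_header = header(-type => 'text/html', -charset => 'utf-8');
print $http_header;
```

With this in place of the bare header() call, the browser should be told UTF-8 before it ever parses the page.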

Re: UTF-8 and URL encoding
by iburrell (Chaplain) on Mar 17, 2004 at 22:02 UTC
    How does CGI.pm handle the URI-encoded characters? My impression is that it will decode them to bytes. The result is a byte string that contains UTF-8 encoded characters. If you want Unicode strings, which will have the same bytes internally but Unicode behavior, you can use the Encode module to convert them.
    use Encode qw(decode);
    my $octets = $cgi->param('text');
    my $string = decode("utf-8", $octets);
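The two steps described here (URL decoding to bytes, then UTF-8 decoding to characters) can be sketched without CGI.pm. The helper percent_decode is hypothetical and written only for this illustration; it approximates what a CGI library does with form data and is not CGI.pm's actual implementation.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: undo URL (percent) encoding, yielding raw bytes.
sub percent_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                                # '+' means space in form data
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %XX becomes one byte
    return $s;
}

my $octets = percent_decode("%C3%B6");    # two bytes: 0xC3 0xB6
my $string = decode("UTF-8", $octets);    # one character: U+00F6

printf "octets: %d byte(s), string: %d character(s), code point U+%04X\n",
    length($octets), length($string), ord($string);
# prints "octets: 2 byte(s), string: 1 character(s), code point U+00F6"
```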
Re: UTF-8 and URL encoding
by linux454 (Pilgrim) on Mar 19, 2004 at 15:56 UTC
    OK, after some more research, I think I have a better understanding of the situation. Forgive me if I am stating the obvious, but this is for the chaps like myself. UTF-8 is not a character set; it is an encoding method for the UCS/Unicode character set, whose characters can need more than one byte each. ISO-8859-1 is a superset of US-ASCII and, like it, a single-byte character set rather than an encoding method per se: each character maps directly to one byte, so no magical encoding has to be done. UTF-8 works like this:

    • UCS characters U+0000-U+007F are encoded as the single bytes 0x00-0x7F, which keeps UTF-8 compatible with ASCII
    • All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set.
    • The first byte in a multibyte sequence is always in the range of 0xC0-0xFD, and indicates how many bytes follow for this character. All further bytes in the same sequence are in the range of 0x80-0xBF
    • All possible 2^31 UCS codes can be encoded
    • The bytes 0xFE & 0xFF are never used in UTF-8 encoding

    The following table describes the byte sequences used to represent a character.
    Unicode/UCS number        Byte sequence
    U+00000000-U+0000007F     0xxxxxxx
    U+00000080-U+000007FF     110xxxxx 10xxxxxx
    U+00000800-U+0000FFFF     1110xxxx 10xxxxxx 10xxxxxx
    U+00010000-U+001FFFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    U+00200000-U+03FFFFFF     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    U+04000000-U+7FFFFFFF     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    The x bit positions are filled with the bits of the character's number in binary. The rightmost bit is the least-significant. Note that the number of leading one bits in the first byte is identical to the total number of bytes in the sequence.

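    For example, U+000000F6 (LATIN SMALL LETTER O WITH DIAERESIS, 'ö') = 1111 0110 in binary.
    Since 0xF6 is greater than 0x7F, UTF-8 uses the second row of the above table to encode this character. The eight bits are right-aligned into the eleven x positions as 00011 110110:

    110xxxxx 10xxxxxx = 0xC0 0x80 (the empty template)
    11000011 10110110 = 0xC3 0xB6 (with the bits of 0xF6 filled in)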
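The byte layout described above can be sketched in a few lines of Perl. The function utf8_bytes is a hypothetical helper written for this illustration; it covers only the one- to four-byte forms and is checked against the core Encode module for 'ö'.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one code point by hand, using the bit
# patterns for 1- to 4-byte sequences (everything up to U+1FFFFF).
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp)                          if $cp <= 0x7F;
    return (0xC0 | ($cp >> 6),
            0x80 | ($cp & 0x3F))          if $cp <= 0x7FF;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F))          if $cp <= 0xFFFF;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F))          if $cp <= 0x1FFFFF;
    die sprintf("U+%X needs a 5- or 6-byte form, not handled here", $cp);
}

my @by_hand   = utf8_bytes(0xF6);
my @by_encode = unpack("C*", encode("UTF-8", chr(0xF6)));

printf "by hand:   %s\n", join(" ", map { sprintf "0x%02X", $_ } @by_hand);
printf "by Encode: %s\n", join(" ", map { sprintf "0x%02X", $_ } @by_encode);
# both lines show 0xC3 0xB6
```

Shifting right by 6 keeps the upper bits for the lead byte, and masking with 0x3F keeps the low six bits for each continuation byte.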

    This explains how %F6 becomes %C3%B6 once the page is in UTF-8. CGI.pm hands me the two single bytes 0xC3 and 0xB6 (which read as two ISO-8859-1 characters) in place of the one two-byte Unicode character, which is expected. If I run the string through a UTF-8 decoder, it displays the proper character. However, if I display the string undecoded back to the browser in UTF-8 mode, it shows up as the wrong character (a Chinese character). I expected that if I wanted to process the string in Perl and have the proper character in the string, I would have to decode the two bytes using a UTF-8 decoder. I did not expect to have to decode the string if I was just going to turn around and display it back to the browser, which is in UTF-8 'mode'. Yet only when I decode the string does it display properly in the browser.
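The thread does not establish why the undecoded string displayed wrongly, so the following is only a sketch of how decoding and encoding interact in Perl, not a diagnosis. It assumes the two raw bytes from the form are at hand, and shows that encoding bytes which were never decoded produces a different, longer byte sequence.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xC3\xB6";    # raw bytes as received from the form

# Decode on input, encode on output: the bytes survive the round trip.
my $chars     = decode("UTF-8", $octets);
my $roundtrip = encode("UTF-8", $chars);

# Encoding the undecoded bytes treats 0xC3 and 0xB6 as two separate
# characters (U+00C3 and U+00B6) and encodes each of them again.
my $double = encode("UTF-8", $octets);

print "round trip:     ", unpack("H*", $roundtrip), "\n";    # c3b6
print "double encoded: ", unpack("H*", $double), "\n";       # c383c2b6
```

If some layer between the script and the browser encodes output as UTF-8, this would be one way for an undecoded string to come out garbled while a decoded one displays correctly.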

    Note: My source for all this newfound UCS/Unicode knowledge was http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs; some portions were copied and pasted, while others were paraphrased. Thanks to Markus Kuhn for his wonderful resource.

Re: UTF-8 and URL encoding
by qq (Hermit) on Mar 17, 2004 at 22:40 UTC

    I'm having fits with some utf-8 stuff. This is probably due to the fact that I still haven't fully wrapped my head around the whole character encoding stuff.

    Me too. If anyone has advice about what to read to get a decent understanding of practical encoding and conversion issues, I'd really appreciate it.

    qq

      The Perl XML FAQ has a section on Encodings which may offer some insights. There's also the venerable perlunicode manpage.