Ok, after some more research, I think I have a better understanding of the situation. Forgive me if I am stating the obvious, but this is for the chaps like myself. UTF-8 is not a character set; it is an encoding method for the UCS/Unicode character set, which is a multi-byte charset. ISO-8859-1, a superset of US-ASCII, is a single-byte character set: each character maps directly to one byte, so no special encoding has to be done. The way UTF-8 works is thus:
- UCS characters U+0000-U+007F are encoded as single bytes, which allows for ASCII compatibility
- All UCS characters above U+007F are encoded as a sequence of bytes, each with its most significant bit set.
- The first byte in a multibyte sequence is always in the range of 0xC0-0xFD, and indicates how many bytes follow for this character. All further bytes in the same sequence are in the range of 0x80-0xBF
- All 2^31 possible UCS codes can be encoded
- The bytes 0xFE & 0xFF are never used in UTF-8 encoding
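The byte-range rules above can be sketched as a small classifier. This is a hypothetical helper in Python (not anything from CGI.pm), just to make the ranges concrete:

```python
def classify(b):
    """Classify a single byte according to the UTF-8 rules listed above."""
    if b <= 0x7F:
        return "ASCII"          # U+0000-U+007F pass through as plain bytes
    if b <= 0xBF:
        return "continuation"   # 0x80-0xBF: a further byte in a sequence
    if b <= 0xFD:
        return "lead"           # 0xC0-0xFD: first byte, announces the length
    return "invalid"            # 0xFE and 0xFF never appear in UTF-8
```

So in the %C3%B6 case below, 0xC3 is a lead byte and 0xB6 a continuation byte.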
The following table describes the byte sequences used to represent a character.
Unicode/UCS number    | Byte sequence
U+00000000-U+0000007F | 0xxxxxxx
U+00000080-U+000007FF | 110xxxxx 10xxxxxx
U+00000800-U+0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx
U+00010000-U+001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U+00200000-U+03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U+04000000-U+7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The x bit positions are filled with the bits of the character's number in binary. The rightmost bit is the least-significant. Note that the number of leading one bits in the first byte is identical to the total number of bytes in the sequence.
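The whole table can be turned into a short encoder. This is a sketch in Python of the original 1-6 byte scheme described above (a hypothetical function, not part of CGI.pm); note that for each length the lead byte plus continuation bytes carry 5n+1 payload bits:

```python
def utf8_encode(cp):
    """Encode a UCS code point per the table above (original 1-6 byte scheme)."""
    if cp < 0x80:
        return bytes([cp])  # ASCII passes through unchanged
    # (sequence length, lead-byte marker) for 2..6 byte sequences
    for nbytes, lead in ((2, 0xC0), (3, 0xE0), (4, 0xF0), (5, 0xF8), (6, 0xFC)):
        if cp < (1 << (5 * nbytes + 1)):        # n bytes hold 5n+1 payload bits
            out = []
            for _ in range(nbytes - 1):
                out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
                cp >>= 6
            out.append(lead | cp)               # lead byte takes the remaining bits
            return bytes(reversed(out))
    raise ValueError("code point out of range")
```

Running it on 0xF6 reproduces the worked example below: `utf8_encode(0xF6)` gives the bytes 0xC3 0xB6.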
For example:
U+000000F6 (LATIN SMALL LETTER O WITH DIAERESIS, 'ö') = 1111 0110 in binary.
Since 0xF6 is greater than 0x7F, UTF-8 uses the second row of the above table to encode this character. Its bits, padded to eleven with leading zeros (000 1111 0110), are filled into the x positions:
110xxxxx 10xxxxxx (the empty template is 0xC0 0x80)
11000011 10110110 = 0xC3 0xB6
This explains how %F6 is transcoded to %C3%B6.
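The same transcoding can be watched from Python's standard library, which by default percent-escapes the UTF-8 bytes of a string (shown here only as an analogue to what the browser does, not the Perl side of things):

```python
from urllib.parse import quote, unquote

# 'ö' is first encoded to the UTF-8 bytes 0xC3 0xB6, then percent-escaped
escaped = quote("ö")         # '%C3%B6'
restored = unquote(escaped)  # back to 'ö'
```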
CGI.pm is placing single-byte characters from the ISO-8859-1 character set in place of the two-byte Unicode character, which is expected. I can also run the string through a UTF-8 decoder and it will display the proper character. However, if I send the string back to the browser undecoded while the browser is in UTF-8 mode, it shows up as the wrong character (a Chinese character). I expect that if I want to process the string in Perl and have the proper character in it, I have to decode the two bytes with a UTF-8 decoder. What I would not expect is having to decode the string just to turn around and display it back to a browser that is already in UTF-8 'mode'; yet when I do decode the string, it does display properly in the browser.
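The decode-or-not difference is easy to demonstrate outside of Perl. A sketch in Python of the general principle (not the CGI.pm code itself): the same two bytes give the intended character when decoded as UTF-8, and mojibake when each byte is taken as its own ISO-8859-1 character:

```python
raw = b"\xc3\xb6"  # the two bytes received for %C3%B6

# Decoding as UTF-8 yields the intended single character:
as_utf8 = raw.decode("utf-8")      # 'ö'

# Treating each byte as an ISO-8859-1 character gives two characters:
as_latin1 = raw.decode("latin-1")  # 'Ã¶'
```

Whether the browser shows 'ö' or garbage then depends on which interpretation the output layer re-encodes.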
Note: My source for all this new-found UCS/Unicode knowledge was http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs; some portions were copied and pasted, while others were paraphrased. Thanks to Markus Kuhn for his wonderful resource.