Hi All,
After hitting issues with form input that contained non-ASCII characters, such as £, I wrote a quick and dirty (QnD) script to try and understand what is going on. I'm afraid I still don't fully understand :/
Code for the script is included at the bottom; it'll run on Linux or Windows, under Apache, IIS or others.
As far as I understand it:-
- The form input is being encoded as UTF-8 by the browser, as the server has set a UTF-8 charset in its headers
- When Perl CGI.pm picks it up, it has no idea it's UTF-8
- If it gets saved straight out to a file, the bytes will still be UTF-8, although the file itself may not be recognised as UTF-8
- If decoded with Encode.pm, Perl will flag it as being UTF-8, but convert it to its own internal format
- If encoded with Encode.pm, Perl will NOT flag it as being UTF-8; it'll actually be double encoded
- If you try to manipulate a UTF-8 string that hasn't been decoded, such as with a regexp, strange things might happen
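To make the list above concrete, here is a minimal sketch. It assumes the browser sent £ as the two UTF-8 bytes C2 A3; the variable names ($raw, $chars, $double) are made up for illustration and are not part of the test script.

```perl
use strict;
use warnings;
use Encode qw( decode encode is_utf8 );

# Assumed stand-in for the value CGI.pm returns: the two UTF-8 bytes of a pound sign.
my $raw = "\xC2\xA3";

# decode: UTF-8 bytes -> one Perl character (U+00A3), UTF8 flag on.
my $chars = decode( 'UTF-8', $raw );

# encode on the undecoded bytes: each byte is treated as a character
# and encoded again, which is the double encoding described above.
my $double = encode( 'UTF-8', $raw );

for my $pair ( [ raw => $raw ], [ decoded => $chars ], [ encoded => $double ] ) {
    my ( $label, $value ) = @$pair;
    printf "%-8s length %d, is_utf8 %s, matches /^.\$/ %s\n",
        $label,
        length $value,
        ( is_utf8($value) ? 'yes' : 'no' ),
        ( $value =~ /^.$/ ? 'yes' : 'no' );
}
```

Only the decoded string has length 1 and matches a single-character regexp; the raw and double-encoded strings are seen as 2 and 4 separate "characters".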
Given this, I decided to use HTML::Entities to convert characters such as £ to &pound;. This is where things got more confusing. The output of my test script is:-
Input: £ (IS UTF8? No)
Decoded: ? (IS UTF8? Yes)
Encoded: Â£ (IS UTF8? No)
Entities input: Â£
Entities decoded: £
Entities encoded: £
If I print the input straight back out, it comes out as a normal £ as expected; if decoded, it gets an unrecognised character symbol; if encoded, it has the telltale Â appear. But if I pass it through HTML::Entities, the input gets the Â and the decoded one comes out right?? The encoded one, well, that comes out even weirder.
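The HTML::Entities behaviour can be reproduced outside CGI with a small sketch. It again assumes the input arrives as the bytes C2 A3; the variable names are illustrative only. encode_entities works character by character, so on undecoded input it sees two characters (Â and £) rather than one.

```perl
use strict;
use warnings;
use Encode qw( decode encode );
use HTML::Entities qw( encode_entities );

# Assumed raw form input: the UTF-8 bytes of a pound sign.
my $raw    = "\xC2\xA3";
my $chars  = decode( 'UTF-8', $raw );   # one character, U+00A3
my $double = encode( 'UTF-8', $raw );   # four bytes, double encoded

# In non-void context encode_entities returns an encoded copy
# and leaves its argument untouched.
my $ent_raw    = encode_entities($raw);
my $ent_chars  = encode_entities($chars);
my $ent_double = encode_entities($double);

print "raw:     $ent_raw\n";
print "decoded: $ent_chars\n";
print "encoded: $ent_double\n";
```

The output is plain ASCII, which would explain why the decoded variant displays correctly in the browser regardless of how the page bytes are interpreted.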
On top of this, if you write these out to a file, and view using nano or vi you see:-
Input: £
Decoded: £
Encoded: £
Which didn't make sense to me; I expected the decoded one to be just £. But when I tested this script on Win32 IIS, I got:-
Input: £
Decoded: £
Encoded: £
Which is what I expected???
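One possible factor in the file output is the I/O layer on the filehandle. The sketch below assumes no default layers are set (no PERL_UNICODE, no "use open") and uses a hypothetical helper, bytes_written, to show which bytes land on disk for a decoded £ with and without an explicit :encoding(UTF-8) layer. How those bytes then look depends on the encoding the viewer (nano, vi, Notepad) assumes.

```perl
use strict;
use warnings;
use Encode qw( decode );
use File::Temp qw( tempfile );

my $chars = decode( 'UTF-8', "\xC2\xA3" );   # one character, U+00A3

# Hypothetical helper: write $string to a temporary file through $layer
# and return the bytes that ended up in the file as hex.
sub bytes_written {
    my ( $layer, $string ) = @_;
    my ( $fh, $path ) = tempfile( UNLINK => 1 );
    close $fh;

    open( my $out, ">$layer", $path ) or die "Cannot write $path: $!";
    print {$out} $string;
    close $out;

    open( my $in, '<:raw', $path ) or die "Cannot read $path: $!";
    local $/;                                  # slurp the whole file
    my $bytes = <$in>;
    close $in;

    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

print "no layer:         ", bytes_written( '',                 $chars ), "\n";
print ":encoding(UTF-8): ", bytes_written( ':encoding(UTF-8)', $chars ), "\n";
```

Without a layer the decoded character goes out as the single Latin-1 byte A3; with the layer it is written as the UTF-8 pair C2 A3.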
Maybe a UTF-8 expert could explain this? It might make a good reference.
Test script:-
#!/usr/bin/perl
use strict;

BEGIN {
    # Print the header at compile time, before anything else can produce output.
    print "content-type: text/html; charset=UTF-8\n\n";
    use FindBin qw ($RealBin $RealScript);
    use lib $FindBin::RealBin;
    chdir $RealBin;
}#BEGIN

use CGI;
use Encode qw( is_utf8 encode decode );
use HTML::Entities;

my $cgi   = CGI->new;
my $input = $cgi->param('string');    # raw value, not decoded by CGI.pm by default

print qq~
<form method=POST>
input: <input type=text name=string value="$input">
<input type=submit>
</form>
~;

if ( $input ) {
    # 1. The raw input, exactly as CGI.pm returned it.
    print "Input: $input (IS UTF8? ", ( is_utf8($input) ? "Yes" : "No" ), ")<br>\n";

    # 2. Decoded from UTF-8 bytes into Perl characters.
    my $string = decode("utf8", $input);
    print "Decoded: $string (IS UTF8? ", ( is_utf8($string) ? "Yes" : "No" ), ")<br>\n";

    # 3. Encoded again without decoding first (double encoding).
    my $octets = encode("utf8", $input);
    print "Encoded: $octets (IS UTF8? ", ( is_utf8($octets) ? "Yes" : "No" ), ")<br>\n";

    # Write all three variants to a file, with no explicit encoding layer.
    if ( open( my $outf, '>', 'utf8.txt' ) ) {
        print {$outf} "Input: $input\n";
        print {$outf} "Decoded: $string\n";
        print {$outf} "Encoded: $octets\n";
        close($outf);
    }
    else {
        print "Error writing file";
    }

    # Run the same three variants through HTML::Entities.
    my $ent_input = encode_entities($input);
    print "Entities input: $ent_input<br>\n";
    my $ent_decode = encode_entities($string);
    print "Entities decoded: $ent_decode<br>\n";
    my $ent_encode = encode_entities($octets);
    print "Entities encoded: $ent_encode<br>\n";
}#if
Lyle
Update: Thanks everyone for the replies :)