PerlMonks  

UTF-8: Trying to make sense of form input

by cosmicperl (Chaplain)
on Aug 15, 2009 at 16:44 UTC (#788911, perlquestion)
cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  After hitting issues with form input that contained non-ASCII characters, such as £, I wrote a QnD script to try and understand what is going on. I'm afraid I still don't fully understand :/
  Code for the script is included at the bottom, it'll run on Linux or Windows, Apache/IIS/Others.
  As far as I understand it:-
  • The form input is being encoded as UTF-8 by the browser, as the server has set a UTF-8 charset in its headers
  • When Perl CGI.pm picks it up, it has no idea it's UTF-8
  • If it gets saved straight out to a file it'll still be in UTF-8 although the file itself may not
  • If decoded with Encode.pm Perl will flag it as being UTF-8, but convert it to its own internal format
  • If encoded with Encode.pm Perl will NOT flag it as being UTF-8, it'll actually be double encoded
  • If you try to manipulate a UTF-8 string that hasn't been decoded, such as with a regexp, strange things might happen
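A minimal sketch of the decode and encode points above, assuming the two bytes below stand in for what $cgi->param('string') returns when a £ is submitted (the variable names are illustrative, not part of the test script):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: stand-in for $cgi->param('string'); the UTF-8 bytes
# a browser would send for a pound sign.
my $input = "\xC2\xA3";

# decode() turns the two UTF-8 bytes into a single character, U+00A3.
my $chars = decode('UTF-8', $input);

# encode() on the still-encoded input treats each byte as a character
# and encodes it again, which is the double encoding described above.
my $double = encode('UTF-8', $input);

printf "input:   %d byte(s)\n", length $input;
printf "decoded: %d char(s), U+%04X\n", length $chars, ord $chars;
printf "encoded: %d byte(s): %s\n", length $double,
    join ' ', map { sprintf '%02X', ord } split //, $double;
```

Looking at lengths and code points rather than at what the terminal or browser renders avoids a second layer of confusion while debugging.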
Given this, I decided to use HTML::Entities to convert characters such as £ to &pound;. This is where things got more confusing. The output of my test script is:-
Input: £ (IS UTF8? No)
Decoded: ? (IS UTF8? Yes)
Encoded: Â£ (IS UTF8? No)
Entities input: Â£
Entities decoded: £
Entities encoded: £
If I print the input straight back out it comes out as a normal £, as expected. Decoded, it becomes an unrecognised-character symbol; encoded, the telltale Â appears. But if I pass it through HTML::Entities, the input gets the Â and the decoded one comes out right?? The encoded one, well, that comes out even weirder.
  On top of this, if you write these out to a file, and view using nano or vi you see:-
Input: £
Decoded: £
Encoded: £
Which didn't make sense to me, I expected the decoded one to be just . But when I tested this script on Win32 IIS, I got:-
Input: £
Decoded:
Encoded: £
Which is what I expected???

Maybe a UTF-8 expert could explain this? It might make a good reference.

Test script:-
#!/usr/bin/perl
use strict;

BEGIN {
    print "content-type: text/html; charset=UTF-8\n\n";
    use FindBin qw ($RealBin $RealScript);
    use lib $FindBin::RealBin;
    chdir $RealBin;
}#BEGIN

use CGI;
my $cgi = new CGI;

print qq~
<form method=POST>
input: <input type=text name=string value="${ \$cgi->param('string') }">
<input type=submit>
</form>
~;

if ( $cgi->param('string') ) {
    use Encode qw( is_utf8 encode decode );

    print "Input: ${ \$cgi->param('string') } (IS UTF8? ";
    if ( is_utf8($cgi->param('string')) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    my $string = decode("utf8", $cgi->param('string'));
    print "Decoded: $string (IS UTF8? ";
    if ( is_utf8($string) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    my $octets = encode("utf8", $cgi->param('string'));
    print "Encoded: $octets (IS UTF8? ";
    if ( is_utf8($octets) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    open( OUTF, '>utf8.txt' ) || print("Error writing file");
    print OUTF "Input: ${ \$cgi->param('string') }\n";
    print OUTF "Decoded: $string\n";
    print OUTF "Encoded: $octets\n";
    close( OUTF );

    use HTML::Entities;
    my $ent_input = encode_entities($cgi->param('string'));
    print "Entities input: $ent_input<br>\n";
    my $ent_decode = encode_entities($string);
    print "Entities decoded: $ent_decode<br>\n";
    my $ent_encode = encode_entities($octets);
    print "Entities encoded: $ent_encode<br>\n";
}#if

Lyle

Update: Thanks everyone for the replies :)

Re: UTF-8: Trying to make sense of form input
by Anonymous Monk on Aug 16, 2009 at 00:51 UTC
    Encode
    is_utf8(STRING [, CHECK])

    [INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If CHECK is true, also checks the data in STRING for being well-formed UTF-8. Returns true if successful, false otherwise.

    CGI
    -utf8

    This makes CGI.pm treat all parameters as UTF-8 strings. Use this with care, as it will interfere with the processing of binary uploads. It is better to manually select which fields are expected to return utf-8 strings and convert them using code like this:

    use Encode;
    my $arg = decode utf8=>param('foo');
    Maybe you want Encode::Guess
Re: UTF-8: Trying to make sense of form input
by ikegami (Pope) on Aug 16, 2009 at 01:31 UTC

    Don't look at is_utf8. That's going down the wrong path.

    Given this, I decided to use HTML::Entities to convert characters such as £ to &pound;. This is where things got more confusing.

    If param foo is encoded using UTF-8 and consists of text with HTML entities, you want

    my $text = decode_entities(decode('UTF-8', $cgi->param('foo')));

    Don't forget to encode the result if you output it in part or in full (using encode, or binmode with :encoding on the output handle).
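The decode, decode_entities, encode sequence above can be sketched without a web server. This assumes the byte string below stands in for $cgi->param('foo'); the variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# Assumption: stand-in for $cgi->param('foo'); UTF-8 bytes for a pound
# sign, followed by text containing HTML entities.
my $raw = "\xC2\xA3 &amp; &eacute;";

# Step 1: bytes -> characters. Step 2: entities -> characters.
my $text = decode_entities(decode('UTF-8', $raw));

# Step 3: characters -> bytes, done once, just before output.
my $out = encode('UTF-8', $text);

binmode STDOUT;    # no encoding layer here; the bytes are already encoded
print $out, "\n";

printf "%d characters, %d bytes\n", length $text, length $out;
```

Doing the encode step by hand and doing it with an :encoding layer on the handle are alternatives; applying both would encode the output twice.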

      In my opinion, it's futile to troubleshoot UTF-8 issues without understanding the underlying implementation and keeping track of the SVf_UTF8 flag, using Encode::is_utf8() when convenient and Devel::Peek's Dump() when necessary.

      The interface for Encode::is_utf8() is dreadful, but it's better than flailing in the dark.

        Yes, it can be useful in debugging when the flag matters. In this case, it only served to be a distraction. Thinking in terms of the UTF8 flag is the wrong way to go. Thinking in terms of encoded or not would have avoided all his problems.

        • param returns encoded chars.
        • decode_entities accepts decoded chars.
        • decode_entities returns decoded chars.
        • print without :encoding accepts encoded chars.

        Therefore, he needs to decode what param returns and encode what he prints.

        Using is_utf8 gives an idea whether the characters are decoded or not, but it's not reliable. In fact, it's specifically unreliable with decode_entities since the string decode_entities returns can have either state for the UTF8 flag. Documentation and Hungarian Notation are better tools here than is_utf8.

        Update: Fixed ambiguous pronouns. Fixed bad grammar. Fixed formatting.
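A small sketch of why the flag is a poor guide, assuming nothing beyond Encode and HTML::Entities. It prints the flag rather than predicting it, since the state may differ between the two strings and between module versions; what stays constant is the characters themselves:

```perl
use strict;
use warnings;
use Encode qw(is_utf8);
use HTML::Entities qw(decode_entities);

# Both results are decoded character strings.
my $low  = decode_entities('&eacute;');    # U+00E9, fits in a single byte
my $high = decode_entities('&#x263A;');    # U+263A, does not

for my $s ($low, $high) {
    printf "U+%04X length=%d flag=%s\n",
        ord $s, length $s, is_utf8($s) ? 'on' : 'off';
}

# Changing only the internal representation does not change the string.
my $copy = $low;
utf8::upgrade($copy);
print "same characters after upgrade: ", ($copy eq $low ? "yes" : "no"), "\n";
```

Naming variables by what they hold (for example $bytes_foo versus $text_foo, hypothetical names) is one way to apply the Hungarian Notation suggestion.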

Re: UTF-8: Trying to make sense of form input
by graff (Chancellor) on Aug 16, 2009 at 02:42 UTC
    When you intend to send utf8 character data back to the client browser from a cgi script, you really should do this at the very start:
    binmode STDOUT, ":utf8";
    and make sure that strings coming from the script itself, or from server-side resources (files, database or whatever) are likewise properly flagged as (known to be) utf8 strings.

    When you are getting utf8 characters in form data from the client browser, you have to use the Encode module -- decode('utf8',$cgi->param('foo')) -- as indicated in the replies above, so that the parameter value will be treated correctly by perl as a utf8 string.
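A hedged sketch of both halves of this advice. It uses the stricter ':encoding(UTF-8)' layer instead of ':utf8', and an in-memory filehandle in place of STDOUT so the resulting bytes can be inspected; the byte string standing in for $cgi->param('foo') is an assumption:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: stand-in for $cgi->param('foo'); UTF-8 bytes for a pound sign.
my $param = "\xC2\xA3";
my $text  = decode('UTF-8', $param);    # now a single character

# In a CGI script the layer would go on STDOUT at the very start:
#     binmode STDOUT, ':encoding(UTF-8)';
# An in-memory handle is used here only to capture the output.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$fh} "Price: $text\n";
close $fh;

printf "%d characters written, %d bytes stored\n",
    length("Price: $text\n"), length $buffer;
```

With the layer in place, everything printed to that handle should be a decoded character string; printing already-encoded bytes through it would encode them a second time.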

Re: UTF-8: Trying to make sense of form input
by Nigel Peck (Initiate) on Sep 17, 2009 at 21:19 UTC
    For what it's worth, I've been struggling with a very similar problem for ages, and in the end it appears that HTML::Entities was causing my problem. Since you're using it here, have a look at that. It encodes the characters directly (using chr()) I believe, and I don't think it supports UTF8. I could be wrong, but that's what was causing my problem.

      I don't think it supports UTF8.

      I think you mean UTF-8. UTF-8 is a character encoding. It's a means of converting characters to and from bytes for use in mediums that don't have a concept of characters.

      HTML::Entities works with characters, not bytes that were characters before they were encoded. It doesn't know anything of any character encoding (like UTF-8) since it only works with characters.

      The HTML portions you pass to decode_entities must first be decoded from bytes into characters (based on the encoding specified in the Content-Type header).

      Similarly, the HTML portions you receive from encode_entities must then be encoded from characters into bytes (based on the encoding specified in the Content-Type header).
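To illustrate, here is a sketch of encode_entities applied to undecoded bytes and to decoded characters. The input bytes are an assumed stand-in for a form field containing a pound sign, and the restricted character set in the last call is just one possible choice:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(encode_entities);

my $bytes = "\xC2\xA3";                 # assumed form input: UTF-8 bytes for a pound sign
my $chars = decode('UTF-8', $bytes);    # one character, U+00A3

# Undecoded: each byte is taken to be a character of its own.
my $wrong = encode_entities($bytes);

# Decoded: the pound sign is seen as the single character it is.
my $right = encode_entities($chars);

print "undecoded input: $wrong\n";
print "decoded input:   $right\n";

# If only the markup characters are escaped, the pound sign stays a
# character and still has to be encoded to bytes for output.
my $html       = encode_entities($chars, '<>&"');
my $html_bytes = encode('UTF-8', $html);
printf "restricted set: %d character(s), %d byte(s) after encoding\n",
    length $html, length $html_bytes;
```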
