Hi All,
  After hitting issues with form input that contained non-ASCII characters, such as £, I wrote a quick-and-dirty script to try and understand what is going on. I'm afraid I still don't fully understand :/
  Code for the script is included at the bottom. It'll run on Linux or Windows, under Apache, IIS or other web servers.
  As far as I understand it:-
  • The form input is being encoded as UTF-8 by the browser, as the server has set a UTF-8 charset in its headers
  • When Perl CGI.pm picks it up, it has no idea it's UTF-8
  • If it gets saved straight out to a file, the data will still be UTF-8, although the file itself may not be treated as UTF-8
  • If decoded with Encode.pm, Perl will flag it as being UTF-8, but convert it to its own internal format
  • If encoded with Encode.pm, Perl will NOT flag it as being UTF-8; it'll actually be double encoded
  • If you try to manipulate a UTF-8 string that hasn't been decoded, such as with a regexp, strange things might happen
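The points in the list above can be checked outside CGI with a small sketch. It assumes the browser sent the two octets 0xC2 0xA3 (the UTF-8 encoding of £); the variable names are made up for illustration and are not part of the test script.

```perl
use strict;
use warnings;
use Encode qw( decode encode is_utf8 );

# Assumption: these are the raw octets the browser sent for a pound sign.
my $raw = "\xC2\xA3";

# decode: UTF-8 octets -> one Perl character (U+00A3), UTF8 flag on.
my $chars = decode( 'UTF-8', $raw );

# encode applied to the *undecoded* octets: each octet is treated as a
# Latin-1 character and encoded again, giving four octets (double encoding).
my $double = encode( 'UTF-8', $raw );

printf "raw:     length %d, flagged %s\n", length($raw),    ( is_utf8($raw)    ? 'yes' : 'no' );
printf "decoded: length %d, flagged %s\n", length($chars),  ( is_utf8($chars)  ? 'yes' : 'no' );
printf "encoded: length %d, flagged %s\n", length($double), ( is_utf8($double) ? 'yes' : 'no' );
```

Encoding the decoded string again, encode( 'UTF-8', $chars ), gives back the original two octets, which is the round trip one normally wants: decode on input, encode on output.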
Given this, I decided to use HTML::Entities to convert characters such as £ to &pound;. This is where things got more confusing. The output of my test script is:-
Input: (IS UTF8? No)
Decoded: ? (IS UTF8? Yes)
Encoded: £ (IS UTF8? No)
Entities input: £
Entities decoded:
Entities encoded: £
If I print the input straight back out, it comes out as a normal £ as expected. If decoded, it gets an unrecognised character symbol; if encoded, the tell-tale stray character appears in front of it. But if I pass it through HTML::Entities, the input gets the stray character and the decoded one comes out right?? The encoded one, well, that comes out even weirder.
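A minimal sketch of why HTML::Entities might treat the raw input and the decoded string differently, again assuming the input is the two octets 0xC2 0xA3. encode_entities works character by character, so undecoded octets are seen as two Latin-1 characters rather than one pound sign. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw( decode );
use HTML::Entities qw( encode_entities );

# Assumption: raw form input for a pound sign, as UTF-8 octets.
my $raw   = "\xC2\xA3";
my $chars = decode( 'UTF-8', $raw );

# Undecoded: two separate "characters", 0xC2 and 0xA3, each gets an entity.
my $from_bytes = encode_entities($raw);

# Decoded: a single character U+00A3, so a single entity.
my $from_chars = encode_entities($chars);

print "from bytes: $from_bytes\n";    # &Acirc;&pound;
print "from chars: $from_chars\n";    # &pound;
```

Because the entities are plain ASCII, they display the same regardless of the page charset, which would be consistent with the decoded value looking right only after passing through encode_entities.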
  On top of this, if you write these out to a file, and view using nano or vi you see:-
Input: £
Decoded: £
Encoded: £
Which didn't make sense to me; I expected the decoded one to be just . But when I tested this script on Win32 IIS, I got:-
Input: £
Decoded:
Encoded: £
Which is what I expected???
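What ends up in the file depends on the I/O layer of the filehandle as well as on the string. The sketch below assumes no default layers are in effect (no PERL_UNICODE, no "use open") and uses a hypothetical file name, utf8_demo.txt. It writes the same decoded character twice and compares the file sizes.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $chars = decode( 'UTF-8', "\xC2\xA3" );    # one character, U+00A3
my $file  = 'utf8_demo.txt';                  # hypothetical file name

# No layer: a character below U+0100 is written as a single Latin-1 octet.
open( my $plain, '>', $file ) or die "Cannot write $file: $!";
print {$plain} $chars;
close($plain) or die "Cannot close $file: $!";
my $plain_size = -s $file;

# :encoding(UTF-8) layer: the same character is written as two octets.
open( my $utf8, '>:encoding(UTF-8)', $file ) or die "Cannot write $file: $!";
print {$utf8} $chars;
close($utf8) or die "Cannot close $file: $!";
my $utf8_size = -s $file;

print "without layer: $plain_size octet(s)\n";
print "with layer:    $utf8_size octet(s)\n";

unlink $file;
```

How an editor such as nano or vi then displays the single-octet version depends on the terminal's own encoding, so the same file can look different on two machines.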

Maybe a UTF-8 expert could explain this? It might make a good reference.

Test script:-
#!/usr/bin/perl
use strict;

BEGIN {
    print "content-type: text/html; charset=UTF-8\n\n";
    use FindBin qw ($RealBin $RealScript);
    use lib $FindBin::RealBin;
    chdir $RealBin;
}#BEGIN

use CGI;
my $cgi = new CGI;

print qq~
<form method=POST>
input: <input type=text name=string value="${ \$cgi->param('string') }">
<input type=submit>
</form>
~;

if ( $cgi->param('string') ) {
    use Encode qw( is_utf8 encode decode );

    print "Input: ${ \$cgi->param('string') } (IS UTF8? ";
    if ( is_utf8($cgi->param('string')) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    my $string = decode("utf8", $cgi->param('string'));
    print "Decoded: $string (IS UTF8? ";
    if ( is_utf8($string) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    my $octets = encode("utf8", $cgi->param('string'));
    print "Encoded: $octets (IS UTF8? ";
    if ( is_utf8($octets) ) { print "Yes)<br>\n"; }
    else { print "No)<br>\n"; }

    open( OUTF, '>utf8.txt' ) || print("Error writing file");
    print OUTF "Input: ${ \$cgi->param('string') }\n";
    print OUTF "Decoded: $string\n";
    print OUTF "Encoded: $octets\n";
    close( OUTF );

    use HTML::Entities;
    my $ent_input = encode_entities($cgi->param('string'));
    print "Entities input: $ent_input<br>\n";
    my $ent_decode = encode_entities($string);
    print "Entities decoded: $ent_decode<br>\n";
    my $ent_encode = encode_entities($octets);
    print "Entities encoded: $ent_encode<br>\n";
}#if

Lyle

Update: Thanks everyone for the replies :)

In reply to UTF-8: Trying to make sense of form input by cosmicperl
