PerlMonks  

Re: UTF-8: Trying to make sense of form input

by ikegami (Pope)
on Aug 16, 2009 at 01:31 UTC (#788955)


in reply to UTF-8: Trying to make sense of form input

Don't look at is_utf8. That's going down the wrong path.

Given this, I decided to use HTML::Entities to convert characters such as &pound; to £. This is where things got more confusing.

If param foo is encoded using UTF-8 and consists of text with HTML entities, you want

my $text = decode_entities(decode('UTF-8', $cgi->param('foo')));

Don't forget to encode the result if you output it, in part or in full (using encode, or binmode with :encoding on the output handle).
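
A minimal sketch of that whole flow, for illustration only. The hard-coded $param_bytes is an assumption standing in for $cgi->param('foo'), and the sample text is made up; the only calls used are decode and encode from Encode and decode_entities from HTML::Entities.

```perl
use strict;
use warnings;
use Encode         qw( decode encode );
use HTML::Entities qw( decode_entities );

# Assumed stand-in for $cgi->param('foo'): the UTF-8 bytes of
# a pound sign, " &amp; " and a euro sign.
my $param_bytes = "\xC2\xA3 &amp; \xE2\x82\xAC";

# Bytes -> characters first, then entities -> characters.
my $text = decode_entities(decode('UTF-8', $param_bytes));

# Characters -> bytes again before they leave the program.
my $out_bytes = encode('UTF-8', $text);
print "$out_bytes\n";
```

Here $text holds five characters and $out_bytes holds their eight-byte UTF-8 encoding. Calling binmode(STDOUT, ':encoding(UTF-8)') once and printing $text directly would do the same job as the explicit encode.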


Re^2: UTF-8: Trying to make sense of form input
by creamygoodness (Curate) on Aug 16, 2009 at 03:58 UTC
    In my opinion, it's futile to troubleshoot UTF-8 issues without understanding the underlying implementation and keeping track of the SVf_UTF8 flag, using Encode::is_utf8() when convenient and Devel::Peek's Dump() when necessary.

    The interface for Encode::is_utf8() is dreadful, but it's better than flailing in the dark.

      Yes, it can be useful in debugging when the flag matters. In this case, it only served to be a distraction. Thinking in terms of the UTF8 flag is the wrong way to go. Thinking in terms of encoded or not would have avoided all his problems.

      • param returns encoded chars.
      • decode_entities accepts decoded chars.
      • decode_entities returns decoded chars.
      • print without :encoding accepts encoded chars.

      Therefore, he needs to decode what param returns and encode what he prints.

      Using is_utf8 gives an idea whether the characters are decoded or not, but it's not reliable. In fact, it's specifically unreliable with decode_entities, since the string decode_entities returns can have either state for the UTF8 flag. Documentation and Hungarian Notation are better tools here than is_utf8.
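
      A small sketch of why the flag is a poor proxy for "decoded". It uses only the core utf8:: functions (utf8::is_utf8 reports the same flag as Encode::is_utf8); the variable names and the sample character are assumptions chosen for illustration, with a txt_ prefix as one possible Hungarian-style marker for decoded text.

      ```perl
      use strict;
      use warnings;

      # Two copies of the same decoded text: the single character U+00E9.
      my $txt_a = "\xE9";
      my $txt_b = "\xE9";

      utf8::downgrade($txt_a);   # stored internally as one byte per character
      utf8::upgrade($txt_b);     # stored internally as UTF-8

      # Same string as far as Perl is concerned, different flag state.
      printf "equal:  %s\n", ($txt_a eq $txt_b)    ? 'yes' : 'no';
      printf "flag a: %s\n", utf8::is_utf8($txt_a) ? 'on'  : 'off';
      printf "flag b: %s\n", utf8::is_utf8($txt_b) ? 'on'  : 'off';
      ```

      Both variables hold the same one-character string and compare equal, yet the flag is off for one and on for the other, so checking the flag alone cannot say whether a string has been decoded.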

      Update: Fixed ambiguous pronouns. Fixed bad grammar. Fixed formatting.

        I think you're right that the OP needs to grasp the mental model you've laid out.

        But I predict that until the OP masters debugging the encoding -- which requires understanding the role of the UTF8 flag -- problems are going to keep cropping up. If there were an "encoded/decoded" flag that you could check, that would be lovely. Since no such flag exists, you need to be able to look at the raw string and the presence/absence of the UTF8 flag in Devel::Peek to see what's going wrong.
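
        A rough sketch of that kind of inspection, assuming nothing beyond Devel::Peek and Encode. The smiley bytes are just sample data, and the exact Dump output differs between Perl versions, so none is shown here.

        ```perl
        use strict;
        use warnings;
        use Devel::Peek qw( Dump );
        use Encode      qw( decode );

        my $bytes = "\xE2\x98\xBA";            # UTF-8 encoding of U+263A
        my $chars = decode('UTF-8', $bytes);   # the same smiley as one character

        # Dump writes to STDERR. Compare the FLAGS and PV lines of the two dumps.
        Dump($bytes);
        Dump($chars);
        ```

        The thing to compare is whether UTF8 appears in the FLAGS line and what the raw PV buffer holds: the encoded string is three separate characters, the decoded one is a single character.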

        There are simply too many opportunities to mess up. Forget a binmode() here, omit (or include) a -utf8 argument there, forget to set pg_enable_utf8 on your DBD::Pg db handle, pass something through YAML::Syck without setting $YAML::Syck::ImplicitUnicode, and so on.

        In short... documentation and Hungarian notation are too unreliable :) -- because the underlying system is too hard to control from a high level.

        IMO, the only way to achieve high reliability for UTF-8 is to write tests.

        use Test::More tests => 1;
        # round_trip() stands for whatever pipeline is under test
        my $smiley = "\x{263a}";
        my $maybe = round_trip($smiley);
        is( $maybe, $smiley, "String survives round trip including UTF8 flag" );
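
        A self-contained variant of that test, as a sketch only. The round_trip body here is a hypothetical stand-in that merely encodes to UTF-8 and decodes back; a real one would push the string through the form, database, or serializer being tested. Since is() compares with eq, which ignores the flag, the flag gets its own explicit check.

        ```perl
        use strict;
        use warnings;
        use Test::More tests => 3;
        use Encode qw( encode decode );

        # Hypothetical stand-in for the real pipeline under test.
        sub round_trip {
            my ($text) = @_;
            my $bytes = encode('UTF-8', $text);
            return decode('UTF-8', $bytes);
        }

        my $smiley = "\x{263a}";
        my $maybe  = round_trip($smiley);

        is( $maybe, $smiley, 'string survives round trip' );
        is( length($maybe), 1, 'still one character, not three bytes' );
        ok( utf8::is_utf8($maybe), 'UTF8 flag is set on the result' );
        ```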

        PS: You updated your node multiple times over the half hour or so after it was posted, forcing me to keep rewriting my reply. :(
