PerlMonks  

CGI.pm and encoding HTML entities in param()

by skazat (Hermit)
on Jun 17, 2008 at 21:21 UTC (#692589=perlquestion)
skazat has asked for the wisdom of the Perl Monks concerning the following question:

I'm finding that vars I pull from CGI.pm's param method are encoded, as if they've been run through HTML::Entities. For example:

my $foo = $q->param('foo');

I couldn't find the entity encoding spec'd in the CGI.pm docs. Is this something new? It is becoming a problem, as I end up double-encoding my HTML entities. I've been working with CGI.pm for about 9 years now, and it is surprising to find the behavior changed.

I'm having a hard time extracting a simple example from the large behemoth of a program that shows this problem, but I will continue to try to get one :)

Re: CGI.pm and encoding HTML entities in param()
by almut (Canon) on Jun 17, 2008 at 22:01 UTC

    What does the respective raw query look like when you encounter the problem, i.e. $ENV{QUERY_STRING} with GET requests? In other words, are you sure it's CGI.pm that's responsible for the encoding?

    Also, as you describe things, it sounds as if this is a new phenomenon. So, has there been a version change of CGI.pm lately? Which version is it, anyway?
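A minimal sketch of the check suggested above: a throwaway CGI script that reports the raw query string alongside the installed CGI.pm version. The helper name describe_request is hypothetical, and the pattern only looks for URL-escaped decimal references such as %26%23257%3B, so treat it as a starting point rather than a complete diagnostic.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report the raw query string next to the CGI.pm version, so the
# encoding can be traced either to the browser or to the module.
# (describe_request is a hypothetical helper name.)
sub describe_request {
    my ($env) = @_;
    my $raw = defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';

    # CGI.pm may not be installed everywhere, so load it defensively.
    my $version = eval { require CGI; CGI->VERSION };
    $version = 'not installed' unless defined $version;

    # An entity typed in by the browser arrives URL-escaped:
    # "&", "#" and ";" become %26, %23 and %3B.
    my $has_entity = $raw =~ /%26%23\d+%3B/i ? 'yes' : 'no';

    return "QUERY_STRING: $raw\n"
         . "entity in raw query: $has_entity\n"
         . "CGI.pm version: $version\n";
}

print "Content-Type: text/plain\r\n\r\n";
print describe_request(\%ENV);
```

If the raw query already contains the escaped entity, the encoding happened before CGI.pm ever saw the request.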

Re: CGI.pm and encoding HTML entities in param()
by pc88mxer (Vicar) on Jun 18, 2008 at 01:29 UTC
    This is probably not CGI's doing but your browser's. For instance, if the charset of your pages is iso-8859-1 and you enter a non-latin1 character (like ā) into a text field, Firefox will submit the character as a numeric character reference (&#257;). This is essentially the best it can do, since there is no way to represent a non-latin1 character in the latin1 encoding. The situation is explained well in the following article:

    Character Conversions from Browser to Database
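To illustrate how the double encoding can arise, here is a small sketch that assumes the browser submitted a decimal numeric character reference such as &#257; in place of the character. Both helpers (escape_html, decode_numeric_refs) are hypothetical, hand-rolled stand-ins so the example runs with core Perl only; in real code the encode_entities and decode_entities functions from HTML::Entities would likely be the better tools.

```perl
use strict;
use warnings;

# Hand-rolled stand-in for an entity encoder (illustration only):
# escapes the ampersand and the angle brackets.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Turn decimal numeric character references such as "&#257;"
# back into the characters they stand for.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

my $submitted = 'M&#257;ori';                 # what param('foo') might return
my $double    = escape_html($submitted);      # the "&" of the entity gets escaped again
my $chars     = decode_numeric_refs($submitted);
my $safe      = escape_html($chars);          # escaped once, from real characters

printf "double-encoded: %s\n", $double;
printf "decoded length: %d\n", length $chars;
```

Decoding the references first and escaping exactly once on output avoids the double encoding, at the cost of having to handle wide characters from that point on.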

    As for CGI, the values returned by param() are byte strings, not code-point strings. Due to the way the web standards evolved there just isn't enough information in the request to convert the parameter values to code-points. So this is something your application has to do based on what it knows about the encoding of the forms and web pages that will be calling it.
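A minimal sketch of that application-side decoding step, using the core Encode module. It assumes the form was submitted as UTF-8; the encoding name is something the application has to know, and decode_param is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a raw parameter value (bytes) into a character string.
# The encoding is the application's own knowledge; the request
# does not reliably carry it.
sub decode_param {
    my ($bytes, $encoding) = @_;
    return unless defined $bytes;
    return decode($encoding, $bytes);
}

my $bytes = "\xC4\x81";                       # the UTF-8 bytes for U+0101
my $chars = decode_param($bytes, 'UTF-8');

# prints "bytes: 2, characters: 1, code point: U+0101"
printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length $bytes, length $chars, ord $chars;
```

In a script the raw value would come from $q->param('foo') rather than a literal.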

    This thread sheds some additional light on the problem: CGI::Application - Which is the proper way of handling and outputting utf8. As Juerd notes, it would be helpful if CGI were (or could be made) encoding-aware, so that parameter values could automatically be passed through a decoding function.

    A good way to help avoid character encoding problems is to 1) always explicitly specify the charset of your pages, and 2) settle on one encoding that can handle everything, e.g. UTF-8.
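Both points can be sketched in a few lines: declare the charset in the header and make sure the bytes sent actually match it. This is an assumption-laden illustration rather than a drop-in fix; utf8_response is a hypothetical helper, and the header is built by hand so the example needs only core modules (CGI.pm's header method also accepts a -charset argument).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build a response whose declared charset matches the bytes sent.
# (utf8_response is a hypothetical helper name.)
sub utf8_response {
    my ($html) = @_;
    my $header = "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    return $header . encode('UTF-8', $html);
}

# A form that asks the browser to submit UTF-8 as well.
my $page = qq{<form method="post" accept-charset="UTF-8">}
         . qq{<input name="foo" value="\x{101}"></form>};

binmode STDOUT;                  # the body is already encoded bytes
print utf8_response($page);
```

With the page served as UTF-8, the browser has no reason to fall back to numeric character references for characters outside latin1.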

      My goodness, I think you're right. Thanks for such an eloquent reply! Charsets and encodings aren't my favorite gremlins to attempt to solve, especially in 8+ year old code! - s
