
Differences in UTF-8 html form

by Realbot (Scribe)
on Jan 09, 2005 at 12:33 UTC ( #420673=perlquestion )

Realbot has asked for the wisdom of the Perl Monks concerning the following question:

I'm having some problems with a web application of mine.
To make things clearer, here is an HTML input form that demonstrates it. The form submits two strings, one via GET and one via POST, and, BTW, uses HTML::Mason.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Test utf</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<form name="formutfget" method="GET">
Enter text (get):<br>
<input type="text" name="textget" size="20" maxlength="30">
</form>
<form name="formutfpost" method="POST">
Enter text (post):<br>
<input type="text" name="textpost" size="20" maxlength="30">
</form>
Value of GET: <% $textget %><br>
Hex of GET: <% $hexget %><br>
Value of POST: <% $textpost %><br>
Hex of POST: <% $hexpost %><br>
</body>
</html>

<%args>
$textget => ''
$textpost => ''
$hexget => ''
$hexpost => ''
</%args>

<%init>
$hexget = unpack('H*', $textget);
$hexpost = unpack('H*', $textpost);
</%init>
The strange thing is that running this form under these environments

Debian Woody - perl 5.6.1 - Mozilla 1.4.3/Firefox 1.0
Debian Sid - perl 5.8.4 - Mozilla 1.4.3/Firefox 1.0

using as input the string "Δωδεκανήσων", I get

Value of GET: Δωδεκανήσων
Hex of GET:
26233931363b26233936393b26233934383b26233934393b26233935343b26233934353b26233935373b26233934323b26233936333b26233936393b26233935373b
Value of POST: Δωδεκανήσων
Hex of POST:
26233931363b26233936393b26233934383b26233934393b26233935343b26233934353b26233935373b26233934323b26233936333b26233936393b26233935373b

while in OpenBSD 3.3 - perl 5.8.0 - Mozilla 1.4.3/Firefox 1.0 with the same input string I get

Value of GET: Δωδεκανήσων
Hex of GET: ce94cf89ceb4ceb5cebaceb1cebdceaecf83cf89cebd
Value of POST: Δωδεκανήσων
Hex of POST: ce94cf89ceb4ceb5cebaceb1cebdceaecf83cf89cebd

So it seems that in the former case I get escaped Unicode characters (numeric character references) and in the latter UTF-8 bytes.
I thought it could be a 5.6 vs 5.8 difference, but as you can see I got the same escaped characters even under Debian Sid.
Could it be an OpenBSD peculiarity? I've Googled it but with no luck; maybe someone can shed some light on it...
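
A quick way to see what each hex dump actually contains is to turn it back into bytes. The following is a minimal sketch that assumes only the two dumps shown above and the core Encode module; the variable names are illustrative and not part of the original component.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hex dumps copied from the results above (Debian first, OpenBSD second).
my $hex_debian = join '', qw(
    26233931363b 26233936393b 26233934383b 26233934393b
    26233935343b 26233934353b 26233935373b 26233934323b
    26233936333b 26233936393b 26233935373b
);
my $hex_openbsd = 'ce94cf89ceb4ceb5cebaceb1cebdceaecf83cf89cebd';

# pack('H*') reverses the unpack('H*') done in the <%init> block.
my $bytes_debian  = pack 'H*', $hex_debian;
my $bytes_openbsd = pack 'H*', $hex_openbsd;

# The Debian bytes are plain ASCII: a run of numeric character references.
print "$bytes_debian\n";

# Resolve each &#NNN; reference to the character it names.
(my $from_refs = $bytes_debian) =~ s/&#(\d+);/chr($1)/ge;

# The OpenBSD bytes are UTF-8; decode them into characters.
my $from_utf8 = decode('UTF-8', $bytes_openbsd);

print $from_refs eq $from_utf8
    ? "both dumps encode the same string\n"
    : "the dumps encode different strings\n";
```

The Debian dump turns out to be ASCII text of the form &#916;&#969;..., which is consistent with the browser treating the form as a non-UTF-8 page and escaping every character it could not represent.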

Thanks!

2005-01-09 Janitored by Arunbear - added code tags around long hex strings, to prevent distortion of site layout

Replies are listed 'Best First'.
Re: Differences in UTF-8 html form
by iblech (Friar) on Jan 09, 2005 at 18:26 UTC

    A guess:

    What do you send as Content-Type in the HTTP headers? (Moz/FF: Right Click -> View Page Info -> Encoding)

    Maybe the Moz/FF on OpenBSD prefers the HTTP Content-Type, and ignores the one you set in the HTML, thus not sending utf-8. And the Moz/FF on Linux ignores the HTTP Content-Type, respecting the HTML <meta>, and thus sends correct utf-8.

    Just a guess, though.

      Maybe it wasn't clear in my posting, but it was the server OS that changed, not the client. I used Mozilla and Firefox as clients under Debian Woody with ISO-8859-1 encoding.
      Thanks anyway for your reply.
I'm troubleshooting post requests with german chars...
by tphyahoo (Vicar) on Jan 10, 2005 at 12:29 UTC
    I'm currently troubleshooting a problem with HTML POST requests involving German characters, described at "problem with german chars in html post fetch". This is bare LWP rather than Mason, and German rather than Hungarian, but I still thought the problem space was similar enough that it might shed light on your situation.

    Basically, before sending my POST request I try encoding it several ways (CGI::Enurl's enurl, URI::Escape's uri_escape, and even a regex suggested in perlfaq9), but none of these work, and I wind up having to write my own function to get my POST request to work correctly.

    I'm not really happy with my solution as it is very un-DWIM, but at least it's working.

    Hope this helps!

      I'd love to see your function, if you can share it...
      Here or by mail if you prefer (realbot |at| gmail |dot| com)

      Thanks!
        Sorry, I forgot to link to "problem with german chars in html post fetch", but I just went back and fixed it (above).

        I actually did two code postings in the thread because my approach to the problem evolved, and I judged it would make the original question harder to understand if I just updated... so to get the most knowledge about the weirdness I was encountering, I'd read through the whole thread.

        But if you just want the function snippet, it was

        # Takes a variable and spits it back out with the proper german characters
        sub germanchars_to_strange_html_chars {
            my $var = shift;
            my %table = (
                '' => 'ß',
                '' => 'ä',
                '' => 'ö',
                '' => 'ä',
                '' => 'ö',
                '' => 'ü',
                '' => 'ü');
            while (my ($k,$v) = each %table) {
                $var =~ s/$k/$v/g;
            }
            return $var;
        }
        I don't think this function is a solution to the problems you were having with Hungarian, just that my approach might be helpful.

        Also, you might want to know how I determined the substitutions for the function.

        I did the post request manually with firefox, and then did file->save as. The funky characters were then culled from the result. Truly a kludge, and I'm sure there's a better way to do it, but until I figure it out, that's what I'm left with.

        thomas.

        To further distill what seems to be the problem, the crux is that CGI::Enurl's enurl('börse') results in 'b%F6rse' for me, but 'börse' for holli, another perlmonk working with German characters, who tried to help me.

        The result of this is that CGI POST requests work for holli after running the data through enurl, but they fail for me. Very un-dwimmy.

        I suspect this has something to do with differences between the default encoding on my system, holli's system, and maybe Realbot's system. And I also suspect this has to do with how Perl handles encodings. But I am at a loss for how to isolate this.
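
One way the same word can escape differently is the encoding of the bytes handed to the escaper. The sketch below is only an illustration of that effect: it uses the core Encode module and a hypothetical helper, percent_escape, instead of CGI::Enurl, so it does not reproduce that module's exact behaviour.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-escape every byte outside the unreserved set.
sub percent_escape {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# "börse" as a character string; \x{F6} is o-umlaut.
my $word = "b\x{F6}rse";

# One byte for the umlaut when the text is ISO-8859-1 (close to Windows ANSI).
print percent_escape(encode('ISO-8859-1', $word)), "\n";

# Two bytes for the umlaut when the text is UTF-8.
print percent_escape(encode('UTF-8', $word)), "\n";
```

The first line comes out as b%F6rse and the second as b%C3%B6rse, so a server expecting one form will misread the other.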

        Hope this leads to a solution for both of us somehow...

        thomas.

        Realbot, after hours of head-scratching I finally found a solution to my problem, which I recorded at "The problem was utf-8 versus windows ansi."

        Glad you found a solution. For those running into this kind of issue with automated crawling via LWP calls, it may be helpful to run CGI::Enurl::enurl on input data before getting/posting. And make sure that the script is saved in UTF-8 format, or CGI::Enurl won't do its job right.

        YMMV...

Re: Differences in UTF-8 html form
by jmanning2k (Pilgrim) on Jan 10, 2005 at 17:01 UTC

    Can you check the values of the environment var LANG on both server systems? I suspect one is set to something like 'en_US.ISO8859-1' and the other is 'en_US.UTF-8'. (where en_US may be replaced with your local default language/charset)

    Remember to check as the user running the web server. However, if your shell startup scripts don't change LANG, the value is probably the same for all users.

    Setting LANG to 'C' or both to the same value will probably give consistent formatting. You can probably do that in mod_perl startup.pl or at the start of a standalone cgi app.

    $ENV{'LANG'} = 'en_US.UTF-8';
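
To compare the two servers, something like the following could be dropped into a CGI script or startup.pl. It is a minimal sketch; locale_report is a hypothetical helper, and LC_ALL and LC_CTYPE are included on the assumption that they may override LANG on your systems.

```perl
use strict;
use warnings;

# Hypothetical helper: report the locale-related variables the process sees.
# Pass key/value pairs to inspect something other than the real environment.
sub locale_report {
    my %env = @_ ? @_ : %ENV;
    return map {
        sprintf '%-8s = %s', $_, defined $env{$_} ? $env{$_} : '(unset)'
    } qw(LC_ALL LC_CTYPE LANG);
}

print "$_\n" for locale_report();
```

Run it as the web server user on both machines and compare the output line by line.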

    ~J
Re: Differences in UTF-8 html form
by Realbot (Scribe) on Jan 10, 2005 at 18:25 UTC
    Dear Monks,
    I've found the reason for the differences.
    On the Debian systems, Apache is configured with

    # Default charset to iso-8859-1 (http://www.apache.org/info/css-security/).
    AddDefaultCharset on

    which causes the encoding given in the META tag to be ignored completely, so the default encoding is always used. In the Apache installation under OpenBSD the directive was not present, so the encoding was correct.
    When I removed that nasty directive, everything worked on Debian as well...
    Hope it helps you, too!
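
The directive matters because a charset in the HTTP Content-Type header takes priority over the one in the META tag (HTML 4.01, section 5.2.2). The sketch below only illustrates that precedence; charset_of and effective_charset are hypothetical helpers, not part of Apache, Mason, or any browser.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from a Content-Type value.
sub charset_of {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    return $content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)/i ? lc($1) : undef;
}

# Hypothetical helper: the HTTP header's charset, if any, beats the META tag's.
sub effective_charset {
    my ($http_header, $meta_content) = @_;
    my $from_http = charset_of($http_header);
    return defined $from_http ? $from_http : charset_of($meta_content);
}

my $meta = 'text/html; charset=UTF-8';

# With "AddDefaultCharset on", the header carries charset=iso-8859-1.
print effective_charset('text/html; charset=iso-8859-1', $meta), "\n";

# Without the directive the header has no charset, so the META tag decides.
print effective_charset('text/html', $meta), "\n";
```

The first call yields iso-8859-1 and the second utf-8, which matches the Debian and OpenBSD results respectively.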
