Although I'll second the suggestion of using the 'Accept-charset' header, I'm not so sure about user agents responding in the same encoding as the page.
From RFC 2616 (HTTP/1.1):
I'm still not sure how to handle form data in the QUERY_STRING -- from section 2.1 of RFC 2396 (URI Syntax):
(If anyone knows of a followup RFC, I'd love to know what the number is.)
And for the original poster: although Joel's article is a good start, it's intended as a quick overview. I'd also suggest you take a look at A tutorial on character code issues.
That's right, there's no real standard way of telling a client how the URI for a GET should be encoded (and even if there is for POST, it seems most clients don't comply). However, practical experience with mainstream browsers leads to this conclusion (http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html).
By now (2005) the robust way to deal with this issue is to send out forms pages encoded in utf-8, expecting the forms input to be submitted back using that encoding. This has been in practical use for a couple of years now (e.g. at Google) and can be expected to work with any current HTML4-compatible browser. However, there are other browsers still in use which don't fit this description, so it still seems relevant to look at the theory and compare it with observations.
I've used this approach for several websites and it works with all the (reasonably recent) browsers I've tested.
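For what it's worth, here is a rough sketch of the server side of that approach: assume the form page went out as utf-8, decode the submitted query string strictly as utf-8, and treat a decoding failure as a sign that the browser used some other encoding. This is Python rather than anything from the thread, and `decode_form` is just a name I made up for illustration.

```python
from urllib.parse import parse_qs

def decode_form(query_string):
    """Decode a query string as UTF-8.

    Returns a dict of field name -> list of values, or None when the
    percent-encoded bytes are not valid UTF-8 (i.e. the client probably
    submitted the form in a different encoding).
    """
    try:
        return parse_qs(
            query_string,
            encoding="utf-8",
            errors="strict",          # raise instead of silently replacing bad bytes
            keep_blank_values=True,   # keep fields the user left empty
        )
    except UnicodeDecodeError:
        return None
```

A submission such as `name=caf%C3%A9` decodes cleanly, while `name=caf%E9` (the same word sent as ISO-8859-1) is rejected, so you can fall back to whatever legacy handling you need.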
"In theory, theory and practice are
the same, but in practice, they never are."
Thanks for the reference -- I know sgifford had given it as well, but he seemed to just be quoting it, rather than mentioning the information it contained.
I hadn't seen the 'buzzword' concept presented before, but it seems like a simple hack to validate what's being sent back to you.
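As I understand the 'buzzword' idea, you put a hidden field with a known non-ASCII value in the form, then look at the bytes that come back for that field to work out which encoding the browser actually used. A minimal sketch in Python follows; the field name `_charset_check`, the marker string, and the list of candidate encodings are all my own assumptions, not taken from the referenced page.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical hidden-field name and marker value. Pick a marker whose
# bytes differ between the encodings you want to tell apart.
MARKER_FIELD = "_charset_check"
MARKER_VALUE = "é€"

# Candidate encodings, tried in order (an assumption; adjust to taste).
CANDIDATES = ("utf-8", "windows-1252", "iso-8859-15")

def raw_fields(query_string):
    """Split a query string into field name -> raw (undecoded) value bytes."""
    fields = {}
    for pair in query_string.split("&"):
        if not pair:
            continue
        name, _, value = pair.partition("=")
        key = unquote_to_bytes(name.replace("+", " ")).decode("ascii", "replace")
        fields[key] = unquote_to_bytes(value.replace("+", " "))
    return fields

def detect_encoding(query_string):
    """Return the first candidate encoding that reproduces the marker, else None."""
    marker = raw_fields(query_string).get(MARKER_FIELD)
    if marker is None:
        return None
    for enc in CANDIDATES:
        try:
            if marker.decode(enc) == MARKER_VALUE:
                return enc
        except UnicodeDecodeError:
            continue
    return None
```

Once `detect_encoding` has named an encoding, the remaining fields can be decoded with it; if it returns None, the submission didn't carry a recognizable marker and probably deserves to be rejected or logged.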