in reply to How to generate random sequence of UTF-8 characters
The real problem I need to address is how best to manage a transition of our system (which is in production), from a state in which everything is encoded as latin1 to a state in which everything is encoded in utf-8. I thought, until I have sufficient time to test everything from the transformation of our tables from latin1 through binary to utf-8, and how well our code behaves when it is dealing only with utf-8, I'd first convert one form to utf-8 and then, on the server side, convert the text received from the form from utf-8 to latin1 (and then, when a user wants to see it, convert it back from latin1 to utf-8). But I want a good sample to test to determine whether or not such conversions are reversible.
Since your current production system is (presumably) humming along just fine with Latin1 encoding, the Latin1 character set is all you need to worry about for the page content you already have and for the data that you've received and (presumably) retained in your database tables. Converting that part to utf8 is easy - almost trivial.
Obviously, your existing content and data don't involve anything in Greek, Cyrillic, Chinese, Korean, Devanagari, Arabic, Armenian, etc., because those things can't coexist with single-byte Latin1 encoding. So converting the existing stuff to utf8 is a solved problem: just read the original stuff as Latin1 data, and write a copy for the new system as utf8 data - PerlIO's encoding layer takes care of that.
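To make the "read as Latin1, write as utf8" step concrete, here is a minimal sketch. It is written in Python purely for illustration (in Perl you would let the I/O encoding layers do the same job), and the function and file names are hypothetical. It also assumes the source really is ISO-8859-1; if some of it is actually Windows-1252, the bytes 0x80-0x9F mean something different and need the matching decoder instead.

```python
import os
import tempfile


def latin1_bytes_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 bytes."""
    # Every byte value 0x00-0xFF is a valid Latin-1 character,
    # so the decode step cannot fail.
    return data.decode("latin-1").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path as Latin-1 and write a UTF-8 copy to dst_path."""
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding="latin-1", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "old_latin1.txt")  # hypothetical file names
        new = os.path.join(tmp, "new_utf8.txt")
        with open(old, "wb") as fh:
            fh.write(b"caf\xe9 na\xefve\r\n")
        convert_file(old, new)
        with open(new, "rb") as fh:
            print(fh.read())
```

The original file is never modified; the copy is what the new system reads.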
As for testing what your forms (and your code for handling uploaded form data) will do with utf8 data: if you really want to test it on every known utf8 character in some sort of randomized fashion, I think you've got the wrong idea of what testing should look like in this domain.
If you test all the Latin characters (there are a few hundred accented variants), some Greek, some Cyrillic, some Chinese/Korean/Japanese, some Thai, etc, you'll have done a sufficient job of making your service fairly robust to multi-language utf8 usage. If you want to cope with complicated character rendering, throw in some Devanagari-based data (Hindi), some Tamil and Malayalam; to get a sense of the punishment that is bi-directional text, toss in some Hebrew; to get both of those extra challenges in one swell foop, use Arabic, Persian (Farsi), and/or Urdu text.
And for heaven's sake (and your own peace of mind), make sure you can consult with people who know the languages you decide to test on, to make sure your test results are intelligible for users of those languages.
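A small hand-picked sample along those lines might look like the sketch below (again Python, for illustration only; the sample strings are just language names I chose, not a vetted test corpus). It also answers the reversibility question directly: text survives a utf8 -> latin1 -> utf8 round trip only if every character in it exists in Latin1, so the stopgap of storing form input as Latin1 cannot hold anything from the other scripts.

```python
# One short sample per script; all of these are assumptions chosen for
# illustration, and real tests should use text reviewed by native readers.
SAMPLES = {
    "latin": "café naïve Ærøskøbing",
    "greek": "Ελληνικά",
    "cyrillic": "Русский",
    "chinese": "中文",
    "japanese": "日本語",
    "korean": "한국어",
    "thai": "ไทย",
    "hindi": "हिन्दी",
    "tamil": "தமிழ்",
    "malayalam": "മലയാളം",
    "hebrew": "עברית",
    "arabic": "العربية",
}


def survives_latin1_round_trip(text: str) -> bool:
    """True if text can be stored as Latin-1 and read back unchanged."""
    try:
        return text.encode("latin-1").decode("latin-1") == text
    except UnicodeEncodeError:
        # At least one character has no Latin-1 representation.
        return False


def survives_utf8_round_trip(text: str) -> bool:
    """True if text can be stored as UTF-8 and read back unchanged."""
    return text.encode("utf-8").decode("utf-8") == text


if __name__ == "__main__":
    for script, sample in SAMPLES.items():
        print(script,
              "latin1 ok" if survives_latin1_round_trip(sample) else "latin1 LOSSY",
              "utf8 ok" if survives_utf8_round_trip(sample) else "utf8 LOSSY")
```

Passing a check like this only shows the bytes come back intact; whether the text displays sensibly is the part that needs a reader of the language.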
If you don't actually intend or expect to become a multi-alphabet service, then the overall task is phenomenally easier: once you handle the trivial conversion of your Latin1-based content to utf8 encoding, just make sure your "untainting" logic knows to accept Latin1 data and reject anything else. (Update: what I mean is: accept utf8-encoded latin characters, and perhaps some non-spacing combining accent marks, and reject utf8 characters outside that range.) Once you know that you can receive posted utf8 content correctly from your forms, you can trust a simple regex match to catch anything outside the range of expected/allowable characters.
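The kind of regex check I have in mind can be sketched as follows (Python for illustration; the exact ranges are my assumption about what "latin characters plus combining accents" should cover, so adjust them to your own policy). It must run on text that has already been decoded from utf8, not on raw bytes.

```python
import re

# Matches any single character OUTSIDE the allowed set.
# The allowed ranges below are an assumption, not a standard:
DISALLOWED = re.compile(
    r"[^"
    r"\t\n\r"          # ordinary whitespace controls
    r"\u0020-\u007e"   # printable ASCII
    r"\u00a0-\u00ff"   # Latin-1 Supplement
    r"\u0100-\u024f"   # Latin Extended-A and Extended-B
    r"\u0300-\u036f"   # combining diacritical marks
    r"]"
)


def first_disallowed(text: str):
    """Return the first character outside the allowed set, or None."""
    match = DISALLOWED.search(text)
    return match.group(0) if match else None


def is_acceptable(text: str) -> bool:
    """True if every character in text is in the allowed set."""
    return first_disallowed(text) is None


if __name__ == "__main__":
    for candidate in ["café", "cafe\u0301", "Ελληνικά", "oops\ufffd"]:
        print(repr(candidate), is_acceptable(candidate))
```

Note that U+FFFD falls outside these ranges, so a whitelist like this also rejects input that was mangled before it reached you.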
Note that when people post garbage into a form that is supposed to upload as utf8 data, the server will see \x{fffd} characters (replacement characters, which are what you get when you try to treat a string as utf8, and it happens to contain a byte or byte sequence that violates the conditions for utf8 encoding). So you might want to watch out for that in particular.
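Watching for that can be as simple as the sketch below (Python for illustration; the function name is hypothetical). It decodes leniently, so malformed bytes turn into U+FFFD instead of raising an error, and then reports whether any replacement character is present. One caveat: a client can legitimately send a well-formed U+FFFD, so a hit means "suspicious", not proof of bad bytes.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_form_bytes(raw: bytes):
    """Decode raw form bytes as UTF-8.

    Returns (text, suspicious), where suspicious is True if the decoded
    text contains U+FFFD, either because the input held malformed UTF-8
    or because the client sent that character itself.
    """
    text = raw.decode("utf-8", errors="replace")
    return text, REPLACEMENT in text


def is_valid_utf8(raw: bytes) -> bool:
    """Strict alternative: True only if raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    good = "café".encode("utf-8")
    bad = b"caf\xe9"  # Latin-1 bytes posted where UTF-8 was expected
    print(decode_form_bytes(good), is_valid_utf8(good))
    print(decode_form_bytes(bad), is_valid_utf8(bad))
```

Latin1 bytes arriving at a utf8 form handler are a likely source of this during the transition, which makes the check doubly worth having.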
I think it would be a waste of time to generate tons of "random character sequences" (e.g. randomly juxtaposing characters from different alphabets, syllabaries, sets of symbols, ideographs, etc), because it'll trigger all sorts of trouble that would (almost) never happen in the wild. (Update: what I mean is: displays will do weird stuff as a result, and the weirdness will be unrelated to whether your code is doing the right thing; meanwhile, you won't have tested for the more important issue of intelligibility.)