http://www.perlmonks.org?node_id=1009848


in reply to How to generate random sequence of UTF-8 characters

I'm a little puzzled about this part of the OP:
The real problem I need to address is how best to manage a transition of our system (which is in production), from a state in which everything is encoded as latin1 to a state in which everything is encoded in utf-8. I thought, until I have sufficient time to test everything from the transformation of our tables from latin1 through binary to utf-8, and how well our code behaves when it is dealing only with utf-8, I'd first convert one form to utf-8 and then, on the server side, convert the text received from the form from utf-8 to latin1 (and then, when a user wants to see it, convert it back from latin1 to utf-8). But I want a good sample to test to determine whether or not such conversions are reversible.

Since your current production system is (presumably) humming along just fine with Latin1 encoding, your existing page content, and the existing data that you've received and (presumably) retained in your database tables, can contain only the Latin1 character set - that's all you need to worry about. Converting that part to utf8 is easy - almost trivial.

Obviously, your existing content and data don't involve anything in Greek, Cyrillic, Chinese, Korean, Devanagari, Arabic, Armenian, etc., because those things can't coexist with single-byte Latin1 encoding. So converting the existing stuff to utf8 is a solved problem: just read the original stuff as Latin1 data, and write a copy for the new system as utf8 data - PerlIO's encoding layer takes care of that.
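
For what it's worth, here's a minimal sketch of that conversion using PerlIO's encoding layers (the file names are just placeholders):

    use strict;
    use warnings;

    # Read bytes as Latin1, write them back out as UTF-8; PerlIO does
    # the transcoding on both ends.
    open my $in,  '<:encoding(latin1)', 'content.lat1.txt'
        or die "can't open latin1 source: $!";
    open my $out, '>:encoding(UTF-8)',  'content.utf8.txt'
        or die "can't open utf8 target: $!";

    print {$out} $_ while <$in>;

    close $in;
    close $out or die "error writing utf8 copy: $!";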

As for testing what your forms (and your code for handling uploaded form data) will do with utf8 data, if you really want to test it on every known utf8 character in some sort of randomized fashion, I think you've got the wrong conception of testing your code for this domain.

If you test all the Latin characters (there are a few hundred accented variants), some Greek, some Cyrillic, some Chinese/Korean/Japanese, some Thai, etc, you'll have done a sufficient job of making your service fairly robust to multi-language utf8 usage. If you want to cope with complicated character rendering, throw in some Devanagari-based data (Hindi), some Tamil and Malayalam; to get a sense of the punishment that is bi-directional text, toss in some Hebrew; to get both of those extra challenges in one swell foop, use Arabic, Persian (Farsi), and/or Urdu text.
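
A small set of fixed sample strings goes a long way for this; something like the sketch below (the strings are arbitrary examples - get real text, with help from native speakers):

    use strict;
    use warnings;
    use utf8;                                # literal non-ASCII below
    use open ':std', ':encoding(UTF-8)';

    my %samples = (
        latin    => 'déjà vu, straße, œuvre',
        greek    => 'αλφάβητο',
        cyrillic => 'проверка',
        cjk      => '漢字 テスト 한글',
        hebrew   => 'עברית',
        arabic   => 'اختبار',
    );
    print "$_: $samples{$_}\n" for sort keys %samples;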

And for heaven's sake (and your own peace of mind), make sure you can consult with people who know the languages you decide to test on, to make sure your test results are intelligible for users of those languages.

If you don't actually intend or expect to become a multi-alphabet service, then the overall task is phenomenally easier: once you handle the trivial conversion of your Latin1-based content to utf8 encoding, just make sure your "untainting" logic knows to accept Latin1 data and reject anything else. (Update: what I mean is: accept utf8-encoded Latin characters, and perhaps some non-spacing combining accent marks, and reject utf8 characters outside that range.) Once you know that you can receive posted utf8 content correctly from your forms, you can trust a simple regex match to catch anything outside the range of expected/allowable characters.
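
As a rough sketch of that kind of untainting check (the exact set of allowed classes is your call - this one accepts Latin-script letters, combining marks, digits, punctuation, and whitespace):

    use strict;
    use warnings;

    # $text must already be decoded from utf8 bytes to characters
    sub is_acceptable_latin {
        my ($text) = @_;
        return $text =~ /\A[\p{Latin}\p{M}\p{N}\p{P}\p{Zs}\s]*\z/;
    }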

Note that when people post garbage into a form that is supposed to upload as utf8 data, the server will see \x{fffd} characters (replacement characters, which are what you get when you try to decode a string as utf8 and it contains a byte or byte sequence that violates the rules of utf8 encoding). So you might want to watch out for that in particular.
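
A quick way to catch that, assuming you decode the posted bytes with Encode's default "replace malformed sequences" behavior:

    use strict;
    use warnings;
    use Encode qw(decode);

    my $raw  = "caf\xC3\xA9 is fine, but this byte is not utf8: \xFF";
    my $text = decode('UTF-8', $raw);    # malformed bytes become U+FFFD

    if ($text =~ /\x{FFFD}/) {
        warn "posted data contained invalid utf8 byte sequences\n";
    }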

I think it would be a waste of time to generate tons of "random character sequences" (e.g. randomly juxtaposing characters from different alphabets, syllabaries, sets of symbols, ideographs, etc), because it'll trigger all sorts of trouble that would (almost) never happen in the wild. (Update: what I mean is: displays will do weird stuff as a result, and the weirdness will be unrelated to whether your code is doing the right thing; meanwhile, you won't have tested for the more important issue of intelligibility.)

Re^2: How to generate random sequence of UTF-8 characters
by ted.byers (Monk) on Dec 21, 2012 at 06:42 UTC

    Thanks for this. You're right. I am not actually too worried about the data we already have, as it is all latin1. So, you're right that converting it all to utf-8 is trivial. My concern is the increasing tendency for the company to internationalize, so it is just a matter of time before we start getting characters that are not valid latin1. I had thought that until I got my database converted, I might handle it by converting the utf-8 data into something that could be stored in a latin1 database, and converting that back to utf-8 when it is to be displayed (a temporary procedure until I finish converting my database - though I suppose that might only make the eventual transition to all-utf-8 harder, since the encoded data would have to be decoded again).

    On the one hand, I have to change some of our forms to utf-8 encoding because, to reduce data entry errors, I have to use the locales packages to display countries and smaller administrative units in chained dropdown boxes, and these do not display correctly unless the web page is utf-8. On the other hand, some of the data comes from a feed from another company, and they are entirely utf-8. A colleague of mine, dealing with the same feed, dealt with it by determining which utf-8 characters in the feed were not valid latin1 characters, and he used that info to construct a regex to filter out utf-8 characters that could not be accommodated in his latin1 database. Obviously, I find his approach distasteful at best because it discards data, and the user can never see exactly what he had originally entered; and his code grew increasingly ugly as it accumulated dozens of lines applying one regex filter after another. In both cases, occurrence of a utf-8 character that is not a valid latin1 character causes the SQL that inserts it into the db to fail, and that in turn leads to hours of work to determine precisely what data didn't make it into the DB, and to 'edit' the data so it could be inserted into the DB in some form.

    I want to do the opposite of what my colleague did, and just convert the whole thing, eventually, to utf-8, so that the db holds, and the app displays, the data exactly as entered.

    Now this raises a question as to a) how do I determine the range of acceptable utf-8 characters you speak of (and express that in code), and b) how do I express such a constraint in my documentation, so that integrators who code to my API know to use only characters in the acceptable range? I'd also have to put something on my web forms to tell the user not to bother entering characters outside the acceptable range; but how do I do that in a way that is readily understood by most users? The last thing I want is either that my code dies a nasty death because someone entered data I can't handle, or that a user enters such data and gets either no result or gibberish back. It is better that the user knows ahead of time not to bother entering certain sets of characters. I suppose if the 'filter' you speak of can be used within Data::FormValidator, then JavaScript::DataFormValidator could use the same rule to prevent the users of my forms from entering data I can't handle. But I'd still have to document the constraint for the users of my API.

    Thanks

    Ted

      In terms of determining the "acceptability" of posted data, your primary (and perhaps only) concern should be to protect against input that could cause damage or that isn't parsable as utf8. So the common protections against using untrusted data in sensitive operations, together with checks for "\x{fffd}", ought to suffice in terms of technical validation.
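
      Since you mentioned Data::FormValidator, one way to wire such a check into a validation profile might look like this - an untested sketch, where the field name, sample data, and allowed character classes are just placeholders:

          use strict;
          use warnings;
          use Data::FormValidator;

          # hypothetical decoded form data
          my %form_data = ( comment => 'Bonjour, naive user' );

          my $profile = {
              required           => ['comment'],
              constraint_methods => {
                  # reject replacement characters; allow only letters,
                  # marks, digits, punctuation, and whitespace
                  comment => qr/\A(?!.*\x{FFFD})[\pL\pM\pN\pP\s]*\z/s,
              },
          };

          my $results = Data::FormValidator->check( \%form_data, $profile );
          print $results->has_invalid ? "rejected\n" : "accepted\n";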

      If you're trying to collect content from users who might be trying to spoof your forms just for the fun of loading your database with noise, you can't solve that just by limiting the range of characters available for submission. You'll have belabored your code, and people will still be able to post garbage (so long as they stay within the prescribed set of characters).

      If your users are limited to the set of people actually trying to do something productive with your service, you should be able to trust them to use just the characters that make sense to them - even if you don't know the full range of characters that they might find useful for a given form submission. If it's valid utf8, and you're just going to store it safely in a table, there's nothing more for you to worry about. You probably should make it easy for them to confirm that the content can make the trip back to them intact, so they can be confident that you got the data that they intended to send.
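
      For instance, a round-trip check in a test script can be as simple as this (the sample string is arbitrary; in practice the bytes would go through your table and back):

          use strict;
          use warnings;
          use utf8;
          use Encode qw(encode decode);

          my $original = 'naïve, проверка, 試験';
          my $bytes    = encode('UTF-8', $original);   # what you'd store
          my $back     = decode('UTF-8', $bytes);      # what you'd read back

          print $back eq $original ? "round trip intact\n" : "data mangled\n";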

      If a given input field is supposed to contain just digits, you can still check for digits in utf8 strings using m/^\d+$/, regardless of whether those digits are in the ASCII range, the "full width" range often used in combination with Chinese characters, the Arabic range, or whatever. Likewise for letters (\pL), diacritic marks (\pM), punctuation (\pP), and other such character classes. If you need to enforce "language consistency" - e.g. when users go to a Russian form, you should expect them to post only Russian characters in some fields - there are classes for that too (e.g. "\p{Cyrillic}"; perlunicode and perlre have more info on character class operators for utf8-aware regexes).
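
      For instance (assuming the strings have already been decoded to characters; the sample values are arbitrary):

          use strict;
          use warnings;
          use utf8;   # literal non-ASCII characters below

          my $digits = '１２３';        # full-width digits still match \d
          print "digits ok\n"   if $digits =~ /^\d+$/;

          my $word = 'привет';
          print "cyrillic ok\n" if $word =~ /^\p{Cyrillic}+$/;

          my $name = 'José';
          print "name ok\n"     if $name =~ /^[\pL\pM]+$/;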

      So, just as when you only had to worry about Latin1, the expected content (and/or use) of a particular input field is what determines the conditions you test for in validation. The basic strategy is the same when the encoding is utf8 - you just have a larger range of predefined character classes to work with (and a more nuanced interpretation of the classes you've already been using).