PerlMonks  

Re^2: How to generate random sequence of UTF-8 characters

by ted.byers (Scribe)
on Dec 21, 2012 at 06:42 UTC ( #1009878=note )


in reply to Re: How to generate random sequence of UTF-8 characters
in thread How to generate random sequence of UTF-8 characters

Thanks for this. You're right. I am not actually too worried about the data we already have, as it is all latin1, so converting it all to utf-8 is trivial. My concern is the company's increasing tendency to internationalize, so it is just a matter of time before we start getting characters that are not valid latin1. I had thought that, until I got my database converted, I might handle it by converting the utf-8 data into something that could be stored in a latin1 database, and converting that back to utf-8 when it is subsequently displayed. That would be a temporary procedure until I finish converting my database, but I suppose it might only make the eventual transition to all-utf-8 data harder, as the encoded data would have to be decoded again.

On the one hand, I have to change some of our forms to utf-8 encoding, because, to reduce data entry errors, I have to use the locales packages to display countries and smaller administrative units in chained dropdown boxes, and these do not display correctly unless the web page is utf-8. On the other hand, some of the data comes from a feed from another company, and that feed is entirely utf-8. A colleague of mine, dealing with the same feed, dealt with it by determining which utf-8 characters that were not valid latin1 characters turned up in the feed at a given time, and he used that info to construct a regex to filter out utf-8 characters that could not be accommodated in his latin1 database. Obviously, I find his approach distasteful at best, because it discards data and the user can never see exactly what he had originally entered; and his code grew increasingly ugly as it accumulated dozens of lines applying one regex filter after another. In both cases, the occurrence of a utf-8 character that is not a valid latin1 character causes the SQL that inserts it into the db to fail, and that in turn leads to hours of work to determine precisely what data didn't make it into the DB, and to 'edit' the data so it could be inserted into the DB in some form.

I want to do the opposite of what my colleague did, and just convert the whole thing, eventually, to utf-8, so that the db holds, and the app displays, the data exactly as entered.

Now this raises the question of a) how do I determine the range of acceptable utf-8 characters you speak of (and express that in code), and b) how do I express such a constraint in my documentation, so that integrators who code to my API know to use only characters in the acceptable range? I'd also have to put something on my web forms to tell the user not to bother entering characters outside the acceptable range; but how do I do that in a way that is readily understood by most users? The last thing I want is either that my code dies a nasty death because someone entered data I can't handle, or that a user enters such data and gets either no result or gibberish back. It is better that the user knows ahead of time just not to bother entering certain sets of characters. I suppose if the 'filter' you speak of can be used within Data::FormValidator, JavaScript::DataFormValidator could use the same rule to prevent the users of my forms from entering data I can't handle. But I'd still have to document the constraint for the users of my API.

Thanks

Ted


Re^3: How to generate random sequence of UTF-8 characters
by graff (Chancellor) on Dec 21, 2012 at 07:29 UTC
    In terms of determining the "acceptability" of posted data, your primary (and perhaps only) concern should be to protect against input that could cause damage or that isn't parsable as utf8. So the common protections against using untrusted data in sensitive operations, together with checks for "\x{fffd}" ought to suffice in terms of technical validation.
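A minimal sketch of that technical check, assuming the posted data arrives as raw bytes and is decoded with the core Encode module. The helper name decode_utf8_strict is hypothetical, not something named in this thread. It rejects input that is not well-formed UTF-8, and also input that already contains U+FFFD, which usually means some earlier step replaced bytes it could not decode.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (name is an assumption for illustration).
# Returns the decoded character string, or undef if the bytes are not
# well-formed UTF-8 or the text contains the replacement character U+FFFD.
sub decode_utf8_strict {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input instead of silently
    # substituting U+FFFD; LEAVE_SRC keeps $bytes unmodified.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return undef unless defined $text;

    # A U+FFFD already present in the data suggests it was damaged upstream.
    return undef if $text =~ /\x{FFFD}/;

    return $text;
}
```

A caller would treat an undef result as a failed validation and report it back to the user, rather than letting the SQL insert fail later.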

    If you're trying to collect content from users who might be trying to spoof your forms just for the fun of loading your database with noise, you can't solve that just by limiting the range of characters available for submission. You'll have belabored your code, and people will still be able to post garbage (so long as they stay within the prescribed set of characters).

    If your users are limited to the set of people actually trying to do something productive with your service, you should be able to trust them to use just the characters that make sense to them - even if you don't know the full range of characters that they might find useful for a given form submission. If it's valid utf8, and you're just going to store it safely in a table, there's nothing more for you to worry about. You probably should make it easy for them to confirm that the content can make the trip back to them intact, so they can be confident that you got the data that they intended to send.

    If a given input field is supposed to contain just digits, you can still check for digits in utf8 strings using m/^\d+$/, regardless of whether those digits are in the ASCII range, the "full width" range often used in combination with Chinese characters, the Arabic range, or whatever. Likewise for letters (\pL), diacritic marks (\pM), punctuation (\pP) and other such character classes. If you need to enforce "language consistency" - e.g. when users go to a Russian form, you should expect them to post only Russian characters in some fields - there are classes for that too, e.g. "\p{Cyrillic}". (perlunicode and perlre have more info on character class operators for utf8-aware regexes.)
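The character classes mentioned above can be sketched as follows. This assumes the strings have already been decoded into Perl character strings, not raw utf8 bytes. The function names are hypothetical, chosen only for illustration. The sketch uses \A and \z rather than ^ and $ so that a trailing newline is not accepted.

```perl
use strict;
use warnings;

# Hypothetical validators (names are assumptions); each expects a decoded
# character string, not raw UTF-8 bytes.

# Digits from any script: ASCII, full-width, Arabic-Indic, and so on.
sub is_digits   { return $_[0] =~ /\A\d+\z/ ? 1 : 0 }

# Letters plus combining diacritic marks, in any script.
sub is_letters  { return $_[0] =~ /\A[\pL\pM]+\z/ ? 1 : 0 }

# Only characters belonging to the Cyrillic script.
sub is_cyrillic { return $_[0] =~ /\A\p{Cyrillic}+\z/ ? 1 : 0 }

# Test data is written with \x{...} escapes so this file stays pure ASCII.
my $fullwidth = "\x{FF11}\x{FF12}";                          # full-width "12"
my $arabic    = "\x{0661}\x{0662}";                          # Arabic-Indic "12"
my $privet    = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}"; # Russian "Privet"

for my $candidate ("2012", $fullwidth, $arabic, "12a") {
    printf "%d\n", is_digits($candidate);
}
printf "%d\n", is_cyrillic($privet);
```

Note that \d accepting non-ASCII digits is a consequence of this approach: if a field must later be used as a number in arithmetic or in SQL, a stricter class such as [0-9] may be the better test for that field.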

    So, just like when you only had to worry about Latin1, the expected content (and/or use) of a particular input field is what determines the conditions you test for in validation. The basic strategy is the same when the encoding is utf8 - you just have a larger range of predefined character classes to work with (and a more nuanced interpretation of the classes you've already been using).
