http://www.perlmonks.org?node_id=1009778

ted.byers has asked for the wisdom of the Perl Monks concerning the following question:

If I were to generate a sequence of all possible alphanumeric ASCII characters, it would be trivially simple. I would create an array as follows:

 my @chars = qw( 1 2 3 4 5 6 7 8 9 0 a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z );

I would then use the Mersenne Twister random number generator to select a character from the array to add onto one of the random sequences in the sample of such sequences I am attempting to construct. In fact, my favourite password generator uses a modification of this to produce very strong, memorable passwords. The question is, though, what would I add to @chars in order to be able to generate a set of random sequences that, together, are certain to contain all possible valid UTF-8 characters (some of which, I understand, can need as many as 6 bytes to represent them)? Or is there a better way to generate samples of random sequences in which the sample is certain to completely cover the sample space? This is with the caveat that I need only alphanumeric characters, as the purpose involves testing the ability of my code to handle text and numbers entered by a user on a UTF-8 encoded web page. Thus, non-printable characters, control characters, etc., while they may be well defined, are not of interest. I will need to untaint this data, and store it in my DB.

The real problem I need to address is how best to manage a transition of our system (which is in production), from a state in which everything is encoded as latin1 to a state in which everything is encoded in utf-8. I thought, until I have sufficient time to test everything from the transformation of our tables from latin1 through binary to utf-8, and how well our code behaves when it is dealing only with utf-8, I'd first convert one form to utf-8 and then, on the server side, convert the text received from the form from utf-8 to latin1 (and then, when a user wants to see it, convert it back from latin1 to utf-8). But I want a good sample to test to determine whether or not such conversions are reversible.

I'd welcome suggestions for handling either the random sequences of UTF-8 characters or a transition of a data-driven web application from latin1 to utf-8, or both.

Thanks

Ted

Re: How to generate random sequence of UTF-8 characters
by davido (Cardinal) on Dec 20, 2012 at 20:17 UTC

    Bytes::Random::Secure: Most of this module deals with returning random bytes, but there is one function that returns a random string by picking octets at random from a user-supplied "bag" string. It works with Unicode strings as well (it turned out to be a simple thing to implement, despite the module's "Bytes" name).

    use strict;
    use warnings;
    use Bytes::Random::Secure qw( random_string_from );
    use feature qw/unicode_strings/;

    binmode STDOUT, ':utf8';

    my $string = random_string_from(
        join( '', map { chr($_) } 0 .. 0x10FFFF ),  # Generate a "bag" string.
        16384    # Return a string of 16384 octets from the "bag".
    );

    print $string, "\n";
    print length($string), "\n";

    Not terribly efficient, but if you precompute the "bag" string it's efficient enough for most purposes.

    Care was taken to ensure that there's no modulo bias, which would be likely to turn up in simplistic solutions. The random generator is the ISAAC algorithm, and the generator is seeded using the strongest source available on the target platform. Performance can be improved by making sure that Math::Random::ISAAC::XS is installed on the target system; the ISAAC generator will use it if it's available. On Windows systems there's one additional dependency to assure a strong seed. You'll have to read the POD if that's an issue.


    Dave

      Thanks Dave. I will take a look at that

      Thanks

      Ted

Re: How to generate random sequence of UTF-8 characters
by bart (Canon) on Dec 20, 2012 at 19:51 UTC
    A random sequence of all possible UTF-8 characters? That is going to be quite a huge string.

    Anyway, don't worry about UTF-8. You just need to create characters, which you can do with chr($code), where $code is an integer that is a valid Unicode code point. That is basically any value between 0 and 0x10FFFF, except that some of those will represent unprintable characters, such as control characters, and you probably don't want those.

    Let Perl worry about converting it to UTF-8.
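As a rough illustration of building that list of usable code points, here is a minimal sketch. It assumes "usable" means "matches \w under Unicode rules", and it skips surrogates and noncharacters explicitly so that chr() is never called on them. The sub name word_code_points is made up for this example.

```perl
use v5.14;
use strict;
use warnings;

# Return the code points in [$lo, $hi] whose character matches \w.
# Surrogates and noncharacters are skipped before chr() is called;
# unassigned code points simply fail the \w test.
sub word_code_points {
    my ($lo, $hi) = @_;
    my @keep;
    for my $cp ($lo .. $hi) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # UTF-16 surrogates
        next if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # noncharacters
        next if ($cp & 0xFFFE) == 0xFFFE;          # U+xxFFFE and U+xxFFFF
        push @keep, $cp if chr($cp) =~ /\A\w\z/u;  # /u forces Unicode rules
    }
    return @keep;
}

my @codepoints = word_code_points(0, 0x10FFFF);
printf "%d word characters available\n", scalar @codepoints;
```

The exact count depends on the Unicode version bundled with your Perl, so it is not shown here.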

      Thanks. I will give it a try.

      There is a point of misunderstanding, though, and that is that I am aiming for a sample of random sequences. Each sequence would be five to ten characters, but the sample would be comprised of a few million such sequences. Thus, if my sample size is ten million strings, and each string is ten characters, and there are a million valid utf-8 characters, then each character would be in the sample an average of 100 times. It is a statistical approach; each item in the sample has just a tiny portion of all possible values, but the whole sample includes all possible values multiple times. I tend to be a bit thorough when testing code I am not familiar with (my code for computing eigensystems of general matrices was tested on 100 million randomly generated matrices, with not one failure BTW).

      Thanks again.

      Ted

        Ah, OK. Well the system is the same as you used to generate passwords: once you have an array of valid code points for the characters, you can choose 10 random codepoints from that array, like
        $chosencodepoint = $codepoints[int rand @codepoints];
        and construct the string, for example with:
        $string = pack 'U*', @chosencodepoints;
        (pack('U', $int) is pretty much equivalent to chr($int) except it also guarantees the output will be turned into UTF-8, and most of all: with the star this can easily join multiple characters without an explicit join.)

        My point was: when printing it out, Perl will convert it to valid UTF-8. Don't worry about that.

        And in case you want no duplicates, you can repeat the process for each character until you find no duplicates. With so many characters to choose from, that is virtually guaranteed to be faster than shuffling the whole array (with the Fisher-Yates shuffle) and then picking the first 10 code points. Or, you can make a custom version of Fisher-Yates that stops shuffling after 10 iterations.
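The "stop after 10 iterations" idea could look something like the sketch below. The sub name pick_distinct and the small Greek/Cyrillic pool are only for illustration, and it uses Perl's built-in rand rather than a Mersenne Twister module.

```perl
use strict;
use warnings;

# Pick $n distinct elements by running only the first $n steps of a
# Fisher-Yates shuffle on a copy of the pool.
sub pick_distinct {
    my ($n, @pool) = @_;
    $n = @pool if $n > @pool;
    for my $i (0 .. $n - 1) {
        my $j = $i + int rand(@pool - $i);    # random index in $i .. $#pool
        @pool[$i, $j] = @pool[$j, $i];        # swap the chosen element into place
    }
    return @pool[0 .. $n - 1];
}

# Illustrative pool: Greek and Cyrillic lowercase letters.
my @codepoints = (0x3B1 .. 0x3C9, 0x430 .. 0x44F);
my $string = pack 'U*', pick_distinct(10, @codepoints);

binmode STDOUT, ':encoding(UTF-8)';
print "$string\n";
```

Passing the pool by value copies it on every call, which is wasteful for a pool of a million code points; taking an array reference and undoing the swaps afterwards would avoid that, at the cost of a little more code.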

Re: How to generate random sequence of UTF-8 characters
by tobyink (Canon) on Dec 20, 2012 at 20:26 UTC

    Here's a string containing every Unicode word character (i.e. alphanumeric plus underscore) in a random order.

    use List::Util 'shuffle';
    my $all = join q[], grep /^\w$/, map chr, shuffle 0 .. 0x10FFFF;

    Now split that up into parts of whatever length you like!
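One possible way to do the splitting, sketched with a plain regex so that it counts characters rather than bytes. The sub name chunks is made up for this example.

```perl
use strict;
use warnings;

# Split a character string into pieces of at most $len characters.
sub chunks {
    my ($str, $len) = @_;
    return $str =~ /(.{1,$len})/gs;
}

my @parts = chunks("abcdefghijkl", 5);
print join('|', @parts), "\n";    # prints "abcde|fghij|kl"
```

The last piece is shorter whenever the string length is not a multiple of the chunk size.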

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Thanks. I will experiment with that.

      Thanks

      Ted

Re: How to generate random sequence of UTF-8 characters
by johngg (Canon) on Dec 20, 2012 at 19:56 UTC
    it would be trivially simple

    But very long-winded! Perhaps using Range Operators and map would save some typing?

    $ perl -Mstrict -Mwarnings -E '
    > my @chars = q{0} .. q{9};
    > push @chars, map { $_ => uc } q{a} .. q{z};
    > say for @chars;'
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9
    a
    A
    b
    B
    c
    C
    ...
    y
    Y
    z
    Z
    $

    This does not address your problem but I hope it is of interest.

    Cheers,

    JohnGG

      Here's another way:

      my @chars = ( 0..9,'a'..'z','A'..'Z' );
      print @chars;
      __DATA__
      0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ

      update: And one more:

      my $string = join '',0..9,'a'..'z';
      $string =~ s/([a-z])/$1\u$1/g;
      print $string;
      __DATA__
      0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

        Thanks lotus and JohnGG. Both solutions are quite interesting.

        Thanks

        Ted

Re: How to generate random sequence of UTF-8 characters
by BrowserUk (Patriarch) on Dec 20, 2012 at 19:52 UTC
    How best to manage a transition of our system (which is in production), from a state in which everything is encoded as latin1 to a state in which everything is encoded in utf-8.

    Give up now.

    Or, read the Encode module documentation until your forehead is flat and the wall, dented and bloody; and then give up.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

Re: How to generate random sequence of UTF-8 characters
by ted.byers (Monk) on Dec 20, 2012 at 23:47 UTC

    OK. Slapping ideas together, and extracting commonalities, from among the responses so far and combining with constructing a random string, I tried the following experiment.

    use strict;
    use warnings;
    use Math::Random::MT::Auto::Range;

    binmode(STDOUT, ":utf8");
    $| = 1;

    #the following exhausts all memory
    #my $all = join q[], grep /^\w$/, map chr, shuffle 0 .. 0x10FFFF;
    #print $all,"\n";
    #
    # so I tried:
    my @chars = map(chr, 0 .. 0x10FFFF);
    my $rng = Math::Random::MT::Auto::Range->new(LO => 0, HI => 0x10FFFF, TYPE => 'INTEGER');

    # three options for doing the basically the same thing
    for (my $i = 0 ; $i < 10 ; $i++) {
        my $code = $rng->rrand;
        print "$i <=> $code <=>",chr($code),"\n";
        print "\t$i <=> $code <=>",$chars[$code],"\n";
        print "\t$i <=> $code <=>",pack("U",$code),"\n";
    }

    There are a couple problems with this. chr generates some errors and warnings. There are a couple thousand instances of "UTF-16 surrogate 0xd81b at c:/Work/test.utf8.latin1.pl line 16.", and over a hundred instances of "Unicode non-character 0xfdd0 is illegal for interchange at c:/Work/test.utf8.latin1.pl line 16.", each for a different integer. Line 16 is the line where @chars is initialized. I suppose a couple thousand problem characters in @chars is not a huge issue, but given that I want to use it to test functions for converting from utf-8 to latin1 and back, I expect that if they happen to occur in my sample, there'd be some false indications of errors in these functions (from the Perl package 'Encode').

    The second problem is that though I put the statement "binmode(STDOUT, ":utf8");" at the start of the script, the printout contains only rectangles and squares where the UTF-8 character ought to be; at least when I execute within Emacs. When I execute the script in the Windows command-line terminal, I invariably get gibberish (different for each character) the width of four characters. How, then, do I actually see the characters? I thought, probably mistakenly, I'd see a few Greek or Sanskrit characters or characters from other alphabets.

    What I was thinking of doing is amend the loop I show above to construct a single string from the ten characters produced, and then use the functions from 'Encode' to convert it to latin1, and then back to utf-8, to see if the input string and the output string are the same (and if not, then this whole idea is doomed to fail). I'd repeat this test for a few million random utf-8 strings, and if there are no failures, then I could use this idea as a temporary measure until I can test our systems to see how best to adapt to use of utf-8 throughout.

    What, then, do I do to exclude the integers that either result in a utf-16 surrogate or represent illegal characters? And is it possible for me to actually see the characters produced?

    Thanks

    Ted

      At this point you need to take a step back and consider what you actually want to achieve. To help that process take a look at Unicode#Code_point_planes_and_blocks and consider that large areas of the code space are undefined and that even the areas where there are defined characters it is very likely that your system doesn't have font support for the characters anyway. Also take a look at ISO-8859-1 - if all you are doing is checking conversion from latin 1 to UTF-8 then you have fewer than 256 characters to bother about.
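To make the "fewer than 256 characters" point concrete, here is a small sketch of an exhaustive round-trip check using Encode. The sub name survives_latin1 is invented for the example, and it relies on encode() substituting '?' for unmappable characters when no CHECK argument is given.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# True if $text comes back unchanged after being encoded to Latin-1 bytes
# and decoded again. Characters that ISO-8859-1 cannot represent are
# replaced by '?' during encoding, so they do not survive.
sub survives_latin1 {
    my ($text) = @_;
    my $bytes = encode('ISO-8859-1', $text);
    return decode('ISO-8859-1', $bytes) eq $text;
}

my @failed = grep { !survives_latin1(chr $_) } 0 .. 0xFF;
printf "Latin-1 range: %d failures\n", scalar @failed;
printf "Greek alpha survives: %s\n", survives_latin1("\x{3b1}") ? 'yes' : 'no';
```

Every code point up to 0xFF should round-trip, while anything above it (the Greek alpha here) is lost. That is why the plan of storing arbitrary utf-8 form data in Latin-1 and converting it back later cannot be reversible in general.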

      The other part of the problem is that Windows essentially does not do UTF-8. Internally Windows uses UTF-16LE, but most often what gets rendered is the Windows-1252 code page. Emacs seems to be doing the right thing - render a rectangle when the character is missing from whatever font is being used.

      True laziness is hard work
Re: How to generate random sequence of UTF-8 characters
by graff (Chancellor) on Dec 21, 2012 at 03:47 UTC
    I'm a little puzzled about this part of the OP:
    The real problem I need to address is how best to manage a transition of our system (which is in production), from a state in which everything is encoded as latin1 to a state in which everything is encoded in utf-8. I thought, until I have sufficient time to test everything from the transformation of our tables from latin1 through binary to utf-8, and how well our code behaves when it is dealing only with utf-8, I'd first convert one form to utf-8 and then, on the server side, convert the text received from the form from utf-8 to latin1 (and then, when a user wants to see it, convert it back from latin1 to utf-8). But I want a good sample to test to determine whether or not such conversions are reversible.

    Since your current production system is (presumably) humming along just fine with Latin1 encoding, then for the existing page content you have, and for the existing data that you've received and (presumably) retained in your database tables, you only have the Latin1 character set that you need to worry about. Converting that part to utf8 is easy - almost trivial.

    Obviously, your existing content and data don't involve anything in Greek, Cyrillic, Chinese, Korean, Devanagari, Arabic, Armenian, etc, because those things can't coexist with single-byte Latin1 encoding. So converting the existing stuff to utf8 is a solved problem: just read the original stuff as Latin1 data, and write a copy for the new system as utf8 data - PerlIO's encoding layer takes care of that.
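A minimal sketch of that read-as-Latin1, write-as-utf8 copy for a flat file, using PerlIO encoding layers. The sub name and the file names in the commented-out call are hypothetical; a database conversion would go through the database's own tools instead.

```perl
use strict;
use warnings;

# Copy $src to $dst, decoding the input as Latin-1 and encoding the
# output as UTF-8. The PerlIO layers do the conversion.
sub latin1_to_utf8_file {
    my ($src, $dst) = @_;
    open my $in,  '<:encoding(ISO-8859-1)', $src or die "Cannot read $src: $!";
    open my $out, '>:encoding(UTF-8)',      $dst or die "Cannot write $dst: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $out or die "Cannot close $dst: $!";
    close $in;
    return;
}

# Hypothetical file names, for illustration only:
# latin1_to_utf8_file('content.latin1.txt', 'content.utf8.txt');
```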

    As for testing what your forms (and your code for handling uploaded form data) will do with utf8 data, if you really want to test it on every known utf8 character in some sort of randomized fashion, I think you've got the wrong conception of testing your code for this domain.

    If you test all the Latin characters (there are a few hundred accented variants), some Greek, some Cyrillic, some Chinese/Korean/Japanese, some Thai, etc, you'll have done a sufficient job of making your service fairly robust to multi-language utf8 usage. If you want to cope with complicated character rendering, throw in some Devanagari-based data (Hindi), some Tamil and Malayalam; to get a sense of the punishment that is bi-directional text, toss in some Hebrew; to get both of those extra challenges in one swell foop, use Arabic, Persian (Farsi), and/or Urdu text.

    And for heaven's sake (and your own peace of mind), make sure you can consult with people who know the languages you decide to test on, to make sure your test results are intelligible for users of those languages.

    If you don't actually intend or expect to become a multi-alphabet service, then the overall task is phenomenally easier: once you handle the trivial conversion of your Latin1-based content to utf8 encoding, just make sure your "untainting" logic knows to accept Latin1 data and reject anything else. (Update: what I mean is: accept utf8-encoded latin characters, and perhaps some non-spacing combining accent marks, and reject utf8 characters outside that range.) Once you know that you can receive posted utf8 content correctly from your forms, you can trust a simple regex match to catch anything outside the range of expected/allowable characters.

    Note that when people post garbage into a form that is supposed to upload as utf8 data, the server will see \x{fffd} characters (replacement characters, which are what you get when you try to treat a string as utf8, and it happens to contain a byte or byte sequence that violates the conditions for utf8 encoding). So you might want to watch out for that in particular.
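Here is one way such a check might look, assuming the posted data arrives as raw bytes and is decoded with Encode. Both sub names are invented for the example. The lenient version looks for U+FFFD after decoding; the strict version refuses malformed input outright.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Lenient decode: malformed byte sequences become U+FFFD, which is then
# detectable in the decoded text.
sub has_replacement_char {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    return $text =~ /\x{FFFD}/ ? 1 : 0;
}

# Strict decode: returns the decoded text, or undef if the bytes are not
# well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;
    return scalar eval { decode('UTF-8', $bytes, FB_CROAK) };
}

my $good = "caf\xC3\xA9 au lait";    # well-formed UTF-8 bytes
my $bad  = "caf\xE9 au lait";        # a lone Latin-1 byte, not valid UTF-8

printf "good: replacement=%d strict=%s\n",
    has_replacement_char($good), defined decode_strict($good) ? 'ok' : 'rejected';
printf "bad:  replacement=%d strict=%s\n",
    has_replacement_char($bad), defined decode_strict($bad) ? 'ok' : 'rejected';
```

Note that the lenient check also fires when a client legitimately sends U+FFFD itself, so the strict variant is the less ambiguous of the two.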

    I think it would be a waste of time to generate tons of "random character sequences" (e.g. randomly juxtaposing characters from different alphabets, syllabaries, sets of symbols, ideographs, etc), because it'll trigger all sorts of trouble that would (almost) never happen in the wild. (Update: what I mean is: displays will do weird stuff as a result, and the weirdness will be unrelated to whether your code is doing the right thing; meanwhile, you won't have tested for the more important issue of intelligibility.)

      Thanks for this. You're right. I am not actually too too worried about the data we already have, as it is all latin1. So, you're right in that converting it all to utf-8 is trivial. My concern is the increasing tendency for the company to internationalize, so it is just a matter of time before we start getting characters that are not valid latin1. I had thought that until I got my database converted, I might handle it by converting the utf-8 data into something that could be stored in a latin1 database, and convert that back to utf-8 when it is to be displayed subsequently (a temporary procedure until I finish converting my database - but I suppose use of that might only make the transition harder in terms of having all the data in utf-8 eventually as the encoded data would have to be unencoded).

      On the one hand, I have to change some of our forms to utf-8 encoding, because, to reduce data entry errors, I have to use the locales packages to display countries and smaller administrative units in chained dropdown boxes, and these do not display correctly unless the web page is utf-8. On the other hand, some of the data comes from a feed from another company, and they are entirely utf-8. A colleague of mine, dealing with the same feed, dealt with it by determining what utf-8 character, which was not a valid latin1 character, was found in the feed at a given time, and he used that info to construct a regex to filter out utf-8 characters that could not be accommodated in his latin1 database. Obviously, I find his approach distasteful at best because it discards data, and the user can never see exactly what he had originally entered; and his code grew increasingly ugly as it accumulated dozens of lines applying one regex filter after another. In both cases, occurrence of a utf-8 character that is not a valid latin1 character causes the SQL that inserts it into the db to fail, and that in turn leads to hours of work to determine precisely what data didn't make it into the DB, and to 'edit' the data so it could be inserted into the DB in some form.

      I want to do the opposite of what my colleague did, and just convert the whole thing, eventually, to utf-8, so that the db holds, and the app displays, the data exactly as entered.

      Now this raises a question as to a) how do I determine the range of acceptable utf-8 characters you speak of (and express that in code), and b) how do I express such a constraint in my documentation, so that integrators that code to my API know to use only characters in the acceptable range? I'd also have to put something on my web forms to indicate to the user to not bother entering characters outside the acceptable range; but how do I do that in a way that is readily understood by most users? The last thing I want to happen is either that my code dies a nasty death because someone entered data I can't handle or that a user enters such data and gets either no result or gibberish back. It is better that the user knows ahead of time just not to bother entering certain sets of characters. I suppose if the 'filter' you speak of can be used within Data::FormValidator, JavaScript::DataFormValidator could use the same rule to prevent the users of my forms from entering data I can't handle. But I'd still have to document the constraint for the users of my API.

      Thanks

      Ted

        In terms of determining the "acceptability" of posted data, your primary (and perhaps only) concern should be to protect against input that could cause damage or that isn't parsable as utf8. So the common protections against using untrusted data in sensitive operations, together with checks for "\x{fffd}" ought to suffice in terms of technical validation.

        If you're trying to collect content from users who might be trying to spoof your forms just for the fun of loading your database with noise, you can't solve that just by limiting the range of characters available for submission. You'll have belabored your code, and people will still be able to post garbage (so long as they stay within the prescribed set of characters).

        If your users are limited to the set of people actually trying to do something productive with your service, you should be able to trust them to use just the characters that make sense to them - even if you don't know the full range of characters that they might find useful for a given form submission. If it's valid utf8, and you're just going to store it safely in a table, there's nothing more for you to worry about. You probably should make it easy for them to confirm that the content can make the trip back to them intact, so they can be confident that you got the data that they intended to send.

        If a given input field is supposed to contain just digits, you can still check for digits in utf8 strings using m/^\d+$/ regardless of whether those digits are in the ASCII range, the "full width" range often used in combination with Chinese characters, the Arabic range, or whatever. Likewise for letters (\pL), diacritic marks (\pM), punctuation (\pP) and other such character classes. If you need to enforce "language consistency" (e.g. when users go to a Russian form, you should expect them to post only Russian characters in some fields), there are classes for that too, e.g. "\p{Cyrillic}"; perlunicode and perlre have more info on character class operators for utf8-aware regexes.
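A small sketch of per-field validation along those lines. The field kinds, the %validators table and the is_valid sub are all made up for illustration, and the input is assumed to be already-decoded character strings rather than raw bytes.

```perl
use v5.14;
use strict;
use warnings;

# Hypothetical per-field patterns built from Unicode character classes.
my %validators = (
    digits   => qr/\A\d+\z/u,               # digits from any script
    name     => qr/\A[\pL\pM\pP ]+\z/u,     # letters, marks, punctuation, spaces
    cyrillic => qr/\A[\p{Cyrillic} ]+\z/u,  # Cyrillic script plus spaces
);

# True if $value matches the pattern registered for $kind.
sub is_valid {
    my ($kind, $value) = @_;
    my $re = $validators{$kind} or die "Unknown field kind: $kind";
    return $value =~ $re ? 1 : 0;
}

print is_valid('digits',   "\x{FF11}\x{FF12}\x{FF13}") ? "ok\n" : "rejected\n";  # full-width 123
print is_valid('digits',   "12a")                      ? "ok\n" : "rejected\n";
print is_valid('cyrillic', "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}")
    ? "ok\n" : "rejected\n";
```

If a field must hold ASCII digits only (for example because it is later used in arithmetic or passed to another system), [0-9] or the /a modifier is the safer choice, since \d under Unicode rules accepts digits from every script.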

        So, just like when you only had to worry about Latin1, the expected content (and/or use) of a particular input field is what determines the conditions you test for in validation. The basic strategy is the same when the encoding is utf8: you just have a larger range of predefined character classes to work with (and a more nuanced interpretation of the classes you've already been using).