I'm a little puzzled about this part of the OP:
The real problem I need to address is how best to manage a transition of our system (which is in production), from a state in which everything is encoded as latin1 to a state in which everything is encoded in utf-8. I thought, until I have sufficient time to test everything from the transformation of our tables from latin1 through binary to utf-8, and how well our code behaves when it is dealing only with utf-8, I'd first convert one form to utf-8 and then, on the server side, convert the text received from the form from utf-8 to latin1 (and then, when a user wants to see it, convert it back from latin1 to utf-8). But I want a good sample to test to determine whether or not such conversions are reversible.

Since your current production system is (presumably) humming along just fine with Latin1 encoding, the existing page content and the data that you've received and (presumably) retained in your database tables involve only the Latin1 character set. Converting that part to utf8 is easy - almost trivial.
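To see for yourself that the Latin1 part is reversible, here is a minimal sketch using the core Encode module. It pushes all 256 possible single-byte values through Latin1 to utf8 and back and compares the result with the input. The variable names are mine, purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# All 256 possible single-byte values, as a raw byte string.
my $latin1_bytes = join '', map { chr } 0 .. 255;

# Latin1 bytes -> Perl characters -> UTF-8 bytes.
my $chars      = decode('ISO-8859-1', $latin1_bytes);
my $utf8_bytes = encode('UTF-8', $chars);

# And back again: UTF-8 bytes -> characters -> Latin1 bytes.
my $round_trip = encode('ISO-8859-1', decode('UTF-8', $utf8_bytes));

my $verdict = $round_trip eq $latin1_bytes ? 'lossless' : 'lossy';
print "Latin1 -> UTF-8 -> Latin1 round trip is $verdict\n";
```

Note that the reverse direction (utf8 to Latin1) is only safe for text that stays within those 256 characters, which is the whole point of the rest of this reply.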

Obviously, your existing content and data don't involve anything in Greek, Cyrillic, Chinese, Korean, Devanagari, Arabic, Armenian, etc, because those things can't coexist with single-byte Latin1 encoding. So converting the existing stuff to utf8 is a solved problem: just read the original stuff as Latin1 data, and write a copy for the new system as utf8 data - PerlIO's encoding layer takes care of that.
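A rough sketch of that read-as-Latin1, write-as-utf8 copy follows. The helper name latin1_file_to_utf8 and the file names are hypothetical; the demo writes a throwaway file in a temporary directory so the snippet runs on its own.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: copy a Latin1-encoded file to a UTF-8-encoded file,
# letting the PerlIO encoding layers do the conversion.
sub latin1_file_to_utf8 {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:encoding(ISO-8859-1)', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)',      $out_path or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Demo: write one Latin1 line as raw bytes, convert it, read the raw result.
my $dir = tempdir(CLEANUP => 1);
my $src = File::Spec->catfile($dir, 'latin1.txt');
my $dst = File::Spec->catfile($dir, 'utf8.txt');

open my $raw_out, '>:raw', $src or die "Cannot write $src: $!";
print {$raw_out} "caf\xE9\n";    # byte E9 is e-acute in Latin1
close $raw_out or die "Cannot close $src: $!";

latin1_file_to_utf8($src, $dst);

open my $raw_in, '<:raw', $dst or die "Cannot read $dst: $!";
my $converted = do { local $/; <$raw_in> };
close $raw_in;
```

After the call, $converted holds the bytes of the new file, where the single Latin1 byte E9 has become the two-byte utf8 sequence C3 A9.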

As for testing what your forms (and your code for handling uploaded form data) will do with utf8 data, if you really want to test it on every known utf8 character in some sort of randomized fashion, I think you've got the wrong conception of testing your code for this domain.

If you test all the Latin characters (there are a few hundred accented variants), some Greek, some Cyrillic, some Chinese/Korean/Japanese, some Thai, etc, you'll have done a sufficient job of making your service fairly robust to multi-language utf8 usage. If you want to cope with complicated character rendering, throw in some Devanagari-based data (Hindi), some Tamil and Malayalam; to get a sense of the punishment that is bi-directional text, toss in some Hebrew; to get both of those extra challenges in one swell foop, use Arabic, Persian (Farsi), and/or Urdu text.
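Something like the following could serve as a starting point for such a hand-picked corpus. The samples (each language's name in its own script) and the helper survives_utf8_round_trip are my own illustrative choices, not a vetted test suite; in a real test you would replace the encode/decode pair with an actual trip through your form and database.

```perl
use strict;
use warnings;
use utf8;    # the sample strings below are UTF-8 literals in the source
use Encode qw(decode encode);

# A small hand-picked corpus: the name of each language in its own script.
my %samples = (
    greek    => 'Ελληνικά',
    russian  => 'Русский',
    chinese  => '中文',
    japanese => '日本語',
    korean   => '한국어',
    thai     => 'ไทย',
    hindi    => 'हिन्दी',
    hebrew   => 'עברית',
    arabic   => 'العربية',
);

# True if the characters come back unchanged after a trip through UTF-8 bytes.
sub survives_utf8_round_trip {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return decode('UTF-8', $bytes) eq $text ? 1 : 0;
}

for my $name (sort keys %samples) {
    my $status = survives_utf8_round_trip($samples{$name}) ? 'ok' : 'FAILED';
    printf "%-8s %s\n", $name, $status;
}
```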

And for heaven's sake (and your own peace of mind), make sure you can consult with people who know the languages you decide to test on, to make sure your test results are intelligible for users of those languages.

If you don't actually intend or expect to become a multi-alphabet service, then the overall task is phenomenally easier: once you handle the trivial conversion of your Latin1-based content to utf8 encoding, just make sure your "untainting" logic knows to accept Latin1 data and reject anything else. (Update: what I mean is: accept utf8-encoded latin characters, and perhaps some non-spacing combining accent marks, and reject utf8 characters outside that range.) Once you know that you can receive posted utf8 content correctly from your forms, you can trust a simple regex match to catch anything outside the range of expected/allowable characters.
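As a sketch of what that "simple regex match" might look like: the exact whitelist below is an assumption on my part (tab, newline, carriage return, printable ASCII, the printable half of Latin1, and the combining diacritical marks block), so adjust the ranges to whatever your service really needs. The name is_allowed_text is hypothetical, and the input must already be decoded into characters.

```perl
use strict;
use warnings;

# Assumed whitelist: tab/newline/CR, printable ASCII, U+00A0-U+00FF,
# and the combining diacritical marks U+0300-U+036F.
my $ALLOWED = qr/\A[\x{09}\x{0A}\x{0D}\x{20}-\x{7E}\x{A0}-\x{FF}\x{300}-\x{36F}]*\z/;

# Hypothetical check: true only if every character is on the whitelist.
sub is_allowed_text {
    my ($text) = @_;    # a decoded character string, not raw bytes
    return $text =~ $ALLOWED ? 1 : 0;
}

my @inputs = (
    "caf\x{E9}",       # precomposed e-acute, inside Latin1
    "cafe\x{301}",     # plain e followed by a combining acute accent
    "\x{3A9}mega",     # starts with Greek capital omega, outside the range
);

for my $input (@inputs) {
    my $verdict = is_allowed_text($input) ? 'accepted' : 'rejected';
    print "$verdict\n";
}
```

The first two inputs are accepted and the third is rejected. A replacement character (U+FFFD) also falls outside these ranges, so it is rejected as well.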

Note that when people post garbage into a form that is supposed to upload as utf8 data, the server will see \x{fffd} characters (replacement characters, which are what you get when you try to treat a string as utf8, and it happens to contain a byte or byte sequence that violates the conditions for utf8 encoding). So you might want to watch out for that in particular.
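Here is one way to watch for that, sketched with the core Encode module. The three helper names are hypothetical. The lenient decode substitutes U+FFFD for malformed bytes, which you can then search for; the strict variant uses Encode's FB_CROAK flag so that decoding dies on malformed input instead.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Lenient decode: malformed bytes become U+FFFD rather than raising an error.
sub decode_form_value {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

# True if the decoded text contains at least one replacement character.
sub has_replacement_char {
    my ($text) = @_;
    return index($text, "\x{FFFD}") >= 0 ? 1 : 0;
}

# Strict alternative: FB_CROAK makes decode() die on malformed input;
# LEAVE_SRC keeps decode() from modifying the caller's byte string.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $good = "caf\xC3\xA9 au lait";    # well-formed UTF-8 bytes
my $bad  = "caf\xE9 au lait";        # a bare Latin1 byte, not valid UTF-8

for my $bytes ($good, $bad) {
    my $text   = decode_form_value($bytes);
    my $status = has_replacement_char($text) ? 'contains U+FFFD' : 'clean';
    print "$status\n";
}
```

Which of the two approaches fits better depends on whether you would rather store the damaged text and flag it, or refuse the submission outright.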

I think it would be a waste of time to generate tons of "random character sequences" (e.g. randomly juxtaposing characters from different alphabets, syllabaries, sets of symbols, ideographs, etc), because it'll trigger all sorts of trouble that would (almost) never happen in the wild. (Update: what I mean is: displays will do weird stuff as a result, and the weirdness will be unrelated to whether your code is doing the right thing; meanwhile, you won't have tested for the more important issue of intelligibility.)

In reply to Re: How to generate random sequence of UTF-8 characters by graff
in thread How to generate random sequence of UTF-8 characters by ted.byers
