Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Thanks for this. You're right. I am not actually too too worried about the data we already have, as it is all latin1. So, you're right in that converting it all to utf-8 is trivial. My concern is the increasing tendency for the company to internationalize, so it is just a matter of time before we start getting characters that are not valid latin1. I had thought that until I got my database converted, I might handle it by converting the utf-8 data into something that could be stored in a latin1 database, and convert that back to utf-8 when it is to be displayed subsequently (a temporary procedure until I finish converting my database - but I suppose use of that might only make the transition harder in terms of having all the data in utf-8 eventually as the encoded data would have to be unencoded).

On the one hand, I have to change some of our forms to utf-8 encoding, because, to reduce data entry errors, I have to use the locales packages to display countries and smaller administrative units in chained dropdown boxes, and these do not display correctly unless the web page is utf-8. On the other hand, some of the data comes from a feed from another company, and they are entirely utf-8. A colleague of mine, dealing with the same feed, delt with it by determining what utf-8 character, which was not a valid latin1 character, was found in the feed at a given time, and he used that info to construct a regex to filter out utf-8 characters that could not be accomodated in his latin1 database. Obviously, I find his approach distasteful at best because it discards data, and the user can never see exactly what he had originally entered; and his code grew increasingly ugly as it accumulated dozens of lines applying one regex filter after another. In both cases, occurance of a utf-8 character that is not a valid latin1 character causes the SQL that inserts it into the db to fail, and that in turn leads to hours of work to determine precisely what data didn't make it into the DB, and to 'edit' the data so it could be inserted into the DB in some form.

I want to do the opposite of what my colleague did, and just convert the whole thing, eventually, to utf-8, so that the db holds, and the app displays, the data exactly as entered.

Now this raises a question as to a) how do I determine the range of acceptable utf-8 characters you speak of (and express that in code), and b) how do I do I express such a constraint in my documentation, so that integrators that code to my API know to use only characters in the acceptable range? I'd also have to put something on my web-forms to indicate to the user to not bother entering characters outside the acceptable range; but how do I do that in a way that is readily understood by most users? The last thing I want to happen is either that my code dies a nasty death because someone entered data I can't handle or that a user enters such data and gets either no result or gibberish back. It is better that the user knows ahead of time just not to bother entering certain sets of characters. I suppose if the 'filter' you speak of can be used within Data::FormValidator,JavaScript::DataFormValidator could use the same rule to prevent the users of my forms from entering data I can't handle. But I'd still have to document the constraint for the users of my API.



In reply to Re^2: How to generate random sequence of UTF-8 characters by ted.byers
in thread How to generate random sequence of UTF-8 characters by ted.byers

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    [Corion]: A good weekstart to everybody!

    How do I use this? | Other CB clients
    Other Users?
    Others about the Monastery: (7)
    As of 2018-06-18 08:18 GMT
    Find Nodes?
      Voting Booth?
      Should cpanminus be part of the standard Perl release?

      Results (109 votes). Check out past polls.