As your "next first step", I would strongly recommend some detailed diagnosis of the non-ASCII content in your data. It seems pretty clear that the stuff from your "old, non-strict database" is not utf8, and you seem to expect that there might be a mixture of different encodings being used for the characters that are not ASCII.

So, locate the rows that contain non-ASCII characters in one or more fields, isolate those fields, and look at them in a way that shows what the non-ASCII characters are, and where they are in the string. From that, you might be able to figure out (based on the ASCII characters in the context, if any) what each non-ASCII character should be (that is, which character of which character set).

Then all you need to do is to create edited versions of the affected rows, replacing the non-ASCII characters with their correct utf8 equivalents.

Here is code to locate and print (in human-readable form) the affected rows:

#!/usr/bin/perl -n
print "$.:\t$_" if ( s/([^\x00-\x7f])/sprintf("\\x{%02x}",ord($1))/eg );
If your data contains, e.g., a row with the single-byte à (cp1252 or iso-8859-1 "letter a with grave accent") between spaces, the program above will print the row with that letter being shown as follows:
NNN: .... \x{e0} ...
(Here "NNN" is the line number in the input file, "..." is whatever comes before and/or after " à ", and "e0" is the hexadecimal value of that byte.) Note that this script treats the input as raw bytes (or at least it should, unless your shell environment is messing that up). If there are any multi-byte characters in the data, each one will appear as a sequence of two or more consecutive "\x{hh}" strings.
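
As a complement to the hex dump, it may help to sort the affected rows into those that are already well-formed utf8 and those that are not. The following is only a sketch: the function name classify_bytes and the sample strings are made up for illustration, and it assumes the data is held as raw bytes (no -C switch, PERL_UNICODE setting, or "use open" layer in effect).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Classify a raw byte string as 'ascii', 'utf8' (well-formed UTF-8),
# or 'other' (non-ASCII bytes that do not form valid UTF-8).
# classify_bytes is a hypothetical helper name, not part of Encode.
sub classify_bytes {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[^\x00-\x7f]/;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $ok = eval { decode( 'UTF-8', $copy, FB_CROAK ); 1 };
    return $ok ? 'utf8' : 'other';
}

# Made-up sample rows: plain ASCII, a two-byte UTF-8 e-acute,
# and a single-byte (cp1252 / iso-8859-1) e-acute.
my @samples = ( "plain text", "caf\xc3\xa9", "caf\xe9" );
print classify_bytes($_), "\n" for @samples;
# prints: ascii, utf8, other (one per line)
```

Rows reported as "other" are the ones whose encoding still has to be worked out by looking at the hex output and the surrounding ASCII context. A row reported as "utf8" is only well-formed, which does not by itself prove that utf8 was the intended encoding.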

If you find that all the rows with non-ASCII data are using the same encoding, then the job is easy: use Encode (as suggested above) to convert the whole data stream from that encoding to utf8. If different encodings are used in different rows, you'll need to create some sort of mapping table, keyed by row number or something similar, to associate each row with its appropriate encoding.
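
A minimal sketch of the mapping-table approach, using Encode's decode and encode. The table contents, the default encoding, and the helper name row_to_utf8 are all assumptions for illustration; the real table would come from the diagnosis described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical mapping table: row number => encoding found by inspection.
my %encoding_for_row = ( 3 => 'iso-8859-2' );

# Assumed fallback for rows that are not listed in the table.
my $default_encoding = 'cp1252';

# Convert one row of raw bytes to utf8; pure-ASCII rows pass through as-is.
sub row_to_utf8 {
    my ( $row_number, $bytes ) = @_;
    return $bytes unless $bytes =~ /[^\x00-\x7f]/;
    my $from = $encoding_for_row{$row_number} // $default_encoding;
    return encode( 'UTF-8', decode( $from, $bytes ) );
}

# The same byte, e0, converted under two different source encodings.
for my $n ( 1, 3 ) {
    my $out = row_to_utf8( $n, "\xe0" );
    print "$n: ", join( ' ', map { sprintf '%02x', ord } split //, $out ), "\n";
}
# prints:
# 1: c3 a0
# 3: c5 95
```

If every row turns out to use the same encoding, the table can stay empty and only the default matters, which is the "easy" case.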

In reply to Re: UTF8 Validity by graff
in thread UTF8 Validity by menolly
