comment on

As your "next first step", I would strongly recommend some detailed diagnosis of the non-ASCII content in your data. It seems pretty clear that the stuff from your "old, non-strict database" is not utf8, and you seem to expect that there might be a mixture of different encodings being used for the characters that are not ASCII.

So, locate the rows that contain non-ASCII characters in one or more fields, isolate those fields, and look at them in a way that shows what the non-ASCII characters are, and where they are in the string. From that, you might be able to figure out (based on the ASCII characters in the context, if any) what each non-ASCII character should be (that is, which character of which character set).

Then all you need to do is to create edited versions of the affected rows, replacing the non-ASCII characters with their correct utf8 equivalents.

Here is the code locate and print (in human-readable form) the affected rows:

#!/usr/bin/perl -n

print "$.:\t$_" if ( s/([^\x00-\x7f])/sprintf("\\x{%02x}",ord($1))/eg 
+);
[download]

If your data contains, e.g., a row with the single-byte à (cp1252 or iso-8859-1 "letter a with grave accent") between spaces, the program above will print the row with that letter being shown as follows:

NNN:    .... \x{e0} ...
[download]

(where "NNN" is the line number in the input file, and "..." is whatever comes before and/or after " à ", and "e0" is the hex numeric value of that byte/character) Note that this script treats the input as raw binary (or at least, it should, unless your shell environment is messing that up). If there are any multi-byte characters in the data, they will appear as sequences of two (or more) consecutive "\x{hh}" strings.

If you find that all the rows with non-ascii data are using the same encoding, then the job is easy: use Encode (as suggested above) to convert the whole data stream from that encoding to utf8. If different encodings are used in different rows, you'll need to create some sort of mapping table, keyed by row number or something, to associate the various rows with their various appropriate encodings.

In reply to Re: UTF8 Validity by graff
in thread UTF8 Validity by menolly

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Come for the quick hacks, stay for the epiphanies.
	PerlMonks