Re: UTF8 Validity

As your "next first step", I would strongly recommend some detailed diagnosis of the non-ASCII content in your data. It seems pretty clear that the stuff from your "old, non-strict database" is not utf8, and you seem to expect that there might be a mixture of different encodings being used for the characters that are not ASCII.

So, locate the rows that contain non-ASCII characters in one or more fields, isolate those fields, and look at them in a way that shows what the non-ASCII characters are, and where they are in the string. From that, you might be able to figure out (based on the ASCII characters in the context, if any) what each non-ASCII character should be (that is, which character of which character set).

Then all you need to do is to create edited versions of the affected rows, replacing the non-ASCII characters with their correct utf8 equivalents.

Here is the code to locate and print (in human-readable form) the affected rows:

#!/usr/bin/perl -n

# print only lines that contain at least one non-ASCII byte, with each such byte shown as \x{hh}
print "$.:\t$_" if ( s/([^\x00-\x7f])/sprintf("\\x{%02x}",ord($1))/eg );

If your data contains, e.g., a row with the single-byte à (cp1252 or iso-8859-1 "letter a with grave accent") between spaces, the program above will print the row with that letter being shown as follows:

NNN:    .... \x{e0} ...

Here "NNN" is the line number in the input file, "..." is whatever comes before and/or after " à ", and "e0" is the hex numeric value of that byte/character. Note that this script treats the input as raw binary (or at least, it should, unless your shell environment is messing that up). If there are any multi-byte characters in the data, they will appear as sequences of two (or more) consecutive "\x{hh}" strings.

If you find that all the rows with non-ASCII data are using the same encoding, then the job is easy: use Encode (as suggested above) to convert the whole data stream from that encoding to utf8. If different encodings are used in different rows, you'll need to create some sort of mapping table, keyed by row number or something, to associate the various rows with their various appropriate encodings.
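To make the mapping-table idea concrete, here is a minimal sketch using the core Encode module. The row numbers, the sample byte strings, the default encoding, and the name row_to_utf8 are all made up for illustration; your own table would come from inspecting the diagnostic output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical table: input line number => source encoding.
# In practice you would fill this in by hand after looking at the \x{hh} output.
my %encoding_for_row = ( 3 => 'cp1251' );
my $default_encoding = 'cp1252';    # assumption: most affected rows use this

# Convert one raw line (bytes) into UTF-8 bytes.
sub row_to_utf8 {
    my ( $line_no, $raw ) = @_;
    return $raw unless $raw =~ /[^\x00-\x7f]/;    # pure ASCII is already valid utf8
    my $enc = $encoding_for_row{$line_no} || $default_encoding;
    # FB_CROAK makes decode() die on bytes that are not defined in $enc,
    # instead of silently substituting a replacement character.
    my $chars = decode( $enc, $raw, Encode::FB_CROAK );
    return encode( 'UTF-8', $chars );
}

# Made-up sample rows: plain ASCII, cp1252 a-grave, one cp1251 Cyrillic letter.
my @rows = ( "plain ascii\n", "voil\xe0\n", "\xcf\n" );

binmode STDOUT;    # we are printing raw UTF-8 bytes
my $line_no = 0;
for my $raw (@rows) {
    $line_no++;
    print row_to_utf8( $line_no, $raw );
}
```

Using FB_CROAK is a deliberate choice here: if a row is assigned the wrong encoding and contains a byte that encoding does not define, you find out right away. It will not catch every wrong guess, though, since most single-byte encodings define nearly all 256 byte values.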


Re^2: UTF8 Validity by menolly (Hermit) on Feb 22, 2008 at 00:47 UTC
Thanks; that's the kind of pointer I need. Most of my non-ASCII/non-UTF8 data is either in contact data or easily connected to contact data, so I've been trying to guess the charset based on the geographic origin, with mixed results. I definitely have multiple encodings present -- so far, there's cp1251 (Cyrillic), latin1, some form of Japanese, and something I can't identify but have scrubbed out in the source DB.
Re^3: UTF8 Validity by graff (Chancellor) on Feb 22, 2008 at 02:18 UTC
Encode::Guess is likely to be helpful for figuring out the source encodings for many of the Asian (multi-byte-char) strings, though it might not help much for distinguishing among single-byte encodings. Worth a try.
Re^4: UTF8 Validity by Anonymous Monk on Feb 22, 2008 at 11:07 UTC
Encode::Guess is lame because the user needs to tell it which encodings the binary data might be in. Use Encode::Detect instead. This is the same detector used in Mozilla browsers.
Re^5: UTF8 Validity by menolly (Hermit) on Feb 22, 2008 at 18:23 UTC