UTF8 Validity

menolly has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: UTF8 Validity by Juerd (Abbot) on Feb 21, 2008 at 19:48 UTC
utf8::valid checks the internal consistency for a string. On the outside, it does not have anything to do with UTF-8 encoding at all.	[reply]
Re^2: UTF8 Validity by menolly (Hermit) on Feb 21, 2008 at 19:57 UTC
Ah. I tried it based on this reply, in the other thread. Is there a module/regex/etc. that I can use to detect non-utf8 data in a string?	[reply]
Re^3: UTF8 Validity by Juerd (Abbot) on Feb 21, 2008 at 20:02 UTC
Just try to decode something as UTF-8. If that fails, fall back to something else. For example: `$foo = eval { decode("UTF-8", $foo, Encode::FB_CROAK) } \|\| decode("CP1252", $foo);` [download]	[reply] [d/l]
Re: UTF8 Validity by graff (Chancellor) on Feb 21, 2008 at 22:47 UTC
As your "next first step", I would strongly recommend some detailed diagnosis of the non-ASCII content in your data. It seems pretty clear that the stuff from your "old, non-strict database" is not utf8, and you seem to expect that there might be a mixture of different encodings being used for the characters that are not ASCII. So, locate the rows that contain non-ASCII characters in one or more fields, isolate those fields, and look at them in a way that shows what the non-ASCII characters are, and where they are in the string. From that, you might be able to figure out (based on the ASCII characters in the context, if any) what each non-ASCII character should be (that is, which character of which character set). Then all you need to do is to create edited versions of the affected rows, replacing the non-ASCII characters with their correct utf8 equivalents. Here is the code locate and print (in human-readable form) the affected rows: `#!/usr/bin/perl -n print "$.:\t$_" if ( s/([^\x00-\x7f])/sprintf("\\x{%02x}",ord($1))/eg +);` [download] If your data contains, e.g., a row with the single-byte à (cp1252 or iso-8859-1 "letter a with grave accent") between spaces, the program above will print the row with that letter being shown as follows: `NNN: .... \x{e0} ...` [download] (where "NNN" is the line number in the input file, and "..." is whatever comes before and/or after " à ", and "e0" is the hex numeric value of that byte/character) Note that this script treats the input as raw binary (or at least, it should, unless your shell environment is messing that up). If there are any multi-byte characters in the data, they will appear as sequences of two (or more) consecutive "\x{hh}" strings. If you find that all the rows with non-ascii data are using the same encoding, then the job is easy: use Encode (as suggested above) to convert the whole data stream from that encoding to utf8. If different encodings are used in different rows, you'll need to create some sort of mapping table, keyed by row number or something, to associate the various rows with their various appropriate encodings.	[reply] [d/l] [select]
Re^2: UTF8 Validity by menolly (Hermit) on Feb 22, 2008 at 00:47 UTC
Thanks; that's the kind of pointer I need. Most of my non-ASCII/non-UTF8 data is either in contact data or easily connected to contact data, so I've been trying to guess the charset based on the geographic origin, with mixed results. I definitely have multiple encodings present -- so far, there's cp1251 (Cyrillic), latin1, some form of Japanese, and something I can't identify but have scrubbed out in the source DB.	[reply]
Re^3: UTF8 Validity by graff (Chancellor) on Feb 22, 2008 at 02:18 UTC
Encode::Guess is likely to be helpful for figuring out the source encodings for many of the Asian (multi-byte-char) strings, though it might not help much for distinguishing among single-byte encodings. Worth a try.	[reply]
Re^4: UTF8 Validity by Anonymous Monk on Feb 22, 2008 at 11:07 UTC
Re^5: UTF8 Validity by menolly (Hermit) on Feb 22, 2008 at 18:23 UTC
Re: UTF8 Validity by ikegami (Patriarch) on Feb 21, 2008 at 20:10 UTC
For all my encoding needs, I use Encode. Encode concurs that `"\xCC\xE5"` is invalid UTF-8. `use Encode qw( decode FB_CROAK ); my $s_utf8 = "\xCC\xE5"; print("utf8::valid = ", utf8::valid($s_utf8)?1:0,"\n"); my $s_chars = eval { decode('UTF-8', $s_utf8, FB_CROAK) }; print("valid decode = ", defined($s_chars)?1:0,"\n");` [download] `utf8::valid = 1 valid decode = 0` [download] I'm having problems locating where `utf8::valid` is implemented, so I don't know why it considers the string valid. I suspect it returns true for all strings of bytes (aka binary strings, strings with UTF8 flag off) such as `$s_utf8` in my code.	[reply] [d/l] [select]
Re: UTF8 Validity by MZSanford (Curate) on Feb 21, 2008 at 21:34 UTC
It is valid UTF-16 ... perhaps that's what you're dealing with. A good resource for this is the fileformat.info character reference page	[reply]
Re: UTF8 Validity by moritz (Cardinal) on Feb 22, 2008 at 10:34 UTC
This is slightly off-topic here, but perhaps it's useful for someone some day: UTF8 ne UTF-8. You can have a string that is valid UTF8 but not valid UTF-8 (UTF-8 is more strict, and allows just one way to encode each codepoint, UTF8 also allows non-canonical encodings). That was my first thought when I read the title "UTF8 Validity", which is not "UTF-8 Validity" ;-)	[reply]


Problems? Is your data what you think it is?
	PerlMonks