Jenda has asked for the wisdom of the Perl Monks concerning the following question:
Does anyone have some code that could guess whether some text (bytes) are Latin1 or UTF8? These are the only options I need to distinguish so a regexp or something that would say "this can't be UTF8" would be just fine.
We get some XML to import from several different companies (new ones being added from time to time). Quite often I find out later that even though the XML either doesn't specify the encoding or claims to be UTF-8, it's actually Latin1. That means that as soon as there are some accented or fancy characters, the XML is rejected with a "not well-formed (invalid token)" message. (MS Word loves to convert quotes, ampersands and dashes to some extended chars.)
Of course the proper solution is to force the other side to either convert the stuff to UTF-8 or change the XML header, but that often takes some time on their end and the clients are not happy in the meantime.
I know I can catch the "invalid token" error, tweak the XML header and try to parse the XML again. I'd like to try to find out before I start the parsing.
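One way to do that pre-parse check might be a strict UTF-8 decode attempt (untested sketch using the core Encode module; the helper name is made up, not anything standard):

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Guess: if the bytes decode cleanly as UTF-8, call it UTF-8;
# otherwise assume Latin1. Pure-ASCII data passes as either.
sub looks_like_utf8 {
    my ($bytes) = @_;    # FB_CROAK may clobber its argument, so work on this copy
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}
```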
Thanks, Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
Re: Guess between UTF8 and Latin1/ISO-8859-1
by bart (Canon) on Jan 21, 2004 at 21:00 UTC
Sure. Using byte-wise processing, all UTF-8 characters with character code >= 128 must match the following pattern:
/[\xC0-\xFF][\x80-\xBF]+/
(Actually you can even put more stringent constraints on the byte sequence, but this will do for a start.)
It means that if you encounter anything matching /[\x80-\xFF]/ outside what's matched by the above pattern, it's not (valid) UTF-8. You can do this, for example, by using this:
my($utf8, $bare) = (0, 0);
use bytes;
while(/(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g) {
$bare++ if defined $1;
$utf8++ unless defined $1;
}
print <<"END";
utf-8: $utf8
bare: $bare
END
The idea behind the pattern is that the properly formed UTF-8 characters are eaten using the first alternative, and the remaining bytes by the second.
If $bare ends up with a value > 0, then it's not UTF-8. If the string doesn't contain any bytes with character code >= 128, then it doesn't matter which you choose. Both $bare and $utf8 will be zero, in that case.
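The "more stringent constraints" mentioned above can be spelled out in full. This is a sketch of a strict validity check following the RFC 3629 byte-sequence table (it rejects overlong forms, surrogates, and code points above U+10FFFF); the sub name is made up:

```perl
# Strict validity check for a UTF-8 byte string.
sub is_strict_utf8 {
    my ($s) = @_;
    use bytes;
    return $s =~ /^(?:
          [\x00-\x7F]                          # ASCII
        | [\xC2-\xDF][\x80-\xBF]               # 2-byte, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, <= U+10FFFF
    )*\z/x;
}
```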
<off_topic>If it is that easy, how come my MS Internet Explorer miserably fails to automatically recognize the fact that some files are Unicode and I get all kinds of weird characters on my screen?</off_topic>
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Probably because Microsoft stopped short of examining the whole file to determine the character set and just examines the declared content type. Not to mention the difficulty of trying to figure out the encoding automatically: there is a big difference between "this is invalid UTF-8, so it must be Latin1" and "this weird stuff must be EUC-KR".
Not to mention, saying a file is Unicode does not specify the encoding. There are multiple encodings for Unicode, and most non-Unicode encodings can be mapped to Unicode, as long as they are declared.
Re: Guess between UTF8 and Latin1/ISO-8859-1
by Joost (Canon) on Jan 21, 2004 at 20:57 UTC
A couple of pointers:
Personally and professionally, I take the stance that any XML file that doesn't start with a byte-order mark is NOT Unicode. Anything in that group that doesn't say something about its encoding in the text declaration is interpreted as being 7-bit ASCII, and any characters I encounter that exceed the 7-bit range (like Í or whatever) are invalid, so the whole file is rejected unless clear agreements have been made about the actual encoding of the content.
If you make any other assumptions you will be miserable later. I know I have :-/
The numeric entities can open up a can of worms that's hard to close after the fact: you can decide on an ASCII-encoded XML file, but the actual content can be Unicode, Latin-1, Japanese or whatever, so you need to decide on the encoding of the content separately. (Please someone, correct me if I'm wrong. This has been bugging me for too long.)
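For what it's worth, the BOM check at the top of that policy might be sketched like this (the helper name and the return values are my own choices, not any standard API):

```perl
# Sniff a byte-order mark at the start of a file's raw bytes.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return undef;   # no BOM: fall back to the declaration, or assume ASCII
}
```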
Just my €0.02.
Joost.
The whole point of this is that I do not want to reject stuff I don't have to.
But you will have to reject something; it's not possible to just guess the encoding (and be correct all the time, that is)... I'd go for being pedantic, and just contact them, saying they don't follow the standard and try to shame them into fixing it.
Joost.
Re: Guess between UTF8 and Latin1/ISO-8859-1
by hardburn (Abbot) on Jan 21, 2004 at 20:35 UTC
In perluniintro, under the "Questions with Answers" section, there is an example of how to check if a string contains Unicode. It comes with a big warning that you really don't want to do this . . .
---- I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
-- Schemer
: () { :|:& };:
Note: All code is untested, unless otherwise stated
It seems you meant the answer to the "How Do I Know Whether My String Is In Unicode?" question, right? Well, I don't care whether Perl thinks the string is Unicode (I know it does not); I want to know whether the string of bytes "could be" UTF-8. Anyway, the later answers seem to be what I need. I did try the pack() solution and it seems to be working fine.
I'll try several ways suggested in that manpage and by other responders and come back with some benchmarks :-)
Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
Re: Guess between UTF8 and Latin1/ISO-8859-1
by ysth (Canon) on Jan 21, 2004 at 21:10 UTC
Assuming perl thinks it's utf8 data to begin with, you can catch the "invalid" warnings before they happen with something like (untested) Encode::_utf8_off($str) if !utf8::valid($str)
Update: Caveat programmer: don't ever use _utf8_off or _utf8_on except where you know perl has the utf8 flag wrong.
Re: Guess between UTF8 and Latin1/ISO-8859-1
by BrowserUk (Patriarch) on Jan 21, 2004 at 22:26 UTC
Re: Guess between UTF8 and Latin1/ISO-8859-1
by Jenda (Abbot) on Jan 21, 2004 at 22:23 UTC
I tried three options: one using the pack('U0U*', ...) solution from perluniintro, a second using the regexps suggested by bart, and a third using Encode::decode_utf8(), also from perluniintro. The decode_utf8() solution is the fastest by far:
my $test = 0;
use warnings;
use bytes;
use Benchmark;
use Encode qw(encode_utf8 decode_utf8);
my $xml;
{
my $isUTF = 1;
my $sub = sub {$isUTF = 0};
sub byPack {
$isUTF = 1; # reset from any previous run
local $SIG{__WARN__} = $sub;
no warnings 'void';
my @a=unpack( 'U0U*', $xml);
return $isUTF;
}
}
sub byRegExp {
my $bad_utf8 = 0;
while($xml =~ /(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g and !$bad_utf8) {
$bad_utf8++ if defined $1;
}
return !$bad_utf8;
}
sub byDecode {
# relies on decode_utf8() returning undef (false) on malformed input
if (decode_utf8($xml)) {
return 1
} else {
return 0
}
}
print "OK\n";
open XML, '<test-ok.xml';
$xml = do {local $/; <XML>};
close XML;
if ($test) {
print "byPack=".byPack()."\n";
print "byRegExp=".byRegExp()."\n";
print "byDecode=".byDecode()."\n";
} else {
timethese (10000, {
byPack => \&byPack,
byRegExp => \&byRegExp,
byDecode => \&byDecode,
});
}
print "BAD\n";
open XML, '<test-bad.xml';
$xml = do {local $/; <XML>};
close XML;
if ($test) {
print "byPack=".byPack()."\n";
print "byRegExp=".byRegExp()."\n";
print "byDecode=".byDecode()."\n";
} else {
timethese (10000, {
byPack => \&byPack,
byRegExp => \&byRegExp,
byDecode => \&byDecode,
});
}
__END__
OK
Benchmark: timing 10000 iterations of byDecode, byPack, byRegExp...
byDecode: 0 wallclock secs ( 0.22 usr + 0.00 sys = 0.22 CPU) @ 45662.10/s (n=10000)
(warning: too few iterations for a reliable count)
byPack: 15 wallclock secs (15.17 usr + 0.00 sys = 15.17 CPU) @ 659.11/s (n=10000)
byRegExp: 5 wallclock secs ( 4.22 usr + 0.00 sys = 4.22 CPU) @ 2370.79/s (n=10000)
BAD
Benchmark: timing 10000 iterations of byDecode, byPack, byRegExp...
byDecode: 0 wallclock secs ( 0.08 usr + 0.00 sys = 0.08 CPU) @ 128205.13/s (n=10000)
(warning: too few iterations for a reliable count)
byPack: 15 wallclock secs (15.42 usr + 0.00 sys = 15.42 CPU) @ 648.42/s (n=10000)
byRegExp: 5 wallclock secs ( 4.25 usr + 0.00 sys = 4.25 CPU) @ 2352.94/s (n=10000)
The tests were run with two 4KB XMLs, the bad one had an í character added approximately in the middle.
Jenda
Always code as if the guy who ends up maintaining your code
will be a violent psychopath who knows where you live.
-- Rick Osborne
Re: Guess between UTF8 and Latin1/ISO-8859-1
by g00n (Hermit) on Jan 21, 2004 at 22:49 UTC
Reading through the pod source files (as one does when developing) I came across this in perlpodspec.pod.
I've included the text verbatim as I think it gives some insight into the problem. It reads ...
Since Perl recognizes a Unicode Byte Order Mark at the start of files
as signaling that the file is Unicode encoded as in UTF-16 (whether
big-endian or little-endian) or UTF-8, Pod parsers should do the
same. Otherwise, the character encoding should be understood as
being UTF-8 if the first highbit byte sequence in the file seems
valid as a UTF-8 sequence, or otherwise as Latin-1 ...
... A naive but sufficient heuristic for testing the first highbit
byte-sequence in a BOM-less file (whether in code or in Pod!), to see
whether that sequence is valid as UTF-8 (RFC 2279) is to check
whether the first byte in the sequence is in the range
0xC0 - 0xFD and whether the next byte is in the range
0x80 - 0xBF. If so, the parser may conclude that this file is in
UTF-8, and all highbit sequences in the file should be assumed to
be UTF-8. Otherwise the parser should treat the file as being
in Latin-1. In the unlikely circumstance that the first highbit
sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one
can cater to our heuristic (as well as any more intelligent heuristic)
by prefacing that line with a comment line containing a highbit
sequence that is clearly not valid as UTF-8. A line consisting
of simply "#", an e-acute, and any non-highbit byte,
is sufficient to establish this file's encoding.
From this you should be able to work out UTF-8 vs. Latin-1.
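A minimal transcription of that heuristic might look like the following (untested sketch; the sub name is mine, and per the quoted text it only inspects the first highbit byte sequence, so it can be fooled by noise):

```perl
# Classify a BOM-less byte string per the perlpodspec heuristic:
# look only at the first highbit byte and the byte after it.
sub guess_encoding {
    my ($bytes) = @_;
    if ($bytes =~ /([\x80-\xFF])([\x80-\xFF]?)/) {
        my ($first, $next) = ($1, $2);
        return 'utf-8'
            if $first ge "\xC0" && $first le "\xFD"
            && length $next
            && $next ge "\x80" && $next le "\xBF";
    }
    return 'latin-1';   # no highbit bytes, or first sequence not UTF-8-like
}
```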
Re: Guess between UTF8 and Latin1/ISO-8859-1
by graff (Chancellor) on Jan 22, 2004 at 03:29 UTC
if there are no bytes with the 8th bit set then
there's no problem -- nevermind
else
if ( any bytes match /[\xc0\xc1\xc4-\xff]/, or
an odd number of bytes match /[\x80-\xff]/ ) then
it must be Latin1
else
make a copy
delete everything that could be utf8 forms of Latin1 characters:
s/\xc2[\xa0-\xbf]|\xc3[\x80-\xbf]//g;
if this removes all bytes with 8th-bit set, then
the original data is almost certainly utf8
else
the original data is definitely Latin1
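For illustration, the decision tree above might be transcribed roughly as follows (untested; the function name is mine, and like the pseudocode it assumes the only candidates are Latin1 and UTF-8-encoded Latin1):

```perl
# Rough transcription of the decision tree; returns 'ascii',
# 'latin-1' or 'utf-8'.
sub classify {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[\x80-\xff]/;

    # Bytes that can never occur in UTF-8-encoded Latin1 text,
    # or an odd number of highbit bytes, settle it immediately.
    my $highbit = () = $bytes =~ /[\x80-\xff]/g;
    return 'latin-1'
        if $bytes =~ /[\xc0\xc1\xc4-\xff]/ or $highbit % 2;

    # Strip the UTF-8 forms of Latin1 characters from a copy; any
    # highbit bytes left over weren't part of such pairs.
    (my $copy = $bytes) =~ s/\xc2[\xa0-\xbf]|\xc3[\x80-\xbf]//g;
    return $copy =~ /[\x80-\xff]/ ? 'latin-1' : 'utf-8';
}
```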
Now, any of your assurances (assumptions?) may happen to be wrong -- e.g. there may be "noise" in the data, causing a few non-ASCII values to appear "unintentionally"; Latin1 may not be the only single-byte encoding in use; or utf8 encoding may be used on data that happens to include some unicode characters outside the Latin1 range (I've seen this rather often, where Word or some equally clever app uses stuff in the U+2000 range for "recommended forms" of certain punctuation marks -- why these are recommended escapes me at the moment). If any of that could be true for your data, then this simple decision tree could be misleading.
(That last contingency, finding utf8 code points that don't map to Latin1, could be handled if you apply bart's more broadly scoped means for detecting things that look like utf8.)
Update: I adjusted the regex for matching things that look like utf8 renderings of Latin1 characters -- it used to be /[\xc2\xc3][\x80-\xbf]/ which was a bit broader than it needed to be for the situation described in the OP. In utf8, the byte pairs "\xc2\x80" thru "\xc2\x9f" would map to "\x80" thru "\x9f" in Latin1, which do not represent any printable characters. (This fact alone might motivate a check such as
if any bytes match /[\x80-\x9f]/ then
it's pretty sure not to be Latin1
but again, whether this would be enough to conclude that it must be utf8 is just a matter of how much you trust your data, and your knowledge of it.)
One more update: while those byte-level tests are kinda neat, I think I would end up prefering a simpler, two step approach (which I think someone else must have mentioned by now):
use Encode qw(decode);
eval { $_ = decode('utf8', $orig_data, Encode::FB_CROAK) };
if ($@) {
    # it's not utf8, and so must be iso-8859-1
}
Re: Guess between UTF8 and Latin1/ISO-8859-1
by kamal tejnani (Initiate) on Jan 22, 2004 at 05:18 UTC
Hi,
I faced a similar problem when I was doing a project for a client. The hard part is that there _seems_ to be no general solution.
Instead, what we did was convert each XML file into a form in tune with the rest of the design of the software: rather than tweaking and re-parsing on error as you suggested, we made each XML UTF-8 by giving it the proper header. There were several reasons for doing that: UTF-8 will be the standard, it is the encoding recognised by the library modules we have to include with our perl scripts, and so on. One also has to consider that we may have to change the encoding of the browser and the editor so that they match the encoding we have chosen. This can matter when you are checking and debugging, or in general playing around with the different formats.
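Something along these lines, presumably (a sketch of the normalize-to-UTF-8 idea using the core Encode module; the helper name and the fall-back-to-Latin1 assumption are mine):

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Re-encode the body as UTF-8 and stamp a UTF-8 declaration on the
# front. Assumes the input is Latin1 whenever it isn't valid UTF-8.
sub force_utf8_header {
    my ($xml) = @_;
    my $check = $xml;                 # FB_CROAK may clobber its argument
    my $text  = eval { decode('UTF-8', $check, FB_CROAK) };
    $text = decode('ISO-8859-1', $xml) unless defined $text;
    $text =~ s/^<\?xml[^>]*\?>\s*//;  # drop any old declaration
    return qq{<?xml version="1.0" encoding="UTF-8"?>\n}
         . encode('UTF-8', $text);
}
```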