Guess between UTF8 and Latin1/ISO-8859-1

by Jenda (Abbot)
on Jan 21, 2004 at 20:26 UTC ( [id://322988] )

Jenda has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone have some code that could guess whether some text (bytes) is Latin1 or UTF8? These are the only options I need to distinguish, so a regexp or something that would say "this can't be UTF8" would be just fine.

We get some XML to import from several different companies (new ones being added from time to time). Quite often I find out later that even though the XML either doesn't specify the encoding or claims to be UTF-8, it's actually Latin1. Which means that as soon as there are some accented or fancy characters, the XML is rejected with a "not well-formed (invalid token)" message. (MS Word loves to convert quotes, ampersands and dashes to some extended chars.)

Of course the proper solution is to force the other side to either convert the stuff to UTF-8 or change the XML header, but that often takes some time on their end and the clients are not happy in the meantime.

I know I can catch the "invalid token" error, tweak the XML header and try to parse the XML again. I'd like to try to find out before I start the parsing.
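For illustration, the kind of fallback I have in mind looks roughly like this (untested sketch; it assumes the parser is XML::Parser, whose "invalid token" error is what gets caught, and the sub name is made up):

# Untested sketch of the catch-and-retry fallback: try the parse, and if it
# dies with "invalid token", assume the bytes are really Latin1, re-encode
# them as UTF-8 and parse once more.
use XML::Parser;
use Encode qw(decode encode);

sub parse_with_latin1_fallback {
    my ($xml) = @_;
    my $parser = XML::Parser->new(Style => 'Tree');
    my $tree = eval { $parser->parse($xml) };
    return $tree if defined $tree;
    die $@ unless $@ =~ /invalid token/;
    return $parser->parse(encode('UTF-8', decode('ISO-8859-1', $xml)));
}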

Thanks, Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
   -- Rick Osborne


Replies are listed 'Best First'.
Re: Guess between UTF8 and Latin1/ISO-8859-1
by bart (Canon) on Jan 21, 2004 at 21:00 UTC
    Sure. Using byte-wise processing, all UTF-8 characters with character code >= 128 must match the following pattern:
    /[\xC0-\xFF][\x80-\xBF]+/
    (Actually you can even put more stringent constraints on the byte sequence, but this will do for a start.)

    It means that if you encounter anything matching /[\x80-\xFF]/ outside what's matched by the above pattern, it's not (valid) UTF-8. You can check this, for example, with the following:

my ($utf8, $bare) = (0, 0);
use bytes;
while (/(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g) {
    $bare++ if defined $1;
    $utf8++ unless defined $1;
}
print <<"END";
utf-8: $utf8
bare: $bare
END

    The idea behind the pattern is that the properly formed UTF-8 characters are eaten using the first alternative, and the remaining bytes by the second.

    If $bare ends up with a value > 0, then it's not UTF-8. If the string doesn't contain any bytes with character code >= 128, then it doesn't matter which you choose. Both $bare and $utf8 will be zero, in that case.
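    For completeness, the "more stringent constraints" mentioned above would look something like this (untested sketch; it assumes $string holds the raw bytes and byte semantics are in effect, e.g. under use bytes):

# Untested: only well-formed UTF-8 sequences pass -- no overlong forms,
# no surrogates, nothing above U+10FFFF.
my $utf8_char = qr{
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequence
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
}x;
my $looks_like_utf8 = $string =~ /\A(?:$utf8_char)*\z/;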

      <off_topic>If it is that easy, how come my MS Internet Explorer miserably fails to automatically recognize the fact that some files are Unicode and I get all kinds of weird characters on my screen?</off_topic>

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        Probably because Microsoft drew the line at examining the whole file for its character set, and only looks at the declared content type. Not to mention the difficulty of trying to figure out the encoding automatically: there is a big difference between "this is invalid UTF-8 so it must be Latin1" and "this weird stuff must be EUC-KR".

        Not to mention, saying a file is Unicode does not specify the encoding. There are multiple encodings for Unicode, and most non-Unicode encodings can be mapped to Unicode, as long as they are declared.

        I don't think they're using Perl on IE. That pretty much explains everything. ;) Actually, those pages would work right if people bothered declaring which encoding they're using. So many standards... so little compliance.

        --
        Allolex

Re: Guess between UTF8 and Latin1/ISO-8859-1
by Joost (Canon) on Jan 21, 2004 at 20:57 UTC
    A couple of pointers: Personally and professionally, I take the stance that any XML file that doesn't start with the endianness 2-byte code is NOT unicode. Anything within that group that doesn't say something about its encoding in the text declaration will be interpreted as being 7-bit ASCII, and any character entities I encounter that exceed the 7-bit range (like &#205; or whatever) are invalid, and the whole file is rejected unless clear agreements have been made about the actual encoding of the content.

    If you make any other assumptions you will be miserable later. I know I have :-/

    The numeric entities can open up a can of worms that's hard to close after the fact: you can decide on an ASCII-encoded XML file, but the actual content can be unicode, LATIN-1, Japanese or whatever, so you need to decide on the encoding of the content separately. (Please someone, correct me if I'm wrong. This has been bugging me for too long.)

    Just my €0.02.
    Joost.

      The byte order mark is only used for UTF-16, the two-byte Unicode encoding. UTF-8 is the default encoding if neither a byte order mark nor an encoding declaration is present. You are correct that if the encoding is not specified and the file is not valid UTF-8, then it is an error.

      Numeric entities are always Unicode characters. Unicode is the only character set used in XML. Different encodings can be specified, but they should be mapped into Unicode so the parser deals with the Unicode.
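      If you do want to look for one, sniffing the BOM only takes a few lines (untested sketch; the sub name is made up for the example):

# Untested sketch: peek at the first bytes of a file for a byte order mark.
# FE FF / FF FE mark UTF-16 (big/little-endian); EF BB BF is a UTF-8 BOM;
# anything else means "no BOM" -- fall back to the encoding declaration,
# or to the UTF-8 default.
sub sniff_bom {
    my ($file) = @_;
    open my $fh, '<', $file or die "can't open $file: $!";
    binmode $fh;
    my $head = '';
    read $fh, $head, 3;
    close $fh;
    return 'UTF-16BE' if $head =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /^\xFF\xFE/;
    return 'UTF-8'    if $head =~ /^\xEF\xBB\xBF/;
    return undef;
}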

      Thanks for your comments. Anyway ...
      1. Sorry. We are not getting any 2-byte BOM and I don't think we ever will.
      2. I don't need to care about those at the moment. All these companies are US-based, and I think it's quite likely that if a company uses something other than Latin1 they will know how to specify that in the XML properly, or that they will use some accented characters in the test jobs so we find out something is wrong during testing.
      3. The whole point of this is that I do not want to reject stuff I don't have to. The imported stuff will be displayed to the users for review and modification, so if we do screw up badly they will notice, but we want to do our best to get the data inside. Of course the scripts will notify me if they have to tweak the XML and I will push the other companies to fix their stuff, but the users should not notice anything if possible.
      4. Again, in this case it's safe enough to assume that if a company knows what the heck entities are, they will be able to specify the encoding properly.

      Jenda
      Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
         -- Rick Osborne


Re: Guess between UTF8 and Latin1/ISO-8859-1
by hardburn (Abbot) on Jan 21, 2004 at 20:35 UTC

    In perluniintro, under the "Questions with Answers" section, there is an example of how to check if a string contains Unicode. It comes with a big warning that you really don't want to do this . . .

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

      It seems you meant the response to the "How Do I Know Whether My String Is In Unicode?" question, right? Well, I don't care whether Perl thinks the string is unicode (I know it does not); I want to know whether the string of bytes could be UTF-8. Anyway, the later answers seem to be what I need. I did try the pack() solution and it seems to be working fine.

      I'll try several ways suggested in that manpage and by other responders and come back with some benchmarks :-)

      Jenda
      Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
         -- Rick Osborne


Re: Guess between UTF8 and Latin1/ISO-8859-1
by ysth (Canon) on Jan 21, 2004 at 21:10 UTC
    Assuming perl thinks it's utf8 data to begin with, you can catch the "invalid" warnings before they happen with something like (untested):

Encode::_utf8_off($str) if !utf8::valid($str);

    Update: Caveat programmer: don't ever use _utf8_off or _utf8_on except where you know perl has the utf8 flag wrong.
Re: Guess between UTF8 and Latin1/ISO-8859-1
by BrowserUk (Patriarch) on Jan 21, 2004 at 22:26 UTC
Re: Guess between UTF8 and Latin1/ISO-8859-1
by Jenda (Abbot) on Jan 21, 2004 at 22:23 UTC

    I tried three options: one using the pack('U0U*', ...) solution from perluniintro, a second using the regexps suggested by bart, and a third using Encode::decode_utf8(), also from perluniintro. The decode_utf8() one is the fastest by far:

my $test = 0;
use warnings;
use bytes;
use Benchmark;
use Encode qw(encode_utf8 decode_utf8);
my $xml;

{
    my $isUTF = 1;
    my $sub = sub {$isUTF = 0};
    sub byPack {
        $SIG{__WARN__} = $sub;
        no warnings 'void';
        my @a = unpack('U0U*', $xml);
        delete $SIG{__WARN__};
        return $isUTF;
    }
}

sub byRegExp {
    my $bad_utf8 = 0;
    while ($xml =~ /(?=[\x80-\xFF])(?:[\xC0-\xFF][\x80-\xBF]+|(.))/g
           and !$bad_utf8) {
        $bad_utf8++ if defined $1;
    }
    return !$bad_utf8;
}

sub byDecode {
    if (decode_utf8($xml)) { return 1 } else { return 0 }
}

print "OK\n";
open XML, '<test-ok.xml';
$xml = do {local $/; <XML>};
close XML;
if ($test) {
    print "byPack=".byPack()."\n";
    print "byRegExp=".byRegExp()."\n";
    print "byDecode=".byDecode()."\n";
} else {
    timethese (10000, {
        byPack   => \&byPack,
        byRegExp => \&byRegExp,
        byDecode => \&byDecode,
    });
}

print "BAD\n";
open XML, '<test-bad.xml';
$xml = do {local $/; <XML>};
close XML;
if ($test) {
    print "byPack=".byPack()."\n";
    print "byRegExp=".byRegExp()."\n";
    print "byDecode=".byDecode()."\n";
} else {
    timethese (10000, {
        byPack   => \&byPack,
        byRegExp => \&byRegExp,
        byDecode => \&byDecode,
    });
}

__END__
OK
Benchmark: timing 10000 iterations of byDecode, byPack, byRegExp...
  byDecode:  0 wallclock secs ( 0.22 usr +  0.00 sys =  0.22 CPU) @ 45662.10/s (n=10000)
            (warning: too few iterations for a reliable count)
    byPack: 15 wallclock secs (15.17 usr +  0.00 sys = 15.17 CPU) @ 659.11/s (n=10000)
  byRegExp:  5 wallclock secs ( 4.22 usr +  0.00 sys =  4.22 CPU) @ 2370.79/s (n=10000)
BAD
Benchmark: timing 10000 iterations of byDecode, byPack, byRegExp...
  byDecode:  0 wallclock secs ( 0.08 usr +  0.00 sys =  0.08 CPU) @ 128205.13/s (n=10000)
            (warning: too few iterations for a reliable count)
    byPack: 15 wallclock secs (15.42 usr +  0.00 sys = 15.42 CPU) @ 648.42/s (n=10000)
  byRegExp:  5 wallclock secs ( 4.25 usr +  0.00 sys =  4.25 CPU) @ 2352.94/s (n=10000)

    The tests were run with two 4KB XMLs; the bad one had an í character added approximately in the middle.

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne


Re: Guess between UTF8 and Latin1/ISO-8859-1
by g00n (Hermit) on Jan 21, 2004 at 22:49 UTC

    Reading through the pod source files (like one does when developing) I came across this in perlpodspec.pod. I've included the text verbatim as it gives (I think) some insight into the problem. It reads ...

      Since Perl recognizes a Unicode Byte Order Mark at the start of files as signaling that the file is Unicode encoded as in UTF-16 (whether big-endian or little-endian) or UTF-8, Pod parsers should do the same.

      Otherwise, the character encoding should be understood as being UTF-8 if the first highbit byte sequence in the file seems valid as a UTF-8 sequence, or otherwise as Latin-1 ...

      ... A naive but sufficient heuristic for testing the first highbit byte-sequence in a BOM-less file (whether in code or in Pod!), to see whether that sequence is valid as UTF-8 (RFC 2279), is to check whether the first byte in the sequence is in the range 0xC0 - 0xFD and whether the next byte is in the range 0x80 - 0xBF. If so, the parser may conclude that this file is in UTF-8, and all highbit sequences in the file should be assumed to be UTF-8.

      Otherwise the parser should treat the file as being in Latin-1. In the unlikely circumstance that the first highbit sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one can cater to our heuristic (as well as any more intelligent heuristic) by prefacing that line with a comment line containing a highbit sequence that is clearly not valid as UTF-8.

      A line consisting of simply "#", an e-acute, and any non-highbit byte, is sufficient to establish this file's encoding.

    From this you should be able to work out UTF-8 vs. Latin-1.
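    In Perl, that heuristic comes out as something like this (untested sketch; the sub name and return values are just for illustration):

# Untested sketch of the perlpodspec heuristic: look at the first high-bit
# byte sequence; if it starts with 0xC0-0xFD followed by 0x80-0xBF, treat
# the whole file as UTF-8, otherwise as Latin-1.
sub guess_by_first_highbit {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /([\x80-\xFF]+)/;
    return $1 =~ /^[\xC0-\xFD][\x80-\xBF]/ ? 'utf8' : 'iso-8859-1';
}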

Re: Guess between UTF8 and Latin1/ISO-8859-1
by graff (Chancellor) on Jan 22, 2004 at 03:29 UTC
    I think your situation is actually a lot simpler than others here would have it. If it's really true that you are only dealing with characters in the "Latin1" range (you seem pretty confident about that), and if the only point of uncertainty about your data is whether it's utf8 or iso-8859-1 (and you really don't need to worry about any other possible alternative for using the upper-table), then you just need to test a particular set of conditions using byte semantics.

    The conditions can be stated in pseudo-code as follows:

if there are no bytes with the 8th bit set
    then there's no problem -- nevermind
else if ( any bytes match /[\xc0\xc1\xc4-\xff]/,
          or an odd number of bytes match /[\x80-\xff]/ )
    then it must be Latin1
else
    make a copy
    delete everything that could be utf8 forms of Latin1 characters:
        s/\xc2[\xa0-\xbf]|\xc3[\x80-\xbf]//g;
    if this removes all bytes with the 8th bit set,
        then the original data is almost certainly utf8
    else
        the original data is definitely Latin1
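    A literal Perl rendering of that decision tree might look like this (untested sketch; it assumes the raw bytes are in $data and the sub name is made up):

# Untested: direct translation of the pseudo-code above, using byte semantics.
sub latin1_or_utf8 {
    my ($data) = @_;
    use bytes;
    return 'either' unless $data =~ /[\x80-\xff]/;             # pure ASCII, no problem
    return 'latin1' if $data =~ /[\xc0\xc1\xc4-\xff]/
                    or (() = $data =~ /[\x80-\xff]/g) % 2;     # odd count of high bytes
    (my $copy = $data) =~ s/\xc2[\xa0-\xbf]|\xc3[\x80-\xbf]//g;
    return $copy =~ /[\x80-\xff]/ ? 'latin1' : 'utf8';
}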
    Now, any of your assurances (assumptions?) might happen to be wrong -- e.g. there may be "noise" in the data, causing a few non-ASCII values to appear "unintentionally"; Latin1 may not be the only single-byte encoding in use; or utf8 encoding may be used for data that happens to include some unicode characters outside the Latin1 range (I've seen this rather often, where Word or some equally clever app uses stuff in the U+2000 range for "recommended forms" of certain punctuation marks -- why these are recommended escapes me at the moment). If any of that could be true for your data, then this simple decision tree could be misleading.

    (That last contingency, finding utf8 code points that don't map to Latin1, could be handled if you apply bart's more broadly scoped means for detecting things that look like utf8.)

    Update: I adjusted the regex for matching things that look like utf8 renderings of Latin1 characters -- it used to be /[\xc2\xc3][\x80-\xbf]/, which was a bit broader than it needed to be for the situation described in the OP. In utf8, the byte pairs "\xc2\x80" thru "\xc2\x9f" would map to "\x80" thru "\x9f" in Latin1, which do not represent any printable characters. (This fact alone might motivate a check such as

if any bytes match /[\x80-\x9f]/
    then it's pretty sure not to be Latin1
    but again, whether this would be enough to conclude that it must be utf8 is just a matter of how much you trust your data, and your knowledge of it.)

    One more update: while those byte-level tests are kinda neat, I think I would end up preferring a simpler, two-step approach (which I think someone else must have mentioned by now):

    eval "\$_ = decode('utf8',\$orig_data,Encode::FB_CROAK)"; if ($@) { # it's not utf8, and so must be iso-8859-1 }
Re: Guess between UTF8 and Latin1/ISO-8859-1
by kamal tejnani (Initiate) on Jan 22, 2004 at 05:18 UTC
    Hi, I faced a similar problem when I was doing a project for a client. The hard part is that there _seems_ to be no solution. Instead, what we did was convert each XML file into a form in tune with the rest of the design of the software: i.e. instead of tweaking the file and catching the error as you have suggested, we made each XML UTF8 by giving it the proper header. There were many reasons for doing that: UTF8 will be the standard, it is the encoding recognised by the library modules that we have to include with our perl scripts, etc. One also has to take into consideration that we may have to change the encoding of the browser and the editor so that they are in tune with the encoding we have chosen. This can be important if you are checking and debugging or in general playing around with the different formats.
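    A minimal sketch of that header fix-up (untested; it assumes the body already is, or has already been converted to, valid UTF-8, and the sub name is made up):

# Untested sketch: make sure the document starts with an XML declaration
# that claims encoding="UTF-8", adding or rewriting one as needed.
sub force_utf8_declaration {
    my ($xml) = @_;
    if ($xml =~ /^\s*<\?xml[^>]*\?>/) {
        # rewrite whatever encoding the declaration claims, or add one
        $xml =~ s/^(\s*<\?xml[^>]*?)\s+encoding\s*=\s*(['"])[^'"]*\2/$1 encoding="UTF-8"/
            or $xml =~ s/^(\s*<\?xml[^>]*?)\s*\?>/$1 encoding="UTF-8"?>/;
    }
    else {
        $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n} . $xml;
    }
    return $xml;
}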
