Re: How do you open/read a text , without knowing its encoding, and remove any BOM if its utf, what do you use?
by graff (Chancellor)
on Apr 21, 2013 at 23:31 UTC
How do you open/read a text , without knowing its encoding, and remove any BOM if its utf, what do you use?
I use trial and error. I first try to treat it as utf8; if that doesn't throw an error, I'm done. (utf8 may well be the most likely outcome anyway.) If the text is not utf8 and contains any non-ASCII characters, a strict utf8 decode will almost certainly fail, and I'll know that it's some other encoding.
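As a rough illustration of that first step, here is a minimal sketch in Python (not Perl, and not the exact code I use). The function name `read_as_utf8` is made up for the example, and it also strips a leading BOM, since that was part of the question.

```python
def read_as_utf8(raw: bytes):
    """Return the decoded text if raw is valid UTF-8, otherwise None."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    # A UTF-8 BOM decodes to U+FEFF at the start of the text; drop it if present.
    return text[1:] if text.startswith("\ufeff") else text
```

A `None` result is the signal to move on to the other checks. Pure ASCII input always decodes successfully, which is harmless because ASCII is a subset of utf8.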
In the latter case, I hope I have some idea of what (human) language the text is supposed to contain, because that will guide how I check for other encodings.
For example, if the language is not Chinese, Japanese or Korean (CJK), the writing system will be one alphabet or another, usually requiring fewer than 128 distinct code points. In this case, a UTF-16 encoding will have a rather lopsided byte histogram, because half the bytes (the ones holding the upper 8 bits of each character) will have a very limited distribution of values: lots of nulls and, depending on the language, lots of, say, 0x06 (if it's Arabic) or 0x04 (if it's Cyrillic). Seeing whether these values occur at even or odd byte offsets will reveal whether the UTF-16 is BE or LE.
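The histogram check can be sketched roughly as follows, again in Python. The function name `guess_utf16_endianness` and the 0.6 threshold are assumptions made for the example, not tuned values; the idea is just to compare how concentrated the byte values are at even versus odd offsets.

```python
from collections import Counter

def guess_utf16_endianness(raw: bytes, threshold: float = 0.6):
    """Return 'utf-16-be', 'utf-16-le', or None from per-offset byte histograms."""
    if len(raw) < 2:
        return None
    even = Counter(raw[0::2])  # bytes at offsets 0, 2, 4, ...
    odd = Counter(raw[1::2])   # bytes at offsets 1, 3, 5, ...
    # Share of the single most common byte value at each offset parity.
    even_top = even.most_common(1)[0][1] / sum(even.values())
    odd_top = odd.most_common(1)[0][1] / sum(odd.values())
    # In UTF-16BE the high byte comes first, so the lopsided side is the even offsets.
    if even_top >= threshold and even_top > odd_top:
        return "utf-16-be"
    if odd_top >= threshold and odd_top > even_top:
        return "utf-16-le"
    return None
```

A `None` result only means the histogram is not lopsided enough to call; it does not prove the data is not UTF-16 (CJK text, for instance, would not show this pattern).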
If the text is supposed to be CJK (and it isn't utf8), I'll go right to Encode::Guess. Likewise if the text is clearly not a 16-bit encoding (i.e. it's not CJK, not UTF-16, and not utf8).
You could probably rely more heavily on Encode::Guess for more of the scenarios, in order to reduce the manual effort. But there are bound to be cases where you really just need to have a human involved (ideally one who knows the language being used in the text).
Bigram statistics for each "language/encoding" tuple serve well as a discriminator, but this depends on having reliable training data for each tuple. If you happen to be dealing with a closed set of possible input types, and just need an automatic way to differentiate between them, you only need a few hundred KB of text per language/encoding tuple to get fairly distinctive bigram statistics.
In effect, in languages that use single-byte encodings, pair-wise byte sequences fall into fairly predictable rankings in terms of frequency of occurrence, and the rankings are distinct from one language to the next. Extending this to CJK would involve a larger quantity of training data, and/or doing statistics on 4-byte sequences (i.e. pairings of 16-bit characters).
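To make the bigram idea concrete, here is a minimal sketch in Python for the single-byte case. The names `bigram_counts`, `train` and `classify` are invented for the example, and the scoring (log-likelihood with add-one smoothing) is one simple choice among several, not necessarily the best one.

```python
import math
from collections import Counter

def bigram_counts(raw: bytes) -> Counter:
    """Count adjacent byte pairs in raw."""
    return Counter(zip(raw, raw[1:]))

def train(samples: dict) -> dict:
    """Map each 'language/encoding' label to the bigram counts of its training bytes."""
    return {label: bigram_counts(raw) for label, raw in samples.items()}

def classify(raw: bytes, models: dict) -> str:
    """Return the label whose bigram model gives raw the highest log-likelihood."""
    observed = bigram_counts(raw)

    def score(model: Counter) -> float:
        total = sum(model.values())
        # Add-one smoothing over the 65536 possible byte pairs,
        # so an unseen pair does not produce log(0).
        return sum(n * math.log((model[pair] + 1) / (total + 65536))
                   for pair, n in observed.items())

    return max(models, key=lambda label: score(models[label]))
```

Usage would be along the lines of `models = train({"ru/koi8-r": koi8_bytes, "ru/cp1251": cp1251_bytes})` followed by `classify(unknown_bytes, models)`, where the training byte strings are your own samples for each tuple. The result is only as good as the training data, and it always picks one of the known labels, so it suits the closed-set situation described above.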