Re: Guess between UTF8 and Latin1/ISO-8859-1 by g00n (Hermit)
on Jan 21, 2004 at 22:49 UTC
Reading through the pod source files (like one does when developing), I came across this in perlpodspec.pod. I've included the text verbatim from the link because it gives (I think) some insight into the problem. It reads ...
Since Perl recognizes a Unicode Byte Order Mark at the start of files as signaling that the file is Unicode encoded as in UTF-16 (whether big-endian or little-endian) or UTF-8, Pod parsers should do the same.
Otherwise, the character encoding should be understood as being UTF-8 if the first highbit byte sequence in the file seems valid as a UTF-8 sequence, or otherwise as Latin-1 ...
... A naive but sufficient heuristic for testing the first highbit byte-sequence in a BOM-less file (whether in code or in Pod!), to see whether that sequence is valid as UTF-8 (RFC 2279) is to check whether that the first byte in the sequence is in the range 0xC0 - 0xFD and whether the next byte is in the range 0x80 - 0xBF. If so, the parser may conclude that this file is in UTF-8, and all highbit sequences in the file should be assumed to be UTF-8.
Otherwise the parser should treat the file as being in Latin-1. In the unlikely circumstance that the first highbit sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one can cater to our heuristic (as well as any more intelligent heuristic) by prefacing that line with a comment line containing a highbit sequence that is clearly not valid as UTF-8.
A line consisting of simply "#", an e-acute, and any non-highbit byte, is sufficient to establish this file's encoding.
From this you should be able to work out whether a file is UTF-8 or Latin-1.
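For what it's worth, here is a rough sketch of how I read the quoted heuristic: check for a BOM first, then look only at the first highbit byte and the byte after it. The sub name guess_encoding is my own invention (not anything from perlpodspec or a module), and it assumes you hand it raw bytes, e.g. read from a filehandle in binmode. Untested against real-world files, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption, not a core or CPAN API):
# guess the encoding of a raw byte string using the BOM check and the
# "first highbit byte sequence" heuristic quoted from perlpodspec.
sub guess_encoding {
    my ($bytes) = @_;

    # A Byte Order Mark at the very start settles the question.
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;

    # Skip the leading run of plain ASCII bytes, then require that the
    # first highbit byte is in 0xC0-0xFD and the next is in 0x80-0xBF.
    return 'UTF-8' if $bytes =~ /\A[\x00-\x7F]*[\xC0-\xFD][\x80-\xBF]/;

    # Anything else (including pure ASCII) is treated as Latin-1.
    return 'ISO-8859-1';
}

print guess_encoding("caf\xC3\xA9\n"), "\n";          # UTF-8
print guess_encoding("caf\xE9\n"), "\n";              # ISO-8859-1
print guess_encoding("#\xE9\n\xC3\xA9 text\n"), "\n"; # ISO-8859-1
```

The last call shows the trick from the spec: a Latin-1 file whose first highbit sequence would otherwise look like UTF-8 is prefaced with a "#", an e-acute, and a non-highbit byte, so the heuristic falls through to Latin-1. Note that this only inspects two bytes; it does not validate the rest of the file.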