Re: Guess between UTF8 and Latin1/ISO-8859-1

by g00n (Hermit)
on Jan 21, 2004 at 22:49 UTC (#323051)

in reply to Guess between UTF8 and Latin1/ISO-8859-1

Reading through the pod source files (like one does when developing), I came across this in perlpodspec.pod. I've included the text verbatim from the link, as it gives (I think) some insight into the problem. It reads:

    Since Perl recognizes a Unicode Byte Order Mark at the start of files as signaling that the file is Unicode encoded as in UTF-16 (whether big-endian or little-endian) or UTF-8, Pod parsers should do the same.

    Otherwise, the character encoding should be understood as being UTF-8 if the first highbit byte sequence in the file seems valid as a UTF-8 sequence, or otherwise as Latin-1 ...

    ... A naive but sufficient heuristic for testing the first highbit byte-sequence in a BOM-less file (whether in code or in Pod!), to see whether that sequence is valid as UTF-8 (RFC 2279) is to check whether that the first byte in the sequence is in the range 0xC0 - 0xFD and whether the next byte is in the range 0x80 - 0xBF. If so, the parser may conclude that this file is in UTF-8, and all highbit sequences in the file should be assumed to be UTF-8.

    Otherwise the parser should treat the file as being in Latin-1. In the unlikely circumstance that the first highbit sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one can cater to our heuristic (as well as any more intelligent heuristic) by prefacing that line with a comment line containing a highbit sequence that is clearly not valid as UTF-8.

    A line consisting of simply "#", an e-acute, and any non-highbit byte, is sufficient to establish this file's encoding.

From this you should be able to work out whether a file is UTF-8 or Latin-1.
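
For what it's worth, here is a rough sketch of the quoted heuristic, written in Python only to keep it short; the byte tests themselves are language-neutral. The function name `guess_encoding` and the returned labels are my own invention, not anything from perlpodspec. Returning "ascii" when no highbit byte is found is also my assumption, since the quoted text does not cover that case separately.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess an encoding for raw bytes, following the quoted heuristic.

    Hypothetical helper: the name and the return labels are illustrative.
    """
    # Step 1: a byte order mark at the start settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"

    # Step 2: find the first highbit byte and inspect it and its successor.
    for i, byte in enumerate(data):
        if byte >= 0x80:
            lead_ok = 0xC0 <= byte <= 0xFD
            next_ok = i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF
            return "utf-8" if lead_ok and next_ok else "latin-1"

    # Assumption: no highbit bytes at all means plain ASCII, which reads
    # the same under either encoding.
    return "ascii"
```

Calling `guess_encoding("café".encode("utf-8"))` gives "utf-8", while the same word encoded as Latin-1 gives "latin-1". The false positive the quote warns about is easy to reproduce: the Latin-1 bytes `b"\xc3\xa9"` look like UTF-8 to this test, and prefixing them with the suggested comment line `b"#\xe9\n"` makes the guess come out as "latin-1". Note that this only checks the first highbit sequence, as the quote describes; it does not validate the whole file.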
