What's the best way to detect character encodings? (Redux)

by Jim (Curate)
on Jun 10, 2013 at 04:36 UTC ( #1037983=perlquestion )
Jim has asked for the wisdom of the Perl Monks concerning the following question:

Two years ago, I posted What's the best way to detect character encodings, Windows-1252 v. UTF-8? to SoPW. I got plenty of helpful answers to my question then. Now, I need to solve essentially the same problem again, but with UTF-16/UTF-16LE/UTF-16BE added to the mix.

Is there a Perl module that will automatically detect text files in these character encodings and normalize them to UTF-8 with byte order marks?

  • ASCII
  • ISO-8859-1 (Latin 1)
  • Windows-1252 (ANSI)
  • UTF-8 (with or without a byte order mark)
  • UTF-16
  • UTF-16LE
  • UTF-16BE
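
Something along these lines is what I have in mind for the Unicode side of the problem, where a byte order mark or a telltale byte pattern gives the encoding away. This is just an untested sketch, and guess_unicode_encoding is a name I made up for illustration:

    use strict;
    use warnings;
    use Encode ();

    # Classify a raw byte string by BOM or by byte pattern.
    # Returns an encoding name, or undef if it looks like one of
    # the single-byte "legacy" encodings instead.
    sub guess_unicode_encoding {
        my ($octets) = @_;

        return 'UTF-8'  if $octets =~ /\A\xEF\xBB\xBF/;            # UTF-8 BOM
        return 'UTF-16' if $octets =~ /\A(?:\xFF\xFE|\xFE\xFF)/;   # UTF-16 BOM, either endianness

        # No BOM: ASCII text encoded as UTF-16 has a NUL in every other
        # byte, which never occurs in the legacy encodings.  (Assumes the
        # file starts with an ASCII character when there is no BOM.)
        return 'UTF-16LE' if $octets =~ /\A[\x01-\x7F]\x00/;
        return 'UTF-16BE' if $octets =~ /\A\x00[\x01-\x7F]/;

        # BOM-less UTF-8: non-ASCII bytes present, and the whole buffer
        # decodes cleanly as UTF-8.
        if ($octets =~ /[\x80-\xFF]/) {
            my $copy = $octets;
            return 'UTF-8'
                if eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
        }

        return undef;    # probably a single-byte legacy encoding
    }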

For my purposes, I can assume that text in a single-byte "legacy" encoding (i.e., not Unicode) consisting solely of characters in the ranges 01-7F and A0-FF is ISO-8859-1. If it has characters in the ranges 80-9F as well, it's Windows-1252. In other words, I can pretend there's no such thing as C1 control codes. (This is what all modern web browsers do, and it's what's specified in the draft HTML5 specification.)

UPDATE: I also want to know which of the lowest common denominator encodings each text file is in. For example, a file that consists solely of bytes in the range 01-7F is, for my purposes, ASCII. Sure, it's also ISO-8859-1, Windows-1252, UTF-8, and dozens of other encodings besides. But it's strictly in the ASCII character encoding, so that's what I want it to be identified as.
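
To make the rules in the last two paragraphs concrete, the legacy classification boils down to three byte-range tests, roughly like this (again, an untested sketch with a made-up sub name):

    # Classify octets that aren't in a Unicode encoding, using the
    # simplifying assumptions above: any byte in 80-9F means
    # Windows-1252 (no C1 control codes), otherwise any byte in A0-FF
    # means ISO-8859-1, and 7-bit-only text is plain ASCII.
    sub classify_legacy_encoding {
        my ($octets) = @_;

        return 'Windows-1252' if $octets =~ /[\x80-\x9F]/;
        return 'ISO-8859-1'   if $octets =~ /[\xA0-\xFF]/;
        return 'ASCII';
    }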

Re: What's the best way to detect character encodings? (Redux)
by gnosti (Friar) on Jun 10, 2013 at 05:02 UTC
    I've heard that detecting encodings is a hard problem.

      It gets difficult when you're trying to detect any arbitrary "legacy" character encoding. For that problem, fancy algorithms that use character frequencies, n-grams, dictionary look-ups, and other methods are required. But I'm able to factor out most of this complexity because I know that any text in a "legacy" single-byte encoding is ASCII/ISO-8859-1/Windows-1252. If it's not, then the damage I do converting it to UTF-8 will be what the provider of the text files deserves, because she didn't provide me Unicode text as she was supposed to in the first place.
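
      Once a file is classified, the conversion itself is the easy part. Roughly like this (untested, with a made-up sub name, and treating cp1252 or whatever the detection step decided on as the decoder):

          use strict;
          use warnings;
          use Encode ();

          # Read a file as raw octets, decode it from the detected encoding
          # (e.g. 'cp1252', 'UTF-16LE'), and rewrite it in place as UTF-8
          # with a byte order mark.
          sub normalize_to_utf8 {
              my ($path, $encoding) = @_;

              open my $in, '<:raw', $path or die "Can't read $path: $!";
              my $octets = do { local $/; <$in> };
              close $in;

              my $text = Encode::decode($encoding, $octets, Encode::FB_CROAK);
              $text =~ s/\A\x{FEFF}//;    # drop any BOM that survived decoding

              open my $out, '>:raw', $path or die "Can't write $path: $!";
              print {$out} "\xEF\xBB\xBF", Encode::encode('UTF-8', $text);
              close $out;
          }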

Re: What's the best way to detect character encodings? (Redux)
by jakeease (Friar) on Jun 10, 2013 at 08:11 UTC
