Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

What's the best way to detect character encodings? (Redux)

by Jim (Curate)
on Jun 10, 2013 at 04:36 UTC ( #1037983=perlquestion: print w/ replies, xml ) Need Help??
Jim has asked for the wisdom of the Perl Monks concerning the following question:

Two years ago, I posted What's the best way to detect character encodings, Windows-1252 v. UTF-8? to SoPW. I got plenty of helpful answers to my question then. Now, I need to solve essentially the same problem again, but with UTF-16/UTF-16LE/UTF-16BE added to the mix.

Is there a Perl module that will automatically detect text files in these character encodings and normalize them to UTF-8 with byte order marks?

  • ASCII
  • ISO-8859-1 (Latin 1)
  • Windows-1252 (ANSI)
  • UTF-8 (with or without a byte order mark)
  • UTF-16
  • UTF-16LE
  • UTF-16BE

For my purposes, I can assume that text in a single-byte "legacy" encoding (i.e., not Unicode) consisting solely of characters in the ranges 01-7F and A0-FF is ISO-8859-1. If it has characters in the ranges 80-9F as well, it's Windows-1252. In other words, I can pretend there's no such thing as C1 control codes. (This is what all modern web browsers do, and it's what's specified in the draft HTML5 specification.)

UPDATE: I also want to know which of the lowest common denominator encodings each text file is in. For example, a file that consists solely of bytes in the range 01-7F is, for my purposes, ASCII. Sure, it's also ISO-8859-1, Windows-1252, UTF-8, and dozens of other encodings besides. But it's strictly in the ASCII character encoding, so that's what I want it to be identified as.

Comment on What's the best way to detect character encodings? (Redux)
Re: What's the best way to detect character encodings? (Redux)
by gnosti (Friar) on Jun 10, 2013 at 05:02 UTC
    I've heard that detecting encodings is a hard problem.

      It gets difficult when it's arbitrarily any "legacy" character encoding you're trying to detect. For that problem, fancy algorithms that use character frequencies, n-grams, dictionary look-ups, and other methods are required. But I'm able to factor out most of this complexity because I know that any text in a "legacy" single-byte encoding is ASCII/ISO-8859-1/Windows-1252. If it's not, then the damage I do converting it to UTF-8 will be what the provider of the text files deserves on-account-of-because she didn't provide me Unicode text as she was supposed to do in the first place.

Re: What's the best way to detect character encodings? (Redux)
by jakeease (Friar) on Jun 10, 2013 at 08:11 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1037983]
Approved by vinoth.ree
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (10)
As of 2014-12-19 07:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (72 votes), past polls