What's the best way to detect character encodings? (Redux)

by Jim (Curate)
on Jun 10, 2013 at 04:36 UTC ( #1037983=perlquestion )
Jim has asked for the wisdom of the Perl Monks concerning the following question:

Two years ago, I posted "What's the best way to detect character encodings, Windows-1252 v. UTF-8?" to Seekers of Perl Wisdom (SoPW). I got plenty of helpful answers then. Now I need to solve essentially the same problem again, but with UTF-16/UTF-16LE/UTF-16BE added to the mix.

Is there a Perl module that will automatically detect text files in these character encodings and normalize them to UTF-8 with byte order marks?

  • ISO-8859-1 (Latin 1)
  • Windows-1252 (ANSI)
  • UTF-8 (with or without a byte order mark)
  • UTF-16
  • UTF-16LE
  • UTF-16BE

For my purposes, I can assume that text in a single-byte "legacy" encoding (i.e., not Unicode) consisting solely of characters in the ranges 01-7F and A0-FF is ISO-8859-1. If it has characters in the ranges 80-9F as well, it's Windows-1252. In other words, I can pretend there's no such thing as C1 control codes. (This is what all modern web browsers do, and it's what's specified in the draft HTML5 specification.)
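To make the rule above concrete, here is a minimal sketch of it in Perl. The name classify_legacy is hypothetical, and the sketch assumes the input is already known to be single-byte legacy text (not Unicode) held as a raw byte string.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a byte string that is already known to be
# in a single-byte legacy encoding, using the rule stated above.
sub classify_legacy {
    my ($bytes) = @_;

    # Any byte in 80-9F is taken to be a Windows-1252 character,
    # never a C1 control code.
    return 'Windows-1252' if $bytes =~ /[\x80-\x9F]/;

    # Only bytes in 01-7F and A0-FF remain.
    return 'ISO-8859-1';
}

print classify_legacy("caf\xE9"), "\n";         # E9 is in A0-FF: ISO-8859-1
print classify_legacy("\x93quoted\x94"), "\n";  # 93 and 94 are in 80-9F: Windows-1252
```

The sketch does not check for the handful of bytes that Windows-1252 leaves unassigned (81, 8D, 8F, 90, 9D); it simply follows the simplification described in the paragraph above.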

UPDATE: I also want to know which of the lowest common denominator encodings each text file is in. For example, a file that consists solely of bytes in the range 01-7F is, for my purposes, ASCII. Sure, it's also ISO-8859-1, Windows-1252, UTF-8, and dozens of other encodings besides. But it's strictly in the ASCII character encoding, so that's what I want it to be identified as.
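As an illustration of the "lowest common denominator" idea, the following sketch tests the narrowest encoding first and only then falls back to wider ones. The name identify_8bit is hypothetical. It assumes the data has no byte order mark and is not UTF-16, both of which would need a separate check, and it assumes that anything which decodes as strict UTF-8 should be called UTF-8, which is a heuristic rather than a guarantee.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: name the narrowest encoding that fits a BOM-less
# byte string. UTF-16 and BOM handling are deliberately left out.
sub identify_8bit {
    my ($bytes) = @_;

    # Nothing outside 01-7F: strictly ASCII.
    # (A 00 byte is outside 01-7F, so such input is not reported as ASCII.)
    return 'ASCII' if $bytes !~ /[^\x01-\x7F]/;

    # Strict UTF-8 decoding croaks on malformed input; eval traps that.
    my $is_utf8 = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return 'UTF-8' if $is_utf8;

    # Otherwise apply the legacy rule: 80-9F present means Windows-1252.
    return 'Windows-1252' if $bytes =~ /[\x80-\x9F]/;
    return 'ISO-8859-1';
}

for my $sample ("plain text", "caf\xC3\xA9", "caf\xE9", "\x93caf\xE9\x94") {
    print identify_8bit($sample), "\n";
}
# prints ASCII, UTF-8, ISO-8859-1, Windows-1252 (one per line)
```

Testing UTF-8 before the legacy encodings is a design choice: legacy text that happens to form valid UTF-8 sequences would be misidentified, but for real-world text that is uncommon.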

Re: What's the best way to detect character encodings? (Redux)
by jakeease (Friar) on Jun 10, 2013 at 08:11 UTC
Re: What's the best way to detect character encodings? (Redux)
by gnosti (Friar) on Jun 10, 2013 at 05:02 UTC
    I've heard that detecting encodings is a hard problem.

      It gets difficult when it's arbitrarily any "legacy" character encoding you're trying to detect. For that problem, fancy algorithms that use character frequencies, n-grams, dictionary look-ups, and other methods are required. But I'm able to factor out most of this complexity because I know that any text in a "legacy" single-byte encoding is ASCII/ISO-8859-1/Windows-1252. If it's not, then the damage I do converting it to UTF-8 will be what the provider of the text files deserves on-account-of-because she didn't provide me Unicode text as she was supposed to do in the first place.
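Once the encoding is known, the conversion itself is short. The sketch below uses the core Encode module; the name to_utf8_with_bom is hypothetical, and it assumes the caller passes an encoding name that Encode recognizes (for example 'cp1252', 'iso-8859-1', 'UTF-8', 'UTF-16LE', 'UTF-16BE', or 'UTF-16' for input that starts with a byte order mark).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: $encoding is whatever name the detection step produced.
sub to_utf8_with_bom {
    my ($bytes, $encoding) = @_;

    # Decode strictly, so mislabeled input croaks instead of being
    # silently converted into replacement characters.
    my $text = decode($encoding, $bytes,
                      Encode::FB_CROAK() | Encode::LEAVE_SRC());

    # Drop a byte order mark that survived decoding, so the output
    # never ends up with two of them.
    $text =~ s/\A\x{FEFF}//;

    # Re-encode as UTF-8 with exactly one leading BOM (EF BB BF).
    return encode('UTF-8', "\x{FEFF}" . $text);
}

my $out = to_utf8_with_bom("\x93caf\xE9\x94", 'cp1252');
print join(' ', map { sprintf '%02X', ord } split //, $out), "\n";
# prints: EF BB BF E2 80 9C 63 61 66 C3 A9 E2 80 9D
```

If the input really is in some other legacy encoding, decoding it as cp1252 will usually still succeed and quietly produce the wrong characters, which is the damage described above.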
