What's the best way to detect character encodings? (Redux)

by Jim (Curate)
on Jun 10, 2013 at 04:36 UTC ( [id://1037983] )

Jim has asked for the wisdom of the Perl Monks concerning the following question:

Two years ago, I posted What's the best way to detect character encodings, Windows-1252 v. UTF-8? to SoPW. I got plenty of helpful answers to my question then. Now, I need to solve essentially the same problem again, but with UTF-16/UTF-16LE/UTF-16BE added to the mix.

Is there a Perl module that will automatically detect text files in these character encodings and normalize them to UTF-8 with byte order marks?

  • ASCII
  • ISO-8859-1 (Latin 1)
  • Windows-1252 (ANSI)
  • UTF-8 (with or without a byte order mark)
  • UTF-16
  • UTF-16LE
  • UTF-16BE

For my purposes, I can assume that text in a single-byte "legacy" encoding (i.e., not Unicode) consisting solely of characters in the ranges 01-7F and A0-FF is ISO-8859-1. If it has characters in the ranges 80-9F as well, it's Windows-1252. In other words, I can pretend there's no such thing as C1 control codes. (This is what all modern web browsers do, and it's what's specified in the draft HTML5 specification.)

UPDATE: I also want to know which of the lowest common denominator encodings each text file is in. For example, a file that consists solely of bytes in the range 01-7F is, for my purposes, ASCII. Sure, it's also ISO-8859-1, Windows-1252, UTF-8, and dozens of other encodings besides. But it's strictly in the ASCII character encoding, so that's what I want it to be identified as.
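A rough sketch of the heuristic described above, in the order that yields the "lowest common denominator" answer: BOM checks first, then a crude NUL-pattern test for BOM-less UTF-16, then ASCII, then strict UTF-8 validation, then the single-byte ranges. The `classify_bytes` name and the 40% NUL threshold for BOM-less UTF-16 are my own inventions, not anything standard; only `Encode::decode` with `FB_CROAK` is a real, documented API.

```perl
use strict;
use warnings;
use Encode ();

# Classify a raw byte string by the narrowest encoding it fits.
# A sketch of the byte-range heuristic, not a drop-in module.
sub classify_bytes {
    my ($bytes) = @_;

    return 'UTF-8'  if $bytes =~ /^\xEF\xBB\xBF/;   # UTF-8 BOM
    return 'UTF-16' if $bytes =~ /^\xFF\xFE/        # BOM present, either order:
                    || $bytes =~ /^\xFE\xFF/;       # call it plain "UTF-16"

    # BOM-less UTF-16: mostly-ASCII text shows NULs in alternating
    # positions. The 40% threshold is an arbitrary guess.
    if (length($bytes) >= 4) {
        my @b = unpack 'C*', $bytes;
        my ($even_nul, $odd_nul) = (0, 0);
        for my $i (0 .. $#b) {
            next if $b[$i];
            $i % 2 ? $odd_nul++ : $even_nul++;
        }
        my $half = @b / 2;
        return 'UTF-16LE' if $odd_nul  > $half * 0.4;  # high bytes are NUL
        return 'UTF-16BE' if $even_nul > $half * 0.4;
    }

    return 'ASCII' if $bytes =~ /^[\x01-\x7F]*\z/;

    # Strict UTF-8 validity check: decode() croaks on malformed input.
    # Decode a copy, since decode() may consume its argument.
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return 'UTF-8' if $ok;

    return 'ISO-8859-1' if $bytes =~ /^[\x01-\x7F\xA0-\xFF]*\z/;
    return 'Windows-1252';   # bytes in the 80-9F range are present
}
```

For example, `"caf\xE9"` is malformed as UTF-8 (a dangling lead byte) and contains nothing in 80-9F, so it classifies as ISO-8859-1, while `"caf\xE9 \x93hi\x94"` picks up Windows-1252 from the curly quotes.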


Replies are listed 'Best First'.
Re: What's the best way to detect character encodings? (Redux)
by jakeease (Friar) on Jun 10, 2013 at 08:11 UTC
Re: What's the best way to detect character encodings? (Redux)
by gnosti (Chaplain) on Jun 10, 2013 at 05:02 UTC
    I've heard that detecting encodings is a hard problem.

      It gets difficult when it's arbitrarily any "legacy" character encoding you're trying to detect. For that problem, fancy algorithms that use character frequencies, n-grams, dictionary look-ups, and other methods are required. But I'm able to factor out most of this complexity because I know that any text in a "legacy" single-byte encoding is ASCII/ISO-8859-1/Windows-1252. If it's not, then the damage I do converting it to UTF-8 will be what the provider of the text files deserves, because she didn't provide me Unicode text as she was supposed to in the first place.
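Given that simplification, the normalization step is small. Here is a sketch of converting one file to UTF-8 with a BOM once its encoding is known, using only the real `Encode` API and PerlIO layers. The `normalize_to_utf8` name and the `.utf8` output suffix are my own choices; note that treating ASCII and ISO-8859-1 as cp1252 is safe here because cp1252 is a superset of both over the byte ranges allowed above.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Rewrite $path as UTF-8 with a byte order mark, given the encoding
# detected for it ('ASCII', 'Windows-1252', 'UTF-8', 'UTF-16', ...).
sub normalize_to_utf8 {
    my ($path, $encoding) = @_;

    open my $in, '<:raw', $path or die "$path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    $bytes =~ s/^\xEF\xBB\xBF//;    # strip an existing UTF-8 BOM, if any
    $encoding = 'cp1252'
        if $encoding =~ /^(?:ASCII|ISO-8859-1|Windows-1252)$/;

    # decode('UTF-16', ...) consumes the BOM and picks the byte order;
    # 'UTF-16LE'/'UTF-16BE' handle the BOM-less cases.
    my $text = decode($encoding, $bytes, FB_CROAK);

    open my $out, '>:raw:encoding(UTF-8)', "$path.utf8" or die "$path.utf8: $!";
    print {$out} "\x{FEFF}", $text;  # leading U+FEFF becomes the UTF-8 BOM
    close $out;
    return "$path.utf8";
}
```

Writing U+FEFF through the `:encoding(UTF-8)` layer emits the three BOM bytes `EF BB BF`, so the output files are uniformly BOM-marked regardless of what the input looked like.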

Node Type: perlquestion [id://1037983]
Approved by vinoth.ree
Front-paged by Old_Gray_Bear