Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re^2: What's the best way to detect character encodings? (Redux)

by Jim (Curate)
on Jun 10, 2013 at 05:12 UTC ( #1037989=note: print w/ replies, xml ) Need Help??


in reply to Re: What's the best way to detect character encodings? (Redux)
in thread What's the best way to detect character encodings? (Redux)

It gets difficult when it's arbitrarily any "legacy" character encoding you're trying to detect. For that problem, fancy algorithms that use character frequencies, n-grams, dictionary look-ups, and other methods are required. But I'm able to factor out most of this complexity because I know that any text in a "legacy" single-byte encoding is ASCII/ISO-8859-1/Windows-1252. If it's not, then the damage I do converting it to UTF-8 will be what the provider of the text files deserves on-account-of-because she didn't provide me Unicode text as she was supposed to do in the first place.


Comment on Re^2: What's the best way to detect character encodings? (Redux)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1037989]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (7)
As of 2014-09-01 13:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (9 votes), past polls