Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re^2: What's the best way to detect character encodings? (Redux)

by Jim (Curate)
on Jun 10, 2013 at 05:12 UTC ( #1037989=note: print w/ replies, xml ) Need Help??


in reply to Re: What's the best way to detect character encodings? (Redux)
in thread What's the best way to detect character encodings? (Redux)

It gets difficult when it's arbitrarily any "legacy" character encoding you're trying to detect. For that problem, fancy algorithms that use character frequencies, n-grams, dictionary look-ups, and other methods are required. But I'm able to factor out most of this complexity because I know that any text in a "legacy" single-byte encoding is ASCII/ISO-8859-1/Windows-1252. If it's not, then the damage I do converting it to UTF-8 will be what the provider of the text files deserves on-account-of-because she didn't provide me Unicode text as she was supposed to do in the first place.


Comment on Re^2: What's the best way to detect character encodings? (Redux)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1037989]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (13)
As of 2015-07-30 12:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (271 votes), past polls