PerlMonks  

Unicode nightmare

by perlmonkey2 (Beadle)
on Jul 27, 2006 at 17:03 UTC
perlmonkey2 has asked for the wisdom of the Perl Monks concerning the following question:

I deal with a large amount of text from random sources from all over the world. The text is often in a legacy text encoding. Is there a way to automagically determine the encoding type to help ease the pain of encoding it to utf8?

Re: Unicode nightmare
by allolex (Curate) on Jul 27, 2006 at 17:19 UTC

    You might want to have a look at the Encode:: namespace on CPAN.

      You might want to make a proper link to a concrete module next time.

      Encode-Detect

Re: Unicode nightmare
by rhesa (Vicar) on Jul 27, 2006 at 17:22 UTC
    Encode::Guess might help you on your way.

    In general, differentiating between various 8-bit character sets is a hairy problem. If you have nothing else to go on besides the text files, I suspect you're going to need clever heuristics. But try Encode::Guess first; it might be enough.
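    A minimal sketch of what trying Encode::Guess first could look like. The helper name decode_guessed is made up for illustration; guess_encoding is the function the module exports, and it returns an encoding object on success or a plain error string on failure. Note that when two 8-bit suspects both fit the data, the guess fails as ambiguous, which is exactly the hairy case described above.

    ```perl
    use strict;
    use warnings;
    use Encode::Guess;    # exports guess_encoding

    # Hypothetical helper: decode $bytes using whatever Encode::Guess settles on.
    # By default only ASCII, UTF-8 and BOM-marked UTF-16/32 are considered;
    # @suspects adds further candidate encodings for this call.
    sub decode_guessed {
        my ($bytes, @suspects) = @_;
        my $enc = guess_encoding($bytes, @suspects);

        # On failure guess_encoding returns an error string, not an object.
        ref $enc or die "Cannot guess encoding: $enc\n";

        return ($enc->decode($bytes), $enc->name);
    }
    ```

    Called as decode_guessed($bytes, qw/iso-8859-1 iso-8859-7/) on typical Western European text, this will usually die with an "X or Y" message, because both suspects accept the same bytes.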

      ++. Exactly!

Re: Unicode nightmare
by ikegami (Pope) on Jul 27, 2006 at 17:32 UTC

    No. The encoding is what determines that 65 66 67 should be displayed as ABC (or something else). There's nothing attached to "65" that would indicate US-ASCII should be used.

    However, there are ways of determining the probable encoding.

    • Searching for the BOM of Unicode encodings.
    • Eliminating character sets based on the presence of non-printable or undefined characters.
    • Dictionary validation. Check if the text becomes readable when treated as a particular encoding.
    • Statistical approaches such as frequency analysis.

    Good luck!
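
    The first two approaches in the list can be sketched roughly as follows. The sub names bom_encoding and surviving_encodings are invented for this example, and the candidate list is whatever you choose to pass in. Elimination here relies on Encode's decode croaking under FB_CROAK when a byte sequence is malformed or unmapped in the candidate encoding; it does not check for non-printable characters, which would need an extra pass.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Longer BOMs come first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    my @BOMS = (
        [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
        [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
        [ "\xEF\xBB\xBF",     'UTF-8'    ],
        [ "\xFE\xFF",         'UTF-16BE' ],
        [ "\xFF\xFE",         'UTF-16LE' ],
    );

    # Return the encoding name implied by a leading BOM, or nothing.
    sub bom_encoding {
        my ($bytes) = @_;
        for my $pair (@BOMS) {
            my ($bom, $name) = @$pair;
            return $name if index($bytes, $bom) == 0;
        }
        return;
    }

    # Keep only the candidates under which every byte decodes cleanly.
    sub surviving_encodings {
        my ($bytes, @candidates) = @_;
        return grep {
            my $copy = $bytes;    # decode may modify its argument when CHECK is set
            eval { decode($_, $copy, Encode::FB_CROAK()); 1 }
        } @candidates;
    }
    ```

    Elimination only narrows the field: most 8-bit character sets define nearly every byte, so several candidates normally survive and the dictionary or statistical steps have to break the tie.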

Re: Unicode nightmare
by perlmonkey2 (Beadle) on Jul 27, 2006 at 17:57 UTC
    rhesa, thanks for the Encode::Guess module. I thought I'd pored over the whole Encode namespace, but I missed that one. ikegami, thanks for letting me know what I'm in for. I had guessed that, worst case, I was going to be doing a lot of evals to catch the errors thrown when using the wrong encoding. And since no human will be part of the process and I don't have any statistics, distinguishing a Latin capital A with diaeresis from a Greek capital delta will probably be impossible. The best I can hope for is to minimize the unintelligible characters.
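
    To make the A-with-diaeresis versus delta problem concrete, here is a small sketch (the helper name readings is made up). The single byte 0xC4 is a perfectly valid character in both ISO-8859-1 and ISO-8859-7, so no eval will ever catch the wrong choice:

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Hypothetical helper: decode the same bytes under each named encoding
    # and return a hash of encoding name => decoded string.
    sub readings {
        my ($bytes, @encodings) = @_;
        return map { $_ => decode($_, $bytes) } @encodings;
    }

    my %as = readings("\xC4", qw(iso-8859-1 iso-8859-7));
    printf "%-10s U+%04X\n", $_, ord $as{$_} for sort keys %as;
    # iso-8859-1 yields U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS)
    # iso-8859-7 yields U+0394 (GREEK CAPITAL LETTER DELTA)
    ```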
Re: Unicode nightmare
by Thelonius (Curate) on Jul 28, 2006 at 03:04 UTC
    (1) Make sure you don't lose any metadata that comes with the text (e.g. charset parameter in MIME Content-type)
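
    As a rough illustration of keeping that metadata, the sketch below pulls the charset parameter out of a Content-Type header value and maps it to a name Encode knows. The sub name and the regex are assumptions made for this example, not a full MIME parser; a real application would be better served by a proper header-parsing module.

    ```perl
    use strict;
    use warnings;
    use Encode ();

    # Hypothetical helper: return Encode's canonical name for the charset
    # declared in a Content-Type value, or nothing if absent or unknown.
    sub charset_from_content_type {
        my ($header) = @_;
        return unless defined $header;

        # Accept both charset=foo and charset="foo".
        if ($header =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
            my $label = defined $1 ? $1 : $2;
            my $enc   = Encode::find_encoding($label);
            return $enc->name if $enc;
        }
        return;
    }
    ```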

    (2) If your text includes the ESCAPE character, it may have ISO-2022 shift sequences in it which identify the character set. All the registered character sets are at http://www.itscj.ipsj.or.jp/ISO-IR/. The actual escape codes are defined in each PDF file. There doesn't seem to be a comprehensive table anywhere on the internet! Note that when ISO registry #165 says that the escape sequence (for G2) is ESC 2/4 2/10 4/5, that means "\e\x24\x2A\x45". (Of course "\x24\x2A\x45" are the characters $ * E.)
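
    The column/row notation used in the registry converts mechanically: each x/y pair is the byte 16*x + y. A small sketch of that conversion (the sub name iso_ir_to_bytes is invented here) lets you turn a registry entry into a string you can search your data for:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: turn registry notation such as "ESC 2/4 2/10 4/5"
    # into the corresponding byte string.
    sub iso_ir_to_bytes {
        my ($notation) = @_;
        my $bytes = '';
        for my $token (split ' ', $notation) {
            if ($token eq 'ESC') {
                $bytes .= "\e";
                next;
            }
            my ($col, $row) = $token =~ m{^(\d+)/(\d+)$}
                or die "Unrecognised token: $token\n";
            $bytes .= chr($col * 16 + $row);    # column/row to byte value
        }
        return $bytes;
    }

    # Example: does some byte string $data contain the #165 G2 designation?
    # my $seen = index($data, iso_ir_to_bytes('ESC 2/4 2/10 4/5')) >= 0;
    ```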

    You don't have to understand G0, G1 and G2 to recognize the character sets, although you would need to in order to actually translate them to Unicode. I don't know if Encode handles ISO-2022 encoding generally. ICU handles the more commonly used parts of it.

    Some general character set links:

Node Type: perlquestion [id://564168]
Approved by Hue-Bond