PerlMonks  

Unicode nightmare

by perlmonkey2 (Beadle)
on Jul 27, 2006 at 17:03 UTC (#564168, perlquestion)
perlmonkey2 has asked for the wisdom of the Perl Monks concerning the following question:

I deal with a large amount of text from random sources from all over the world. The text is often in a legacy text encoding. Is there a way to automagically determine the encoding type to help ease the pain of encoding it to utf8?

Re: Unicode nightmare
by allolex (Curate) on Jul 27, 2006 at 17:19 UTC

    You might want to have a look at the Encode:: namespace on CPAN.

      You might want to make a proper link to a concrete module next time.

      Encode-Detect

Re: Unicode nightmare
by rhesa (Vicar) on Jul 27, 2006 at 17:22 UTC
    Encode::Guess might help you on your way.

    In general, differentiating between various 8-bit character sets is a hairy problem. If you have nothing else to go on besides the text files, I suspect you're going to need clever heuristics. But try Encode::Guess first; it might be enough.
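
    A minimal sketch of how Encode::Guess might be called, assuming the module's documented guess_encoding() interface, which returns an encoding object on success and a plain error string otherwise. The helper name guess_and_decode and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;

# Hypothetical helper: returns (encoding_name, decoded_text) on success,
# or (undef, reason) when Encode::Guess cannot settle on one candidate.
sub guess_and_decode {
    my ($octets, @suspects) = @_;
    my $decoder = guess_encoding($octets, @suspects);
    return (undef, $decoder) unless ref $decoder;    # error string, not an object
    return ($decoder->name, $decoder->decode($octets));
}

# Sample input: "cafe" with an e-acute, encoded as UTF-8.
my ($name, $text) = guess_and_decode("caf\xC3\xA9");
print defined $name ? "guessed: $name\n" : "no guess: $text\n";

# Two 8-bit suspects both accept the byte 0xC4, so no single guess is possible.
my ($name2, $why) = guess_and_decode("\xC4", qw(iso-8859-1 iso-8859-7));
print defined $name2 ? "guessed: $name2\n" : "no guess: $why\n";
```

    The second call shows the hairy part: any byte string is valid in most single-byte character sets, so with several 8-bit suspects the module tends to report ambiguity rather than an answer.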

      ++. Exactly!

Re: Unicode nightmare
by ikegami (Pope) on Jul 27, 2006 at 17:32 UTC

    No. The encoding is what determines that 65 66 67 should be displayed as ABC (or something else). There's nothing attached to "65" that would indicate US-ASCII should be used.

    However, there are ways of determining the probable encoding.

    • Searching for the BOM (byte order mark) of Unicode encodings.
    • Eliminating character sets based on the presence of non-printable or undefined characters.
    • Dictionary validation. Check if the text becomes readable when treated as a particular encoding.
    • Statistical approaches such as frequency analysis.
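
    The first item in the list above can be sketched in a few lines. This is only an illustration: the helper name sniff_bom is invented here, and the table covers just the common UTF-8, UTF-16 and UTF-32 marks.

```perl
use strict;
use warnings;

# Byte order marks, longest first so that a UTF-32LE mark is not
# mistaken for the UTF-16LE mark it begins with.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: returns the encoding name implied by a leading
# BOM, or undef when the data does not start with one.
sub sniff_bom {
    my ($octets) = @_;
    for my $pair (@BOMS) {
        my ($bom, $name) = @$pair;
        return $name if substr($octets, 0, length $bom) eq $bom;
    }
    return;
}

my $found = sniff_bom("\xEF\xBB\xBFhello");
print defined $found ? "BOM says $found\n" : "no BOM\n";
```

    A missing BOM proves nothing, since most legacy 8-bit text never carries one, and FF FE 00 00 could in principle also be UTF-16LE text that starts with a NUL character.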

    Good luck!

Re: Unicode nightmare
by perlmonkey2 (Beadle) on Jul 27, 2006 at 17:57 UTC
    Rhesa, thanks for the Encode::Guess module. I thought I'd pored over all of the Encode namespace, but I missed that one. ikegami, thanks for letting me know what I'm in for. I had guessed that in the worst case I was going to be doing a lot of evals to catch errors thrown when using the wrong encoding. And since no human will be part of the process and I don't have any statistics, distinguishing a Latin capital A with diaeresis from a Greek capital delta will probably be impossible. The best I can hope for is to minimize the unintelligible characters.
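
    The "lot of evals" approach might look roughly like this, assuming Encode's decode() with the FB_CROAK check, which dies on malformed or unmapped input. The helper name surviving_encodings and the candidate list are placeholders.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the candidate encodings under which
# $octets decodes without throwing an error.
sub surviving_encodings {
    my ($octets, @candidates) = @_;
    my @ok;
    for my $enc (@candidates) {
        my $copy = $octets;    # decode() may modify its input when a check is given
        my $text = eval { decode($enc, $copy, Encode::FB_CROAK) };
        push @ok, $enc if defined $text;
    }
    return @ok;
}

# 0xC4 0xE9 is malformed as UTF-8 but decodes cleanly as Latin-1 or Greek.
my @left = surviving_encodings("\xC4\xE9", qw(UTF-8 iso-8859-1 iso-8859-7));
print "still possible: @left\n";
```

    This only eliminates encodings in which the bytes are outright invalid. It cannot choose between the survivors, which is exactly the A-with-diaeresis versus delta problem.
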
Re: Unicode nightmare
by Thelonius (Curate) on Jul 28, 2006 at 03:04 UTC
    (1) Make sure you don't lose any metadata that comes with the text (e.g. charset parameter in MIME Content-type)

    (2) If your text includes the ESCAPE character, it may have ISO-2022 shift sequences in it which identify the character set. All the registered character sets are at http://www.itscj.ipsj.or.jp/ISO-IR/. The actual escape codes are defined in each PDF file. There doesn't seem to be a comprehensive table anywhere on the internet! Note that when ISO registry #165 says that the escape sequence (for G2) is ESC 2/4 2/10 4/5, that means "\e\x24\x2A\x45". (Of course "\x24\x2A\x45" are the characters $ * E.)

    You don't have to understand about G0, G1, G2 to recognize the character sets, although you would to actually translate them to Unicode. I don't know if Encode handles ISO-2022 encoding generally. ICU handles the more commonly used parts of it.
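
    Spotting the escape sequences can be sketched as below. The sketch assumes the general ISO-2022 shape of ESC, then zero or more intermediate bytes (0x20-0x2F), then one final byte (0x30-0x7E). The helper names find_escape_sequences and column_row are invented for this example, and mapping each hit to a registered character set still means consulting the registry.

```perl
use strict;
use warnings;

# Hypothetical helper: collects the distinct escape sequences in the data.
# Pattern: ESC, any intermediate bytes (0x20-0x2F), one final byte (0x30-0x7E).
sub find_escape_sequences {
    my ($octets) = @_;
    my %seen;
    $seen{$1}++ while $octets =~ /(\e[\x20-\x2F]*[\x30-\x7E])/g;
    return sort keys %seen;
}

# Hypothetical helper: renders a sequence in the registry's column/row
# notation, so "\e\x24\x2A\x45" becomes "ESC 2/4 2/10 4/5".
sub column_row {
    my ($seq) = @_;
    my @bytes = unpack 'C*', $seq;
    shift @bytes;    # drop the ESC byte itself
    return join ' ', 'ESC', map { ($_ >> 4) . '/' . ($_ & 0x0F) } @bytes;
}

my $sample = "text \e\x24\x2A\x45 more text";
print column_row($_), "\n" for find_escape_sequences($sample);
```

    The pattern also matches escape sequences that are not character set designations (terminal control codes, for instance), so the hits are only candidates to look up.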

    Some general character set links:
