What encoding am I (probably) using?

tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

O wise monks,

Let's say I want to process some text whose encoding is uncertain, except that it is probably text, and probably in a Western (1 byte per character) language. I want to do some text processing on it, such as extracting all the words from it. Before doing anything, I want to use

    Encode::from_to($line, "$probable_encoding", 'iso-8859-1')

to put everything into iso-8859-1 in (probable) good form.

Is there anything I can use that will give me the "probable encoding" for a file / string / whatever?

I was led in this direction by the venerable thundergnat's answer to my earlier question, "matching german characters output from system call", where he suggested I run Encode::from_to($latinresult, 'cp437', 'iso-8859-1'); before matching the output of a system call on my German WinXP box. But how did he know to use 'cp437'?

UPDATE: Thanks monks, Encode::Guess looks good. I'm going to go try it out.

Re: What encoding am I (probably) using? by mlh2003 (Scribe) on May 13, 2005 at 13:00 UTC
You might want to check out the Encode::Guess module...

_______
Code is untested unless explicitly stated
mlh2003
Re: What encoding am I (probably) using? by thundergnat (Deacon) on May 13, 2005 at 13:02 UTC
> Is there anything I can use that will give me the "probable encoding" for a file / string / whatever?

I suspect that Encode::Guess might be your best bet. See

    perldoc Encode::Guess

I am not sure how well it deals with DOS code page encodings though, especially if there may be several possibilities.

> But how did he know to use 'cp437'?

I went to Google, entered: DOS code page, and pressed "I'm feeling lucky!" ;-)
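To make the cp437 suggestion concrete, here is a minimal sketch. It assumes the console output really is cp437. The sample bytes are hand-written for illustration, not captured from a real ping call, and the variable names are hypothetical. Instead of converting in place with from_to, it decodes to Perl characters with Encode::decode and matches by code point, so the match does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: "Überprüfen Sie den Namen" as cp437 bytes
# (0x9A is U-umlaut and 0x81 is u-umlaut in cp437).
my $raw = "\x9Aberpr\x81fen Sie den Namen";

# Decode the bytes into Perl characters.
my $text = decode('cp437', $raw);

# Match using code points: U+00DC (capital U-umlaut), U+00FC (small u-umlaut).
my $matched = $text =~ /\x{DC}berpr\x{FC}fen/ ? 1 : 0;
print "matched: $matched\n";
```

If the real console code page differs from cp437, the decode call would need that encoding name instead.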
Re: What encoding am I (probably) using? by ysth (Canon) on May 13, 2005 at 13:25 UTC
> UPDATE: Thanks monks, Encode::Guess looks good. I'm going to go try it out.

A quick read of the doc does not seem to indicate this would guess well among 1-byte encodings. Ah, yes; it says:

    CAVEATS
    Because of the algorithm used, ISO-8859 series and other single-byte encodings do not work well unless either one of ISO-8859 is the only one suspect (besides ascii and utf8).
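The caveat can be seen in a small sketch. The sample bytes are a hand-written assumption, not real program output. With a single single-byte suspect, guess_encoding returns an encoding object. With two single-byte suspects that can both decode the bytes, it returns a plain string describing the ambiguity instead.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical sample: "Überprüfen" as cp437 bytes.
my $bytes = "\x9Aberpr\x81fen";

# One single-byte suspect (besides the default ascii and utf8).
my $one = guess_encoding($bytes, 'cp437');

# Two single-byte suspects: both can decode every byte, so the guess is ambiguous.
my $two = guess_encoding($bytes, qw(cp437 iso-8859-1));

# guess_encoding returns an object on success and an error string otherwise.
for my $guess ($one, $two) {
    print ref($guess) ? "guessed: " . $guess->name . "\n"
                      : "no guess: $guess\n";
}
```

This is why checking ref() on the return value, as the code in the next reply does, matters: a failed guess is a true value too.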
Re^2: What encoding am I (probably) using? by tphyahoo (Vicar) on May 13, 2005 at 13:58 UTC
Yes, unfortunately my experience so far seems to bear this out. After mucking around for a while I came up with the following code, which doesn't really solve anything but perhaps may inspire one wiser than me to share a better solution...

    use warnings;
    use strict;
    use PPM::Repositories;
    use Encode::Guess;

    # OS call on German WinXP
    my $result = `ping -n 1 jenda.krinicky.cz ` . "\n";
    my $encoding;

    # works, as expected.
    print "cp437:\n";
    $encoding = guess_encoding_cp437($result);
    if ( ref( $encoding ) ) {
        test_ping_result($result, $encoding->name);
    }
    else {
        print "Couldn't guess encoding.\n";
    }

    # doesn't work
    print "default:\n";
    $encoding = guess_encoding_default($result);
    if ( ref( $encoding ) ) {
        test_ping_result($result, $encoding->name);
    }
    else {
        print "Couldn't guess encoding.\n";
    }

    # doesn't work.
    print "kitchen sink:\n";
    $encoding = guess_encoding_kitchen_sink($result);
    if ( ref( $encoding ) ) {
        test_ping_result($result, $encoding->name);
    }
    else {
        print "Couldn't guess encoding.\n";
    }

    sub test_ping_result {
        my $result   = shift;
        my $encoding = shift;
        Encode::from_to($result, "$encoding", 'iso-8859-1');
        print "encoding: $encoding\n";
        print "result: $result\n";
        if ($result =~ /Überprüfen/) { # should match but fails because of german characters
            print "Ping timed out \n";
        }
        else { # good repository.
            print "Ping ok \n";
        }
    }

    # Only cp437 is offered as a suspect (besides the defaults).
    sub guess_encoding_cp437 {
        my $data = shift;
        my $enc = guess_encoding($data, ('cp437'));
        return $enc;
    }

    # Only the default suspects are tried.
    sub guess_encoding_default {
        my $data = shift;
        my $enc = guess_encoding($data);
        return $enc;
    }

    # Every encoding that Encode currently has loaded is offered as a suspect.
    sub guess_encoding_kitchen_sink {
        my $data = shift;
        my $enc = guess_encoding($data, ( Encode->encodings() ) );
        return $enc;
    }

    __END__
    Outputs:
    cp437:
    encoding: cp437
    result: Ping-Anforderung konnte Host "jenda.krinicky.cz" nicht finden. Überprüfen Sie den Namen, und versuchen Sie es erneut.

    Ping timed out
    default:
    Couldn't guess encoding.
    kitchen sink:
    Couldn't guess encoding.
Re: What encoding am I (probably) using? by graff (Chancellor) on Sep 21, 2005 at 15:15 UTC
Sorry to be responding so late on this -- maybe you've already worked out everything I'm going to say, but I'll say it anyway.

> I want to use Encode::from_to(...) to put everything into iso-8859-1 in (probable) good form.

No. If you're expecting to pull in data from various web sites that might use several different single-byte legacy encodings, most of them will not be directly mappable to iso-8859-1. The whole problem with the legacy single-byte encodings is that, to the extent they differ from one another, you cannot map from one to another without losing some characters.

Actually, to the extent that some 8-bit encodings cover fewer displayable characters than others (e.g. iso-8859-* never use 0x80-0x9f for displayable characters, whereas the Windows and Mac code pages always do), loss of information might only happen in one direction. But if your "from" encoding happens to be 8859-2 and your "to" encoding happens to be 8859-1, the conversion simply cannot work.

So, always convert from some non-Unicode encoding to utf8.

As for guessing correctly from among several 8-bit code pages that cover different latin-alphabet-based languages, the sad truth remains that Encode::Guess will have a hard time getting it right. You need a certain amount of language modeling data (validated by manual inspection and labeling as to language and character set) and some simple statistics on your unknown input data in order to make a proper guess.
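The 8859-2 to 8859-1 point can be illustrated with a short sketch. The sample string is an assumption chosen for illustration: the Polish city name "łódź" written as iso-8859-2 bytes. Converting it to iso-8859-1 with from_to loses the two letters that iso-8859-1 lacks, while decoding to Perl characters and encoding as UTF-8 round-trips cleanly.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Hypothetical sample: "łódź" as iso-8859-2 bytes
# (0xB3 = l-stroke, 0xF3 = o-acute, 0xBC = z-acute).
my $latin2 = "\xB3\xF3d\xBC";

# Lossy route: iso-8859-1 has no l-stroke or z-acute, so from_to()
# replaces them with a substitution character.
my $lossy = $latin2;
from_to($lossy, 'iso-8859-2', 'iso-8859-1');

# Lossless route: decode to Perl characters, then encode as UTF-8.
my $chars = decode('iso-8859-2', $latin2);
my $utf8  = encode('UTF-8', $chars);

# Turn both results back into characters and compare with the original text.
my $lossy_ok = decode('iso-8859-1', $lossy) eq $chars ? 1 : 0;
my $utf8_ok  = decode('UTF-8', $utf8)       eq $chars ? 1 : 0;
print "preserved via iso-8859-1: $lossy_ok\n";
print "preserved via UTF-8:      $utf8_ok\n";
```

Note that this only shows the conversion step. It still assumes the source encoding is already known, which is the harder guessing problem described above.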

