Re: Character encoding woes

Re^2: Character encoding woes - unicode or not? by japhy (Canon) on Jan 31, 2012 at 15:08 UTC
Encode::Detect is my new best friend, thanks to Anonymous Monk.

Jeffrey Pinyan (Perl, PHP ugh, JavaScript), @PrayingTheMass
Melius servire volo
Catholic Liturgy
Re^3: Character encoding woes - unicode or not? by tchrist (Pilgrim) on Feb 02, 2012 at 13:13 UTC
"Encode::Detect is my new best friend, thanks to Anonymous Monk."

Then you don’t know it very well. It’s really rather poor. Here are the results on a small sampling of real-world test files:

    Status  Filename   Right       E::D::Detector
    ==================================================
    Wrong!  7843118    ascii       UNABLE TO GUESS
            9501430    utf8        UTF-8
            10318897   cp1252      windows-1252
            10329150   cp1252      windows-1252
    Wrong!  10358003   MacRoman    UNABLE TO GUESS
    Wrong!  10358042   MacRoman    UNABLE TO GUESS
    Wrong!  10429209   MacRoman    UNABLE TO GUESS
            10482611   cp1252      windows-1252
    Wrong!  10542098   MacRoman    UNABLE TO GUESS
            10617571   cp1252      windows-1252
    Wrong!  10625668   iso-8859-1  windows-1252
            10676968   cp1252      windows-1252
            10677497   cp1252      windows-1252
            10963661   MacRoman    UNABLE TO GUESS
            11042188   macRoman    UNABLE TO GUESS
            11212329   utf8        UTF-8
            11287402   cp1252      windows-1252
    Wrong!  11470876   MacRoman    windows-1252
            11842027   iso-8859-1  windows-1252
    Wrong!  11940257   ascii       UNABLE TO GUESS
    Wrong!  11972335   MacRoman    UNABLE TO GUESS
    Wrong!  12091502   iso-8859-1  windows-1252
            12169614   utf8        UTF-8
            12495435   MacRoman    windows-1252
            12736309   MacRoman    windows-1252
            14641909   MacRoman    windows-1252
            14652344   utf8        UTF-8
            14751857   cp1252      windows-1252
            15037632   cp1252      windows-1252
            15070898   cp1252      windows-1252
    Wrong!  15154606   MacRoman    windows-1252
            15201223   cp1252      windows-1252
    Wrong!  15315962   iso-8859-1  Big5
            15328020   cp1252      windows-1252
    Wrong!  17298172   MacRoman    windows-1252
    Wrong!  116059400  MacRoman    windows-1252

I have a working snapshot of a module called Encode::Guess::Educated that actually does work right on such things. It has no non-core dependencies. It is designed to detect the encoding of English-language biomedical research papers. It can reliably detect not merely ASCII and UTF-{8,16,32}, but also the mutually conflicting 8-bit encodings. The reason it can do this is that it works off a training model.
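To make the idea of guessing against a training model concrete, here is a minimal sketch using only the core Encode module. It is not the API or the algorithm of Encode::Guess::Educated, which this post does not show. The frequency table, the floor value, and the names `score_as` and `guess_encoding` are all hypothetical, chosen purely for illustration: the same bytes are decoded under each candidate encoding, and each reading is scored by how plausible its non-ASCII characters are under the model.

```perl
use strict;
use warnings;
use Encode qw(decode);

# HYPOTHETICAL toy model: relative frequencies of non-ASCII characters.
# A real model would be trained on a large corpus; these numbers are invented.
my %freq = (
    "\x{2013}" => 0.30,    # EN DASH
    "\x{00A9}" => 0.10,    # COPYRIGHT SIGN
    "\x{00FC}" => 0.05,    # LATIN SMALL LETTER U WITH DIAERESIS
    "\x{00E1}" => 0.05,    # LATIN SMALL LETTER A WITH ACUTE
    "\x{00F1}" => 0.02,    # LATIN SMALL LETTER N WITH TILDE
);
my $FLOOR = 1e-6;          # assumed probability for characters the model never saw

# Decode the octets as $encoding and sum the log-probabilities
# of every non-ASCII character in the result.
sub score_as {
    my ($encoding, $octets) = @_;
    my $text  = decode($encoding, $octets);
    my $score = 0;
    for my $ch ($text =~ /[^\x00-\x7F]/g) {
        $score += log($freq{$ch} // $FLOOR);
    }
    return $score;
}

# Score every candidate and return the best one plus all the scores.
sub guess_encoding {
    my ($octets, @candidates) = @_;
    my %score;
    $score{$_} = score_as($_, $octets) for @candidates;
    my ($best) = sort { $score{$b} <=> $score{$a} } @candidates;
    return ($best, \%score);
}

# The bytes FC, E1, 96, A9 mean different things in each 8-bit encoding.
my $octets = "S\xFCdhof, Llin\xE1s \x96 \xA9 2001";
my ($best, $scores) = guess_encoding($octets, qw(cp1252 MacRoman iso-8859-1));

printf "%-10s %+.3f\n", $_, $scores->{$_}
    for sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
print "best guess: $best\n";
```

With this toy table the cp1252 reading wins, because it is the only one in which all four high bytes decode to characters the model considers likely. Under MacRoman two of them become a cedilla and a middle dot, and under iso-8859-1 the byte 0x96 becomes a C1 control character.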
I looked at three different corpora to do this: one containing 3½M non-ASCII codepoints, one containing 14M of them, and one containing 29M of them. It makes an educated guess based on conformance to a particular model. And it does very well.

Right now it has only a CLI API and an OO API, no `Exporter`-based one. Here’s the easiest way to use the CLI API, via a simple program called gank:

    $ gank 011526914.txt
    cp1252
    $ gank 00*.txt Sym*.txt 0115*.txt
    001313968.txt: ascii
    001328180.txt: utf8
    007499277.txt: iso-8859-1
    Symbola602.txt: UTF-16
    011526914.txt: cp1252
    011535589.txt: iso-8859-1
    011570876.txt: MacRoman

The underlying class’s default training model derives from the complete PubMed Open Access corpus, and it therefore attains an extremely high measured accuracy of 99.79% when used on English-language biomedical texts. It also does well on other texts using any Latin-based alphabet. I have comparative statistics using two alternate training models, but the PMCOA model is fine for most purposes.

You may also give gank a `-s` option to give you a short ‘score-card’ of the various encodings it considered:

    91.718532 +2.285393 MacRoman
     3.640513 -0.941206 iso-8859-1, iso-8859-15, cp1252
     3.639257 -0.941552 cp1250
     1.001698 -2.231634 iso-8859-2

EXPLANATION: The first column is all scores normalized to 0..100. The second column is the natural log of the real score. The rest lists the encodings that have that score, in order of preference for breaking ties. I have it arranged so that it reports the smallest subset that works; i.e., ascii < latin1 < cp1252, etc.

There’s also a `-l` option to give you a long report that illustrates what the text would look like under each possible choice of encoding, with paired lines of literal UTF-8 and `\N{...}` named characters:

    total bytes=15903, high bytes=22, distinct high bytes=8
    49.582509 +0.909655 cp1252
      => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marquèze, Llinás. ScienceDirect® is"
      => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER E WITH GRAVE}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
    49.557280 +0.909146 cp1250
      => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marqučze, Llinás. ScienceDirect® is"
      => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER C WITH CARON}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
     0.860211 -3.144560 MacRoman
      => "IñV, Copyright © 2001 Outline ï Acknowledgements 12000◊ g S¸dhof MarquËze, Llin·s. ScienceDirectÆ is"
      => "I\N{LATIN SMALL LETTER N WITH TILDE}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{LATIN SMALL LETTER I WITH DIAERESIS} Acknowledgements 12000\N{LOZENGE} g S\N{CEDILLA}dhof Marqu\N{LATIN CAPITAL LETTER E WITH DIAERESIS}ze, Llin\N{MIDDLE DOT}s. ScienceDirect\N{LATIN CAPITAL LETTER AE} is"

I need to do more work on its API and, of course, document it. This is just a proof of concept, although it does come with a halfway decent test suite. But I’m hunkered down right now correcting page-proofs on Camel4, so I probably won’t get to sprucing up the module for another 7–10 days.

--tom
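A note on reading the `-s` score-card: the four sample lines are consistent with the first column being each raw score divided by the sum of the raw scores and multiplied by 100, where the raw score is `exp` of the second column and each tied group counts once. That is an inference from the sample numbers, not something taken from the module's source, and the helper name `normalize` is made up for this sketch.

```perl
use strict;
use warnings;

# Second column of the sample score-card: natural log of each raw score.
# Tied encodings share one line, so each group appears once.
my @card = (
    [ 'MacRoman',                         2.285393 ],
    [ 'iso-8859-1, iso-8859-15, cp1252', -0.941206 ],
    [ 'cp1250',                          -0.941552 ],
    [ 'iso-8859-2',                      -2.231634 ],
);

# Turn log-scores back into raw scores, then rescale so they sum to 100.
# (Assumed normalization; inferred from the sample output only.)
sub normalize {
    my @logs  = @_;
    my @raw   = map { exp($_) } @logs;
    my $total = 0;
    $total += $_ for @raw;
    return map { 100 * $_ / $total } @raw;
}

my @pct = normalize(map { $_->[1] } @card);
for my $i (0 .. $#card) {
    printf "%10.6f %+.6f %s\n", $pct[$i], $card[$i][1], $card[$i][0];
}
```

The recomputed first column should agree with the sample score-card up to the rounding of the six-decimal logs, which also explains why the four printed percentages add up to 100.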

