PerlMonks
    Encode::Detect is my new best friend, thanks to Anonymous Monk.

Then you don't know it very well. It’s really rather poor. Here are the failure results on a small sampling of real-world test files:

I have a working snapshot of a module called Encode::Guess::Educated that actually does work right on such things. It has no noncore dependencies. It is designed to detect the encoding of English-language biomedical research papers. It can reliably detect not merely ASCII and UTF-{8,16,32}, but also the 8-bit encodings that conflict so heavily with one another.

The reason it can do this is that it works off a training model. I looked at three different corpora to do this: one containing 3½M non-ASCII codepoints, one containing 14M of them, and one containing 29M of them. It makes an educated guess based on conformance to a particular model. And it does very well.

Right now it has only a CLI API and an OO API, no Export-based one. Here’s the easiest way to use the CLI API, via a simple program called gank:

The underlying class’s default training model derives from the complete PubMed Open Access corpus, and it therefore attains an extremely high measured accuracy of 99.79% when used on English-language biomedical texts. It also does well on other texts using any Latin-based alphabet. I have comparative statistics using two alternate training models, but the PMCOA model is fine for most purposes.

You may also give gank a -s option to give you a short ‘score-card’ of the various encodings it considered:

EXPLANATION:
total bytes=15903, high bytes=22, distinct high bytes=8

*49.582509 +0.909655 cp1252
    => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marquèze, Llinás. ScienceDirect® is"
    => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER E WITH GRAVE}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"

 49.557280 +0.909146 cp1250
    => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marqučze, Llinás. ScienceDirect® is"
    => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER C WITH CARON}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"

  0.860211 -3.144560 MacRoman
    => "IñV, Copyright © 2001 Outline ï Acknowledgements 12000◊ g S¸dhof MarquËze, Llin·s. ScienceDirectÆ is"
    => "I\N{LATIN SMALL LETTER N WITH TILDE}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{LATIN SMALL LETTER I WITH DIAERESIS} Acknowledgements 12000\N{LOZENGE} g S\N{CEDILLA}dhof Marqu\N{LATIN CAPITAL LETTER E WITH DIAERESIS}ze, Llin\N{MIDDLE DOT}s. ScienceDirect\N{LATIN CAPITAL LETTER AE} is"

I need to do more work on its API and, of course, document it. This is just a proof of concept, although it does come with a halfway decent test suite. I’m hunkered down right now correcting page-proofs on Camel4, so I probably won’t get to sprucing up the module for another 7–10 days.

--tom

In reply to Re^3: Character encoding woes - unicode or not?
by tchrist
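
For readers who want a feel for what "conformance to a training model" can mean, here is a minimal sketch of one way to rank conflicting 8-bit encodings. It is not the code or the API of Encode::Guess::Educated, which the post does not show. The frequency table %model, the floor probability, and the subroutine names score_encoding and guess_encoding are all hypothetical, invented for illustration. Only Encode::decode and the encoding names cp1252, cp1250 and MacRoman are real. The idea: decode the same bytes under each candidate, then sum the log-probability of every non-ASCII code point that results, so a candidate producing characters the model has rarely or never seen scores badly.

```perl
use strict;
use warnings;
use Encode qw(decode);

# HYPOTHETICAL toy model: made-up relative frequencies of non-ASCII
# code points. A real model would be trained on a large corpus.
my %model = (
    0x2013 => 0.30,    # EN DASH
    0x00A9 => 0.10,    # COPYRIGHT SIGN
    0x00FC => 0.25,    # LATIN SMALL LETTER U WITH DIAERESIS
    0x00E8 => 0.20,    # LATIN SMALL LETTER E WITH GRAVE
    0x00E1 => 0.15,    # LATIN SMALL LETTER A WITH ACUTE
);
my $floor = 1e-6;      # assumed probability for code points absent from the model

# Sum of log-probabilities of the non-ASCII code points obtained by
# decoding $bytes as $encoding. Higher (closer to zero) is better.
sub score_encoding {
    my ($encoding, $bytes) = @_;
    my $text  = decode($encoding, $bytes);
    my $score = 0;
    for my $cp (map { ord($_) } $text =~ /[^\x00-\x7F]/g) {
        $score += log($model{$cp} // $floor);
    }
    return $score;
}

# Score every candidate and return the best name plus all scores.
sub guess_encoding {
    my ($bytes, @candidates) = @_;
    my %score;
    $score{$_} = score_encoding($_, $bytes) for @candidates;
    my ($best) = sort { $score{$b} <=> $score{$a} || $a cmp $b } @candidates;
    return ($best, \%score);
}

# Raw bytes that read "Südhof Marquèze, Llinás, I–V" in cp1252.
my $bytes = "S\xFCdhof Marqu\xE8ze, Llin\xE1s, I\x96V";
my ($best, $scores) = guess_encoding($bytes, qw(cp1252 cp1250 MacRoman));

printf "%-8s %8.3f\n", $_, $scores->{$_}
    for sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
print "best guess: $best\n";
```

With this toy table the sketch ranks cp1252 first, cp1250 second (it differs only in turning byte E8 into a c with caron, which the table has never seen), and MacRoman last, the same ordering as the score-card in the post. The numbers themselves are not comparable to gank's.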