Re^2: Character encoding woes

in reply to Re: Character encoding woes - unicode or not?
in thread Character encoding woes - unicode or not?

Encode::Detect is my new best friend, thanks to Anonymous Monk.

Jeffrey Pinyan (Perl, PHP ugh, JavaScript) — @PrayingTheMass
Melius servire volo
Catholic Liturgy


Replies are listed 'Best First'.
Re^3: Character encoding woes - unicode or not? by tchrist (Pilgrim) on Feb 02, 2012 at 13:13 UTC
“Encode::Detect is my new best friend, thanks to Anonymous Monk.”

Then you don’t know it very well. It’s really rather poor. Here are the failure results on a small sampling of real-world test files:

     Status  Filename       Right       E::D::Detector
     ==================================================
     Wrong!   7843118       ascii       UNABLE TO GUESS
              9501430       utf8        UTF-8
             10318897       cp1252      windows-1252
             10329150       cp1252      windows-1252
     Wrong!  10358003       MacRoman    UNABLE TO GUESS
     Wrong!  10358042       MacRoman    UNABLE TO GUESS
     Wrong!  10429209       MacRoman    UNABLE TO GUESS
             10482611       cp1252      windows-1252
     Wrong!  10542098       MacRoman    UNABLE TO GUESS
             10617571       cp1252      windows-1252
     Wrong!  10625668       iso-8859-1  windows-1252
             10676968       cp1252      windows-1252
             10677497       cp1252      windows-1252
             10963661       MacRoman    UNABLE TO GUESS
             11042188       macRoman    UNABLE TO GUESS
             11212329       utf8        UTF-8
             11287402       cp1252      windows-1252
     Wrong!  11470876       MacRoman    windows-1252
             11842027       iso-8859-1  windows-1252
     Wrong!  11940257       ascii       UNABLE TO GUESS
     Wrong!  11972335       MacRoman    UNABLE TO GUESS
     Wrong!  12091502       iso-8859-1  windows-1252
             12169614       utf8        UTF-8
             12495435       MacRoman    windows-1252
             12736309       MacRoman    windows-1252
             14641909       MacRoman    windows-1252
             14652344       utf8        UTF-8
             14751857       cp1252      windows-1252
             15037632       cp1252      windows-1252
             15070898       cp1252      windows-1252
     Wrong!  15154606       MacRoman    windows-1252
             15201223       cp1252      windows-1252
     Wrong!  15315962       iso-8859-1  Big5
             15328020       cp1252      windows-1252
     Wrong!  17298172       MacRoman    windows-1252
     Wrong! 116059400       MacRoman    windows-1252

I have a working snapshot of a module called Encode::Guess::Educated that actually does work right on such things. It has no non-core dependencies. It is designed to detect the encoding of English-language biomedical research papers. It can reliably detect not merely ASCII and UTF-{8,16,32}, but also the very conflicting 8-bit encodings. The reason it can do this is that it works off a training model.
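
To see why the 8-bit encodings conflict so badly, here is a minimal sketch in Python (not Perl, and not the module's own code). The sample bytes and the helper names `try_decode` and `describe` are made up for illustration. The same bytes are invalid as ASCII and UTF-8, yet decode without error under every 8-bit candidate, so a plain validity check cannot choose between them.

```python
import unicodedata

# Hypothetical sample: bytes as they might sit in a legacy 8-bit file.
raw = b"S\xfcdhof, Marqu\xe8ze, Llin\xe1s"

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

def describe(text):
    """Spell out every non-ASCII character by its Unicode name."""
    return "".join(
        c if ord(c) < 128 else "\\N{%s}" % unicodedata.name(c)
        for c in text
    )

for enc in ("ascii", "utf-8", "cp1252", "cp1250", "mac_roman"):
    text = try_decode(raw, enc)
    print(enc, "=>", "invalid" if text is None else describe(text))
```

Every 8-bit candidate yields some string; only knowledge of which characters are plausible in real text can rank them, and that is the gap a training model fills.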

To train it, I looked at three different corpora: one containing 3½M non-ASCII codepoints, one containing 14M of them, and one containing 29M of them. It makes an educated guess based on conformance to a particular model. And it does very well.

Right now it has only a CLI API and an OO API, no `Export`-based one. Here’s the easiest way to use the CLI API, via a simple program called gank:

    $ gank 011526914.txt
    cp1252

    $ gank 00*.txt Sym*.txt 0115*.txt
    001313968.txt: ascii
    001328180.txt: utf8
    007499277.txt: iso-8859-1
    Symbola602.txt: UTF-16
    011526914.txt: cp1252
    011535589.txt: iso-8859-1
    011570876.txt: MacRoman

The underlying class’s default training model derives from the complete PubMed Open Access corpus, and it therefore attains an extremely high measured accuracy of 99.79% when used on English-language biomedical texts. It also does well on other texts using any Latin-based alphabet. I have comparative statistics using two alternate training models, but the PMCOA model is fine for most purposes.

You may also give gank a `-s` option to give you a short ‘score-card’ of the various encodings it considered:

      *91.718532 +2.285393 MacRoman
        3.640513 -0.941206 iso-8859-1, iso-8859-15, cp1252
        3.639257 -0.941552 cp1250
        1.001698 -2.231634 iso-8859-2

EXPLANATION: The first column is all scores normalized to 0..100. The second column is the natural log of the real score. The rest is which encodings have that score, listed in order of preference for breaking ties. I have it arranged so that it reports the smallest subset that works; i.e., ascii < latin1 < cp1252, etc.

There’s also a `-l` option to give you a long report that illustrates what each possible choice would be if it were in that encoding, with paired lines of literal UTF-8 and `\N{...}` named characters.

    total bytes=15903, high bytes=22, distinct high bytes=8
      *49.582509 +0.909655 cp1252
          => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marquèze, Llinás. ScienceDirect® is"
          => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER E WITH GRAVE}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
       49.557280 +0.909146 cp1250
          => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marqučze, Llinás. ScienceDirect® is"
          => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER C WITH CARON}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
        0.860211 -3.144560 MacRoman
          => "IñV, Copyright © 2001 Outline ï Acknowledgements 12000◊ g S¸dhof MarquËze, Llin·s. ScienceDirectÆ is"
          => "I\N{LATIN SMALL LETTER N WITH TILDE}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{LATIN SMALL LETTER I WITH DIAERESIS} Acknowledgements 12000\N{LOZENGE} g S\N{CEDILLA}dhof Marqu\N{LATIN CAPITAL LETTER E WITH DIAERESIS}ze, Llin\N{MIDDLE DOT}s. ScienceDirect\N{LATIN CAPITAL LETTER AE} is"

I need to do more work on its API (this is just a proof of concept, although it does come with a halfway decent test suite) and of course document it, but I’m hunkered down right now correcting page-proofs on Camel4, so I probably won’t get to sprucing up the module for another 7–10 days.

--tom
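
A note on the `-s` score-card columns: the two numeric columns appear to be related by exponentiating the log score and taking each line's share of the total. That relation is an assumption inferred from the printed numbers, not something read out of the module's source. The Python sketch below recomputes the first column from the second; the `normalize` helper is a made-up name.

```python
import math

# (natural-log score, encodings sharing that score), copied from the
# -s score-card shown in the post.
scorecard = [
    (+2.285393, "MacRoman"),
    (-0.941206, "iso-8859-1, iso-8859-15, cp1252"),
    (-0.941552, "cp1250"),
    (-2.231634, "iso-8859-2"),
]

def normalize(log_scores):
    """Turn natural-log scores into percentages that sum to 100.

    Assumption: each percentage is exp(log score) divided by the sum
    of exp(log score) over all listed lines.
    """
    raw = [math.exp(s) for s in log_scores]
    total = sum(raw)
    return [100.0 * r / total for r in raw]

percents = normalize([log_score for log_score, _ in scorecard])
for pct, (log_score, names) in zip(percents, scorecard):
    print("%10.6f %+9.6f %s" % (pct, log_score, names))
```

The recomputed percentages should agree with the printed first column up to the rounding of the six-decimal log values, which suggests that a tied group of encodings is counted once per line rather than once per encoding.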

