PerlMonks  

Re^2: Character encoding woes - unicode or not?

by japhy (Canon)
on Jan 31, 2012 at 15:08 UTC (#950997)


in reply to Re: Character encoding woes - unicode or not?
in thread Character encoding woes - unicode or not?

Encode::Detect is my new best friend, thanks to Anonymous Monk.

Jeffrey Pinyan (Perl, PHP ugh, JavaScript) — @PrayingTheMass
Melius servire volo
Catholic Liturgy


Re^3: Character encoding woes - unicode or not?
by tchrist (Pilgrim) on Feb 02, 2012 at 13:13 UTC
    Encode::Detect is my new best friend, thanks to Anonymous Monk.
    Then you don't know it very well. It’s really rather poor. Here are the failure results on a small sampling of real-world test files:
     Status   Filename  Right       E::D::Detector
    ==================================================
     Wrong!    7843118  ascii       UNABLE TO GUESS
               9501430  utf8        UTF-8
              10318897  cp1252      windows-1252
              10329150  cp1252      windows-1252
     Wrong!   10358003  MacRoman    UNABLE TO GUESS
     Wrong!   10358042  MacRoman    UNABLE TO GUESS
     Wrong!   10429209  MacRoman    UNABLE TO GUESS
              10482611  cp1252      windows-1252
     Wrong!   10542098  MacRoman    UNABLE TO GUESS
              10617571  cp1252      windows-1252
     Wrong!   10625668  iso-8859-1  windows-1252
              10676968  cp1252      windows-1252
              10677497  cp1252      windows-1252
              10963661  MacRoman    UNABLE TO GUESS
              11042188  macRoman    UNABLE TO GUESS
              11212329  utf8        UTF-8
              11287402  cp1252      windows-1252
     Wrong!   11470876  MacRoman    windows-1252
              11842027  iso-8859-1  windows-1252
     Wrong!   11940257  ascii       UNABLE TO GUESS
     Wrong!   11972335  MacRoman    UNABLE TO GUESS
     Wrong!   12091502  iso-8859-1  windows-1252
              12169614  utf8        UTF-8
              12495435  MacRoman    windows-1252
              12736309  MacRoman    windows-1252
              14641909  MacRoman    windows-1252
              14652344  utf8        UTF-8
              14751857  cp1252      windows-1252
              15037632  cp1252      windows-1252
              15070898  cp1252      windows-1252
     Wrong!   15154606  MacRoman    windows-1252
              15201223  cp1252      windows-1252
     Wrong!   15315962  iso-8859-1  Big5
              15328020  cp1252      windows-1252
     Wrong!   17298172  MacRoman    windows-1252
     Wrong!  116059400  MacRoman    windows-1252
    I have a working snapshot of a module that actually does work right on such things called Encode::Guess::Educated. It has no noncore dependencies. It is designed to detect the encoding of English-language biomedical research papers. It can reliably detect not merely ASCII and UTF-{8,16,32}, but also the very conflicting 8-bit encodings.
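To illustrate why the 8-bit encodings are the hard part, here is a minimal sketch. It is written in Python purely for illustration and is not code from Encode::Guess::Educated; `structural_guess` is a hypothetical name, and UTF-16/32 are left out for brevity. ASCII and UTF-8 can be confirmed or ruled out by strict decoding alone, whereas the same bytes decode without error under several 8-bit encodings, so validity cannot choose among them.

```python
def structural_guess(data: bytes):
    """Return 'ascii' or 'utf8' if strict decoding settles it, else None."""
    for name, codec in (("ascii", "ascii"), ("utf8", "utf-8")):
        try:
            data.decode(codec)  # strict error handling is the default
            return name
        except UnicodeDecodeError:
            continue
    return None  # some 8-bit encoding; structure alone cannot say which


sample = b"S\xfcdhof"  # byte 0xFC is not valid UTF-8 in this position

# Every 8-bit candidate accepts the bytes, each with a different reading.
candidates = {codec: sample.decode(codec)
              for codec in ("cp1252", "cp1250", "mac_roman")}

print(structural_guess(sample))
print(candidates)
```

All three 8-bit decodes succeed, which is why something beyond byte validity is needed to pick one.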

    The reason it can do this is that it works off a training model. I looked at three different corpora to do this: one containing 3½M non-ASCII codepoints, one containing 14M of them, and one containing 29M of them. It makes an educated guess based on conformance to a particular model. And it does very well.
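The post does not say what features the training model uses, so the following is only a guess at the general shape of such an approach, sketched in Python. It assumes a unigram model over non-ASCII code points; `train`, `score`, `guess`, the `UNSEEN` floor, and the tiny training string are all hypothetical stand-ins, not anything from the module.

```python
import math
from collections import Counter


def train(corpus: str) -> dict:
    """Unigram log-probabilities of the non-ASCII code points in a training text."""
    counts = Counter(ch for ch in corpus if ord(ch) > 0x7F)
    total = sum(counts.values())
    return {ch: math.log(n / total) for ch, n in counts.items()}


UNSEEN = math.log(1e-6)  # assumed floor for code points absent from the training text


def score(data: bytes, codec: str, model: dict) -> float:
    """Log-likelihood of the high characters in data when decoded as codec."""
    try:
        text = data.decode(codec)
    except UnicodeDecodeError:
        return float("-inf")  # the bytes are not even valid in this encoding
    return sum(model.get(ch, UNSEEN) for ch in text if ord(ch) > 0x7F)


def guess(data: bytes, codecs: list, model: dict) -> str:
    """Pick the candidate whose decoding best conforms to the model."""
    return max(codecs, key=lambda c: score(data, c, model))


# Toy training text standing in for a real corpus.
model = train("S\u00fcdhof and Llin\u00e1s measured 12000\u00d7 g; Marqu\u00e8ze agreed. \u00a9 2001")
best = guess(b"Llin\xe1s, S\xfcdhof", ["cp1252", "mac_roman"], model)
print(best)
```

The cp1252 reading yields characters the toy model has seen, while the MacRoman reading yields a middle dot and a cedilla it has not, so cp1252 scores higher. A real model would need a far larger corpus, as the post describes.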

    Right now it has only a CLI API and an OO API, no Export-based one. Here’s the easiest way to use the CLI API, via a simple program called gank:

    $ gank 011526914.txt
    cp1252
    $ gank 00*.txt Sym*.txt 0115*.txt
    001313968.txt: ascii
    001328180.txt: utf8
    007499277.txt: iso-8859-1
    Symbola602.txt: UTF-16
    011526914.txt: cp1252
    011535589.txt: iso-8859-1
    011570876.txt: MacRoman
    The underlying class’s default training model derives from the complete PubMed Open Access corpus, and it therefore attains an extremely high measured accuracy of 99.79% when used on English-language biomedical texts. It also does well on other texts using any Latin-based alphabet. I have comparative statistics using two alternate training models, but the PMCOA model is fine for most purposes.

    You may also give gank a -s option to give you a short ‘score-card’ of the various encodings it considered:

    *91.718532 +2.285393 MacRoman
      3.640513 -0.941206 iso-8859-1, iso-8859-15, cp1252
      3.639257 -0.941552 cp1250
      1.001698 -2.231634 iso-8859-2
    EXPLANATION:
    • The first column is all scores normalized to 0..100.
    • The second column is the natural log of the real score.
    • The rest lists the encodings that share that score, in order of preference for breaking ties. Ties are arranged so that the smallest subset that works comes first; i.e., ascii < latin1 < cp1252, etc.
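The two numeric columns appear to be related in a simple way: exponentiating the second column and dividing by the sum of all rows reproduces the first column. That relationship is inferred from the numbers shown, not stated by the author, and `score_card` below is a hypothetical helper written in Python for illustration only.

```python
import math


def score_card(log_scores: dict) -> list:
    """Rows of (normalized 0..100, natural-log score, label), best first."""
    total = sum(math.exp(v) for v in log_scores.values())
    rows = [(100.0 * math.exp(v) / total, v, label)
            for label, v in log_scores.items()]
    return sorted(rows, reverse=True)


# Natural-log scores copied from the -s output above; encodings that tie
# share one row, so each group is counted once.
card = score_card({
    "MacRoman": 2.285393,
    "iso-8859-1, iso-8859-15, cp1252": -0.941206,
    "cp1250": -0.941552,
    "iso-8859-2": -2.231634,
})

for pct, log_score, label in card:
    print(f"{pct:10.6f} {log_score:+.6f} {label}")
```

The sketch does not model the tie-breaking preference order; it only shows how the 0..100 column can be derived from the log scores.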
    There’s also a -l option to give you a long report that illustrates what each possible choice would look like if the text were in that encoding, with paired lines of literal UTF-8 and \N{...} named characters.
    total bytes=15903, high bytes=22, distinct high bytes=8
      *49.582509 +0.909655 cp1252
          => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marquèze, Llinás. ScienceDirect® is"
          => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER E WITH GRAVE}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
       49.557280 +0.909146 cp1250
          => "I–V, Copyright © 2001 Outline • Acknowledgements 12000× g Südhof Marqučze, Llinás. ScienceDirect® is"
          => "I\N{EN DASH}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{BULLET} Acknowledgements 12000\N{MULTIPLICATION SIGN} g S\N{LATIN SMALL LETTER U WITH DIAERESIS}dhof Marqu\N{LATIN SMALL LETTER C WITH CARON}ze, Llin\N{LATIN SMALL LETTER A WITH ACUTE}s. ScienceDirect\N{REGISTERED SIGN} is"
        0.860211 -3.144560 MacRoman
          => "IñV, Copyright © 2001 Outline ï Acknowledgements 12000◊ g S¸dhof MarquËze, Llin·s. ScienceDirectÆ is"
          => "I\N{LATIN SMALL LETTER N WITH TILDE}V, Copyright \N{COPYRIGHT SIGN} 2001 Outline \N{LATIN SMALL LETTER I WITH DIAERESIS} Acknowledgements 12000\N{LOZENGE} g S\N{CEDILLA}dhof Marqu\N{LATIN CAPITAL LETTER E WITH DIAERESIS}ze, Llin\N{MIDDLE DOT}s. ScienceDirect\N{LATIN CAPITAL LETTER AE} is"
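A report of this kind can be approximated in a few lines. The sketch below is in Python for illustration and is not gank's implementation; `named` and `byte_stats` are hypothetical helpers, and the sample bytes are a short excerpt modeled on the text in the report above.

```python
import unicodedata


def named(text: str) -> str:
    r"""Replace each non-ASCII character with a \N{NAME} escape."""
    return "".join(
        ch if ord(ch) < 0x80
        else "\\N{%s}" % unicodedata.name(ch, "U+%04X" % ord(ch))
        for ch in text
    )


def byte_stats(data: bytes) -> str:
    """Summary line counting the bytes at or above 0x80."""
    high = [b for b in data if b >= 0x80]
    return (f"total bytes={len(data)}, high bytes={len(high)}, "
            f"distinct high bytes={len(set(high))}")


sample = b"I\x96V, S\xfcdhof Marqu\xe8ze"

print(byte_stats(sample))
report = {}
for codec in ("cp1252", "cp1250", "mac_roman"):
    text = sample.decode(codec)
    report[codec] = named(text)
    print(codec)
    print("    =>", text)           # literal characters
    print("    =>", report[codec])  # same text with named escapes
```

Seeing the named characters side by side makes a wrong guess easy to spot: an en dash between "I" and "V" is plausible, while a small n with tilde is not.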
    
    I need to do more work on its API (this is just a proof of concept, although it does come with a halfway decent test suite) and of course document it, but I’m hunkered down right now correcting page-proofs on Camel4, so I probably won’t get to sprucing up the module for another 7–10 days.

    --tom
