
How best to avoid mojibake, when attempting to automatically convert documents to utf-8?

by taint (Chaplain)
on Dec 20, 2013 at 23:22 UTC ( #1068000=perlquestion )
taint has asked for the wisdom of the Perl Monks concerning the following question:

This is a first step in the creation of an All-To-UTF-8 type Perl module. I've spent some time researching whether something already available can accomplish this task: plow recursively through a directory of text documents of unknown encoding/charset, and accurately detect their encoding/charset, and reliably convert them to utf-8.

Yes. But what if they're not correctly defined in the first place?
Well, unless I can figure out a better solution, it's garbage in, garbage out.
No harm done. It wasn't right in the first place, so it would have been, or become, mojibake anyway.

While I recognize that Perl is said to be (very?) good at handling this sort of thing, my personal experience, and other nodes here at PM, indicate there is room for improvement. It is not my intention to "fix Perl" in any way, but rather to create a module that will "ease the pain" often encountered doing such things.

This is a formal solicitation for comments, examples, and recommendations for accomplishing this task. Thus far, I have carefully looked at perlmod, Encode, Encode::Detect, Encode::Detect::Detector, Encode::Encoding, Encode::Guess, Encode::Unicode -- OH, and Bush hid the facts.
Any other suggestions for research? Does anyone already have a method that accomplishes this task? An example of a possible solution?

In summary: at this point, I'm trying to find the best way to detect the current/actual encoding/charset of a document. Methods for the actual conversion will come in a different node/thread, at a later time. :)
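A minimal sketch of what such a detection step could look like, using only the core Encode and Encode::Guess modules. The helper name guess_and_decode and the idea of passing a caller-supplied suspect list are assumptions made for illustration, not an established recipe. It tries strict UTF-8 first, then asks Encode::Guess, and gives up rather than guessing blindly.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: takes raw bytes plus a list of suspect encodings.
# Returns (encoding name, decoded text), or an empty list when unsure.
sub guess_and_decode {
    my ($bytes, @suspects) = @_;

    # Step 1: strict UTF-8. FB_CROAK dies on malformed input and may
    # modify its argument, so decode a copy inside eval.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ('UTF-8', $text) if defined $text;

    # Step 2: Encode::Guess returns an encoding object on success
    # and a plain error string when it cannot decide.
    my $guess = guess_encoding($bytes, @suspects);
    return ($guess->name, $guess->decode($bytes)) if ref $guess;

    return;    # no confident guess; leave the file alone
}

# "caf" followed by byte 0xE9 is not valid UTF-8, so step 2 is used.
my ($enc, $text) = guess_and_decode("caf\xE9", 'cp1252');
print defined $enc ? "guessed $enc\n" : "no confident guess\n";
```

With more than one single-byte suspect (say cp1252 and iso-8859-2), Encode::Guess will usually report the input as ambiguous, so the suspect list has to come from knowledge about where the documents originated.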

Thank you for all your consideration.

--Chris

UPDATE: Just a thanks to Your Mother and moritz, for some hints.
Yes. What say about me, is true.

Re: How best to avoid mojibake, when attempting to automatically convert documents to utf-8?
by Your Mother (Canon) on Dec 20, 2013 at 23:40 UTC
    and accurately detect their encoding/charset, and reliably convert them to utf-8.

    While you can sometimes do a good job, this can't be done reliably. It is a rescue/emergency tactic for when you are confronted with broken data. Different character sets overlap in the bytes they use, sometimes a lot, and a single byte of garbage can wreck detection on an otherwise obvious, valid guess. The modules you list are the way to go, but the two descriptions of this problem you've posted make it feel like an XY problem.
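    To make the overlap concrete, here is a small sketch using only the core Encode module. The two sample bytes and the list of encodings are chosen purely for illustration. The same bytes decode without error under all three encodings, yet they mean different things, so validity alone cannot tell you which reading is correct.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two bytes that are well-formed in several encodings at once.
my $bytes = "\xC3\xA9";

my %reading;
for my $enc (qw(UTF-8 ISO-8859-1 cp1252)) {
    # FB_CROAK dies on malformed input and may modify its argument,
    # so work on a copy and trap the failure with eval.
    my $copy = $bytes;
    my $text = eval { decode($enc, $copy, Encode::FB_CROAK) };

    # Record the resulting code points, or 'invalid' if decoding failed.
    $reading{$enc} = defined($text)
        ? join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $text))
        : 'invalid';
}

print "$_: $reading{$_}\n" for sort keys %reading;
```

    Read as UTF-8 the bytes are one character (U+00E9, e with acute); read as ISO-8859-1 or cp1252 they are two (U+00C3 U+00A9), which is exactly the mojibake the question is about.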

    It's only tangentially related but I recommend reading this—🐪🐫🐪🐫🐪: Why does modern Perl avoid UTF-8 by default?—many times. While there is always room for improvement in any endeavor I suspect digging in and seeing how deep the problems actually run may sober your drive to add to the toolset. Go code diving in those modules and add the Unicode::Tussle scripts to the pile if you are getting through the reading too quickly. :P

      Thank you very much Your Mother, for the reply.

      Sounds discouraging. :(

      Seems like somebody should do it. Maybe a team effort? I dunno. Still attempting to work out all the details.

      "but the two descriptions of this problem you've posted make it feel like an XY problem."
      Any thoughts for a better title? I'm always open for suggestion(s).

      Thanks again, for the reply Your Mother. Looks like I still have a great deal of reading to do, yet. :/

      --Chris

      UPDATE: Why does modern Perl avoid UTF-8 by default? was a great read. Thanks!
      Yes. What say about me, is true.
      
      OK. Looks like I responded too soon. So, in an effort to do your response justice, I'll try to give it a proper response this time. :)

      The article unicode - Why does modern Perl avoid UTF-8 by default? was extremely informative. A big help -- thanks!

      brian d foy's Unicode-Tussle utilities could quite possibly go a long way toward helping me in my current quest. Thanks again.

      In the case of Unicode-Tussle, I should be able to use some of the utilities to at least determine what whatever I'm parsing/gulping/slurping/chomping claims its "code points" are. At the least, they'll likely help in creating the initial phases of testing, or maybe provide some bits I can include in a larger, more conclusive test. It's early, but it looks promising. Well, I've got more research to do. Thanks again for the great links, Your Mother!

      --Chris

      Yes. What say about me, is true.
      
Re: How best to avoid mojibake, when attempting to automatically convert documents to utf-8?
by Jim (Curate) on Dec 23, 2013 at 05:23 UTC
      Jim, thank you very much for the reply.

      Indeed. Those were very pertinent nodes. Seems you also struggle with the lack of such a utility/module/{...}. :)

      Honestly, I can't for the life of me understand why this sort of thing hasn't already been solved, which is why I felt it was worth all the work and research likely involved. After all, the world is now very much a world of utf-8. It's no longer a "concept".

      Thank you again, for the resources, and reply, Jim ++

      --Chris

      ˇλɐp ʇɑəɹ⅁ ɐ əʌɐɥ puɐ ʻꜱdləɥ ꜱᴉɥʇ ədoH
