PerlMonks  

Re: Composite Charset Data to UTF8?

by Corion (Pope)
on Jun 18, 2013 at 10:28 UTC (#1039537)


in reply to Composite Charset Data to UTF8?

What do you mean by "composite charset"?

The only sane approach is to Encode::decode all data as you read it into Perl, and to Encode::encode the data to the intended target encoding as you write it.
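As a minimal sketch of that decode-on-input, encode-on-output pattern: the helper name `transcode` and the choice of ISO-8859-1 as the source encoding are assumptions made for illustration only, not something taken from the original question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: bytes in encoding $from -> bytes in encoding $to.
sub transcode {
    my ($bytes, $from, $to) = @_;
    # FB_CROAK makes malformed input fatal; LEAVE_SRC keeps the source intact.
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    # Step 1: decode the raw bytes into Perl's internal character string.
    my $chars = decode($from, $bytes, $check);
    # Step 2: encode the characters into the target encoding.
    return encode($to, $chars, $check);
}

# Assumption for this example: the input is ISO-8859-1.
my $latin1 = "caf\xE9";                                  # "café" as Latin-1 bytes
my $utf8   = transcode($latin1, 'ISO-8859-1', 'UTF-8');  # "caf\xC3\xA9"
```

In a real program the same two steps usually happen at the file boundary, for instance with I/O layers such as `open my $fh, '<:encoding(ISO-8859-1)', $file`, so that everything in between works on characters.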

If you don't know the input encoding yet, you have to either use the existing guessing modules or come up with a way of your own to find the "best" possible input encoding of your file(s). For example, if you have a dictionary of your source language, you can guess the encoding of a document by looking for byte sequences that correspond to a word or phrase in that language. There is very little we can do here without further information.
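The dictionary idea could look roughly like the sketch below. The function name `best_encoding_by_words`, the candidate list and the word list are all hypothetical; a real word list would come from your own data. Encode::Guess, which ships with Encode, is one of the existing guessing modules and may be enough on its own for simpler cases.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the bytes under each candidate encoding and
# return the candidate whose result contains the most known words.
sub best_encoding_by_words {
    my ($bytes, $candidates, $words) = @_;
    my ($best, $best_score) = (undef, 0);
    for my $enc (@$candidates) {
        # Strict decode; a candidate under which the bytes are malformed is skipped.
        my $text = eval { decode($enc, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC) };
        next unless defined $text;
        # Count how many dictionary words occur in the decoded text.
        my $score = grep { index($text, $_) >= 0 } @$words;
        ($best, $best_score) = ($enc, $score) if $score > $best_score;
    }
    return $best;    # undef if no candidate matched any word
}

# Assumed sample data: German words written as characters, input as Latin-1 bytes.
my @words      = ("Gr\x{FC}\x{DF}e", "M\x{FC}nchen");
my @candidates = ('UTF-8', 'ISO-8859-1', 'ISO-8859-5');
my $guess      = best_encoding_by_words("Gr\xFC\xDFe aus M\xFCnchen", \@candidates, \@words);
```

Ties go to the candidate listed first, so closely related encodings such as ISO-8859-1 and cp1252 cannot be told apart unless the word list contains characters on which they differ.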

Update: According to your update, you have not exactly mojibake but still a horrible mess of encodings. Maybe you can still employ the approach of having well-known words/phrases to determine where a new encoding starts, but it will be much, much uglier and harder.
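For data that switches encoding part-way through, one crude starting point is to decode it piece by piece. The sketch below is not the word-based approach described above but a simpler substitute: it assumes the encoding only changes at line boundaries, tries strict UTF-8 on each line and falls back to a single assumed legacy encoding. The name `normalize_lines` and the fallback encoding are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode line by line, strict UTF-8 first, then the
# assumed fallback encoding, and return everything as UTF-8 bytes.
sub normalize_lines {
    my ($bytes, $fallback) = @_;
    my @out;
    # Limit -1 keeps trailing empty fields, so a final newline survives.
    for my $line (split /\n/, $bytes, -1) {
        my $text = eval { decode('UTF-8', $line, Encode::FB_CROAK | Encode::LEAVE_SRC) };
        # Not valid UTF-8: assume the line is in the fallback encoding.
        $text = decode($fallback, $line) unless defined $text;
        push @out, $text;
    }
    return encode('UTF-8', join("\n", @out));
}

# Assumed sample: first line is UTF-8, second line is ISO-8859-1.
my $mixed = "caf\xC3\xA9\ncaf\xE9\n";
my $clean = normalize_lines($mixed, 'ISO-8859-1');
```

A legacy-encoded line that happens to be valid UTF-8 will be misread, and an encoding that changes in the middle of a line is not handled at all, which is where the known-words check would have to come in.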


Re^2: Composite Charset Data to UTF8?
by AlexTape (Monk) on Jun 18, 2013 at 14:27 UTC
    topic update :)
    $perlig =~ s/pec/cep/g if 'errors expected';
