
Re: Composite Charset Data to UTF8?

by Corion (Pope)
on Jun 18, 2013 at 10:28 UTC

in reply to Composite Charset Data to UTF8?

What do you mean by "composite charset"?

The only sane approach is to Encode::decode all data as you read it into Perl, and to Encode::encode the data to the intended target encoding as you write it.
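A minimal sketch of that decode-on-input, encode-on-output pattern. The source encoding (ISO-8859-1) and the sample bytes are assumptions for illustration only; substitute whatever your data really uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the incoming bytes are ISO-8859-1 (Latin-1).
my $bytes = "caf\xE9";                      # "café" as Latin-1 bytes

# Input boundary: bytes -> Perl character string.
my $text = decode('ISO-8859-1', $bytes);

# ... work on $text as characters here ...

# Output boundary: characters -> bytes in the target encoding.
my $utf8 = encode('UTF-8', $text);

printf "%d characters, %d UTF-8 bytes\n", length($text), length($utf8);
# prints "4 characters, 5 UTF-8 bytes"
```

When reading from files, the same thing can be done with an I/O layer such as `open my $fh, '<:encoding(ISO-8859-1)', $file`, so that the decoding happens as the data is read.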

If you don't know the input encoding yet, you have to either use one of the existing guessing modules (Encode::Guess, for example) or come up with a way of your own to find the "best" possible input encoding of your file(s). For example, if you have a dictionary of your source language, you can guess the encoding of a document by looking for byte sequences that correspond to a word or phrase in that language. There is very little we can do here without further information.
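A rough sketch of the dictionary idea, under stated assumptions: `guess_encoding_by_words` is a hypothetical helper (not part of any module), and the candidate list and word list are made up for the example. It decodes the bytes with each candidate encoding and keeps the one under which the most known words appear.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: score every candidate encoding by the number of
# known words found in the decoded text; return the best one (or undef).
sub guess_encoding_by_words {
    my ($bytes, $candidates, $words) = @_;
    my ($best, $best_score) = (undef, 0);
    for my $enc (@$candidates) {
        # FB_CROAK makes decode die on malformed input. Work on a copy,
        # because decode may modify its argument when CHECK is set.
        my $copy = $bytes;
        my $text = eval { decode($enc, $copy, Encode::FB_CROAK) };
        next unless defined $text;
        # grep in scalar context counts the words that occur in $text.
        my $score = grep { index($text, $_) >= 0 } @$words;
        ($best, $best_score) = ($enc, $score) if $score > $best_score;
    }
    return $best;
}

# Assumed dictionary: "Grüße" and "München" as character strings.
my @words = ("Gr\x{FC}\x{DF}e", "M\x{FC}nchen");

# Sample input: the same text as Latin-1 bytes.
my $bytes = "Gr\xFC\xDFe aus M\xFCnchen";

my $guess = guess_encoding_by_words(
    $bytes, ['UTF-8', 'ISO-8859-1', 'ISO-8859-5'], \@words);
print defined $guess ? "$guess\n" : "no match\n";
```

Here the strict UTF-8 decode fails outright and the Cyrillic decoding contains none of the words, so only ISO-8859-1 scores. With real data you would want a much larger word list and some way of handling ties.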

Update: Judging by your update, what you have is not exactly mojibake, but it is still a horrible mess of encodings. You may still be able to use well-known words or phrases to detect where a new encoding starts, but it will be much, much uglier and harder.

Replies are listed 'Best First'.
Re^2: Composite Charset Data to UTF8?
by AlexTape (Monk) on Jun 18, 2013 at 14:27 UTC
    topic update :)
    $perlig =~ s/pec/cep/g if 'errors expected';
