PerlMonks  

Composite Charset Data to UTF8?

by AlexTape (Monk)
on Jun 18, 2013 at 10:21 UTC ( #1039536=perlquestion )
AlexTape has asked for the wisdom of the Perl Monks concerning the following question:

Dear omniscient monks,

I am trying to convert big text files with composite charsets to one consistent UTF-8 encoding.

Anyway, my investigation into this topic has run into a black hole of nescience.. What's the best way to do it, especially with Perl?

Perhaps you can give me hints or some "simple" explanations of how you would do it? I know that there are CPAN modules to identify "non-utf8" chars, but on which level? Is it sensible to take the binary way or to make a comparison on the hexadecimal level?

This is the first time I have really gotten involved with Perl in the whole charset jungle..

I'm still mindmapping ;))

kindly, perlig


---- UPDATE ----

OK. Maybe the input looks like this:

Text file with 100 chars:
40 of them are Italian (it) in iso-8859-1 or windows-1252
20 of them are Greek (el) in iso-8859-7
all others are UTF-8

(see e.g. http://www.w3.org/International/O-charset-lang.html)

Now I want to process this data.. but my parser is only able to read UTF-8. For that I have to convert these 60 "non-utf8" chars to UTF-8 in some way..

Got it? :)

I'm nearly overstrained :P Can you maybe tell me something about the existing guessing modules?!

kindly perlig

$perlig =~ s/pec/cep/g if 'errors expected';

Re: Composite Charset Data to UTF8?
by Corion (Pope) on Jun 18, 2013 at 10:28 UTC

    What do you mean by "composite charset"?

    The only sane approach is to Encode::decode all data as you read it into Perl, and to Encode::encode the data to the intended target encoding as you write it.
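
A minimal sketch of that decode-on-input, encode-on-output pattern, assuming the source encoding of each chunk is already known. The helper name `to_utf8_octets` and the sample strings are made up for illustration; `decode`, `encode` and `Encode::FB_CROAK` come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn octets in a known source encoding into UTF-8 octets.
sub to_utf8_octets {
    my ($octets, $source_encoding) = @_;
    # FB_CROAK makes decode/encode die on unmappable input instead of
    # silently substituting a replacement character.
    my $text = decode($source_encoding, $octets, Encode::FB_CROAK);
    return encode('UTF-8', $text, Encode::FB_CROAK);
}

my $italian = to_utf8_octets("caf\xE9",      'iso-8859-1');   # e-acute in Latin-1
my $greek   = to_utf8_octets("\xE1\xE2\xE3", 'iso-8859-7');   # alpha beta gamma

# Show the resulting UTF-8 octets as hex.
printf "%s\n", join ' ', map { sprintf '%02x', ord } split //, $italian;
printf "%s\n", join ' ', map { sprintf '%02x', ord } split //, $greek;
```

This only covers the easy half of the problem: it still needs to be told which encoding each piece of input is in.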

    If you don't know the input encoding yet, you have to either use the existing guessing modules or come up with a way of your own to find the "best" possible input encoding of your file(s). For example if you have a dictionary of your source language, you can guess the encoding of a document by finding certain byte sequences that correspond to a word/phrase in that source language. There is very little we can do here without further information.
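
As one example of an existing guessing module, here is a small sketch using the core Encode::Guess. The wrapper `guess_name` and the sample data are assumptions for illustration. Note how quickly the guess becomes ambiguous once two single-byte encodings are both suspects, since almost any byte string decodes in both.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical wrapper: name of the guessed encoding, or undef when the
# guess fails or is ambiguous (guess_encoding then returns an error string).
sub guess_name {
    my ($octets, @suspects) = @_;
    my $enc = guess_encoding($octets, @suspects);
    return ref($enc) ? $enc->name : undef;
}

# Valid UTF-8 is recognised with the default suspects (ascii, utf8).
my $utf8_guess = guess_name("caf\xC3\xA9 au lait");

# Not valid UTF-8, and only one legacy suspect offered: unambiguous.
my $latin1_guess = guess_name("caf\xE9 au lait", 'iso-8859-1');

# Two single-byte suspects both accept these bytes: no usable answer.
my $ambiguous = guess_name("\xE1\xE2\xE3", 'iso-8859-1', 'iso-8859-7');

print defined $_ ? "$_\n" : "undecided\n"
    for $utf8_guess, $latin1_guess, $ambiguous;
```

So a byte-level guesser alone cannot tell Italian Latin-1 from Greek iso-8859-7; some knowledge about the expected text is needed on top.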

    Update: According to your update, you have not exactly mojibake but still a horrible mess of encodings. Maybe you can still employ the approach of having well-known words/phrases to determine where a new encoding starts, but it will be much, much uglier and harder.
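
A rough sketch of the well-known-words idea, under the strong assumption that each line is in exactly one encoding (the thread does not say this holds). The word lists, the candidate encodings and the name `line_to_utf8` are hypothetical; a real job would need much larger dictionaries and a plan for lines that stay undecided.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical word lists, written as \x{...} escapes, per candidate encoding.
my %known_words = (
    'iso-8859-1' => [ "perch\x{E9}", "citt\x{E0}" ],    # Italian: perche, citta (accented)
    'iso-8859-7' => [ "\x{3BA}\x{3B1}\x{3B9}" ],        # Greek: kai
);

sub line_to_utf8 {
    my ($octets) = @_;

    # 1. Well-formed UTF-8 (including plain ASCII) passes through unchanged.
    my $probe = $octets;    # decode with a CHECK value may modify its argument
    return $octets if eval { decode('UTF-8', $probe, Encode::FB_CROAK); 1 };

    # 2. Otherwise take the first candidate whose decoding contains a known word.
    for my $enc (sort keys %known_words) {
        my $copy = $octets;
        my $text = eval { decode($enc, $copy, Encode::FB_CROAK) };
        next unless defined $text;
        if (grep { index($text, $_) >= 0 } @{ $known_words{$enc} }) {
            return encode('UTF-8', $text);
        }
    }
    return;    # undecided: the caller has to handle this line some other way
}
```

Lines that match no word list come back as undef, which is where the "much, much uglier" part starts.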

      topic update :)
      $perlig =~ s/pec/cep/g if 'errors expected';
Re: Composite Charset Data to UTF8?
by Khen1950fx (Canon) on Jun 18, 2013 at 15:02 UTC
    Encode::StdIO is what you're looking for. For example:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode::StdIO encoding => 'utf-8';
    Your STDOUT and STDERR will automatically be encoded in utf8.

    Also, note that the author recommends Term::Encoding, so I would install that first, then Encode::StdIO.

      OK, that's like my first approach:
      use utf8;
      use open ':std', ':encoding(UTF-8)';
      use open IO => ':encoding(UTF-8)';
      but OK.. I get an internal error like this:
      utf8 "\xA9" does not map to Unicode at /usr/local/share/perl/5.14.2/XML/Tidy.pm line 780.
      utf8 "\xAE" does not map to Unicode at /usr/local/share/perl/5.14.2/XML/Tidy.pm line 782.

      Anyway, that is not really part of the problem.. Has anybody got a quick solution to test a file for a constant charset? E.g. true/false for whether the file is UTF-8 or not?! Can I say that the file is UTF-8 after utf8::decode($_) or die "Input is not valid UTF-8";    just to tell whether there is more than one charset in the file or not??? Or is it part of the problem?!

      kindly perlig
      $perlig =~ s/pec/cep/g if 'errors expected';

        Have a look at the encoding rules of UTF-8.

        A valid UTF-8 sequence starts either with 0b0xxxxxxx or with 0b11xxxxxx. So an octet of the form 0b10xxxxxx cannot start a sequence; outside a multi-byte sequence it is invalid UTF-8:

        > perl -wle "print sprintf '%08b', $_ for (0xa9,0xae)"
        10101001
        10101110

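        An untested easy check could be to match your string against /[\x80-\xBF]/. Note that these bytes also appear as continuation bytes inside valid multi-byte sequences, so a match is only conclusive where no lead byte precedes it. The range comes from the hex representations of the bit patterns we've identified: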

        > perl -wle "print sprintf '%08b - %02x', $_,$_ for (0b10000000,0b10111111)"
        10000000 - 80
        10111111 - bf
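
For the true/false question asked earlier in the thread, one possible sketch is to attempt a strict decode and see whether it dies. The function name `is_valid_utf8` is made up here; it relies on Encode's strict 'UTF-8' encoding together with `FB_CROAK`, and it only says whether the octets are well-formed UTF-8, not which other charset they might be in.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $octets is well-formed UTF-8, false otherwise.
sub is_valid_utf8 {
    my ($octets) = @_;
    # 'UTF-8' (with hyphen) is Encode's strict variant; FB_CROAK makes decode
    # die on a malformed sequence and LEAVE_SRC keeps $octets untouched.
    my $ok = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

for my $sample ("plain ascii", "caf\xC3\xA9 au lait", "caf\xE9 au lait", "\xA9 2013") {
    printf "%-5s %s\n",
        is_valid_utf8($sample) ? 'ok' : 'bad',
        join ' ', map { sprintf '%02x', ord } split //, $sample;
}
```

Run per line rather than per file, the same test narrows down where the non-UTF-8 data sits.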
