Re^2: Composite Charset Data to UTF8?

by AlexTape (Monk)
on Jun 19, 2013 at 11:56 UTC


in reply to Re: Composite Charset Data to UTF8?
in thread Composite Charset Data to UTF8?

OK, that's like my first approach:

use utf8;
use open ':std', ':encoding(UTF-8)';
use open IO => ':encoding(UTF-8)';

But OK, that gives an internal error like this:
utf8 "\xA9" does not map to Unicode at /usr/local/share/perl/5.14.2/XML/Tidy.pm line 780.
utf8 "\xAE" does not map to Unicode at /usr/local/share/perl/5.14.2/XML/Tidy.pm line 782.

Anyway, that is not really the main part of the problem. Does anybody have a quick way to test whether a file uses one consistent charset, e.g. a true/false answer for "is this file UTF-8 or not"?

Can I say that the file is UTF-8 if utf8::decode($_) or die "Input is not valid UTF-8"; succeeds, just to tell whether there is more than one charset in the file? Or is that part of the problem?

kindly perlig
$perlig =~ s/pec/cep/g if 'errors expected';


Re^3: Composite Charset Data to UTF8?
by Corion (Pope) on Jun 19, 2013 at 12:07 UTC

    Have a look at the encoding rules of UTF-8.

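    A valid UTF-8 sequence starts either with 0b0xxxxxxx (plain ASCII) or with a lead byte of the form 0b11xxxxxx. Octets of the form 0b10xxxxxx are continuation bytes, so any such octet that does not follow a lead byte is invalid UTF-8. Both bytes from your error messages have that form: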

    > perl -wle "print sprintf '%08b', $_ for (0xa9,0xae)"
    10101001
    10101110
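The bit patterns above can be turned into a small classifier. This is only a sketch of the leading-bit rule, not a full validator, and classify_octet() is a made-up helper name, not something from the thread or from a module:

```perl
use strict;
use warnings;

# Classify a single octet by its leading bits, following the UTF-8 layout.
# classify_octet() is a hypothetical helper name used only for illustration.
sub classify_octet {
    my ($octet) = @_;
    return 'ascii'        if ($octet & 0x80) == 0x00;   # 0b0xxxxxxx
    return 'continuation' if ($octet & 0xC0) == 0x80;   # 0b10xxxxxx
    # 0b11xxxxxx: shaped like a lead byte. Note that 0xC0, 0xC1 and
    # 0xF5..0xFF also have this shape but never occur in valid UTF-8.
    return 'lead';
}

# Print hex value, bit pattern and class for a few sample octets.
printf "0x%02X %08b %s\n", $_, $_, classify_octet($_)
    for 0x41, 0xA9, 0xAE, 0xC3;
```

Both 0xA9 and 0xAE come out as continuation bytes, which is why they are rejected when nothing resembling a lead byte precedes them.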

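    An untested, easy check could be to match your string against /[\x80-\xBF]/, which is the hex range of the bit pattern we've identified. Note that continuation bytes inside valid multi-byte sequences fall in the same range, so this only singles out stray bytes when the rest of the data is plain ASCII: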

    > perl -wle "print sprintf '%08b - %02x', $_,$_ for (0b10000000,0b10111111)"
    10000000 - 80
    10111111 - bf
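For the true/false question, a stricter alternative to the regex is to let the core Encode module attempt a strict decode and see whether it fails. The following is a minimal sketch under that assumption. The helper names is_valid_utf8() and invalid_lines() are invented for illustration, and the sample strings are made up:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return 1 if $bytes is well-formed UTF-8 as a whole, 0 otherwise.
# is_valid_utf8() is a hypothetical helper name.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # 'UTF-8' (with the hyphen) selects Encode's strict decoder.
    # FB_CROAK makes decode() die on malformed input, and LEAVE_SRC
    # keeps decode() from modifying $bytes.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Return the 1-based numbers of the lines that are not valid UTF-8.
# Splitting on "\n" is safe because 0x0A never occurs inside a
# multi-byte UTF-8 sequence.
sub invalid_lines {
    my ($bytes) = @_;
    my @lines = split /\n/, $bytes, -1;
    return grep { !is_valid_utf8($lines[$_ - 1]) } 1 .. @lines;
}

my $good  = "caf\xC3\xA9\n";               # "cafe" with e-acute, as UTF-8
my $mixed = "caf\xC3\xA9\n\xA9 2013\n";    # line 2 starts with a Latin-1 0xA9

print is_valid_utf8($good)  ? "valid\n" : "invalid\n";
print is_valid_utf8($mixed) ? "valid\n" : "invalid\n";
print "suspect lines: @{[ invalid_lines($mixed) ]}\n";
```

To apply this to a file, the data would have to be read as raw bytes first (for example with the :raw layer), since an :encoding(UTF-8) layer decodes before the check ever runs. A failed check only says the data is not consistently UTF-8. It does not say which other charset the offending lines use.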
