PerlMonks  

Re^3: Composite Charset Data to UTF8?

by Corion (Patriarch)
on Jun 19, 2013 at 12:07 UTC


in reply to Re^2: Composite Charset Data to UTF8?
in thread Composite Charset Data to UTF8?

Have a look at the encoding rules of UTF-8.

A valid UTF-8 sequence starts with either 0b0xxxxxxx or 0b11xxxxxx. An octet of the form 0b10xxxxxx is a continuation octet, so it is invalid as the start of a UTF-8 sequence:

> perl -wle "print sprintf '%08b', $_ for (0xa9,0xae)"
10101001
10101110
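
The same leading-bit rules can be sketched as a small helper that labels a single octet. This is only an illustration: `classify_octet` and its return labels are hypothetical names, and it looks at the leading bits only, not at whether a whole sequence is well formed.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one octet (0..255) by its leading bits.
sub classify_octet {
    my ($octet) = @_;
    return 'ascii'        if ($octet & 0x80) == 0x00;   # 0b0xxxxxxx
    return 'continuation' if ($octet & 0xC0) == 0x80;   # 0b10xxxxxx
    return 'lead2'        if ($octet & 0xE0) == 0xC0;   # 0b110xxxxx
    return 'lead3'        if ($octet & 0xF0) == 0xE0;   # 0b1110xxxx
    return 'lead4'        if ($octet & 0xF8) == 0xF0;   # 0b11110xxx
    return 'invalid';                                   # 0b11111xxx
}

# Print hex, binary and label for a few sample octets.
printf "%02x %08b %s\n", $_, $_, classify_octet($_)
    for (0x41, 0xa9, 0xae, 0xc3, 0xe2);
```

Both 0xa9 and 0xae come out as continuation octets, so neither can open a sequence on its own.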

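An untested easy check could be to match your string against /[\x80-\xBF]/, whose bounds are the hex representations of the lowest and highest 0b10xxxxxx bit patterns. Keep in mind that it matches such octets anywhere in the string, so it also fires on the continuation octets of valid multi-byte sequences: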

> perl -wle "print sprintf '%08b - %02x', $_,$_ for (0b10000000,0b10111111)"
10000000 - 80
10111111 - bf
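
For a stricter test than a character class, one option is to attempt a strict decode and see whether it fails. The following is a minimal sketch using the core Encode module; `is_valid_utf8` is a hypothetical helper name, not something from this thread, and it assumes the input is a string of raw octets.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $octets decodes cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;    # work on a copy of the caller's octets
    my $ok = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps decode() from modifying $octets.
        Encode::decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Compare the character-class check with the strict decode.
for my $sample ("plain", "caf\xc3\xa9", "\xa9 2013") {
    my $has_high = $sample =~ /[\x80-\xBF]/ ? 1 : 0;
    printf "%-10s class=%d valid=%d\n",
        join('', map { sprintf '%02x', ord } split //, $sample),
        $has_high, is_valid_utf8($sample);
}
```

Here the character class fires on "caf\xc3\xa9" even though that string is valid UTF-8, while the strict decode rejects only the sample with the lone 0xa9.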
