PerlMonks  

Re: Is utf8, ascii ?

by graff (Chancellor)
on Aug 07, 2007 at 22:22 UTC ( [id://631173]=note )


in reply to Is utf8, ascii ?

I've posted a couple of Unicode-related utilities here at the monastery: unichist (count/summarize characters in data) and tlu (TransLiterate Unicode). The first one might be enough for you to figure out what sort of data you have in your files.

If the file data is already in UTF-8, you should be able to run

unichist -x file.name

and that will show you all the distinct Unicode characters in the file, one per line (with frequency of occurrence and hex code-point value for each character).
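
For anyone who wants to see the idea without fetching the script, here is a minimal sketch of that kind of character count. It is not the unichist source; the function name char_histogram and the output layout are my own assumptions. It decodes strictly, so malformed input raises an error instead of being silently replaced.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count characters in a byte string that is assumed to hold UTF-8.
# Returns a hash ref mapping code point (as a number) to frequency.
# char_histogram is a hypothetical helper, not part of unichist.
sub char_histogram {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input
    # instead of substituting U+FFFD.
    my $text = decode('UTF-8', $bytes, Encode::FB_CROAK);
    my %count;
    $count{ ord $_ }++ for split //, $text;
    return \%count;
}

# Sample input: "café café\n" written as raw UTF-8 bytes.
my $hist = char_histogram("caf\xC3\xA9 caf\xC3\xA9\n");

# One line per distinct character: hex code point, then frequency.
printf "%04X\t%d\n", $_, $hist->{$_}
    for sort { $a <=> $b } keys %$hist;
```

Wrapping the call in eval lets you turn the "malformed" error into a simple yes/no check on whether a file is valid UTF-8.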

But if you see lots of "Malformed UTF-8" messages, the data is encoded in some other (non-Unicode) character set. You can use a command line option to try different encodings on input until you hit on the one that works for your data (the script uses Encode to apply input decoding if the "-r enc" option is given):

unichist -x -r euc-jp file.name
...
# if you see errors or lots of "FFFD" characters, you guessed wrong
unichist -x -r shiftjis file.name
...
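
The same guess-and-check loop can be sketched directly with Encode. This is only an illustration of the idea, not what unichist does internally; replacement_count and the sample bytes are assumptions of mine. It relies on decode() substituting U+FFFD for unmappable input in its default mode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode $bytes with the named encoding and count the U+FFFD
# replacement characters in the result. In its default mode decode()
# substitutes U+FFFD for input it cannot map, so a high count
# suggests a wrong guess. replacement_count is a hypothetical helper.
sub replacement_count {
    my ($bytes, $enc) = @_;
    my $text = decode($enc, $bytes);
    my $n = () = $text =~ /\x{FFFD}/g;
    return $n;
}

# Sample input (made up for this sketch): the EUC-JP bytes
# for a three-character Japanese word.
my $bytes = "\xC6\xFC\xCB\xDC\xB8\xEC";

for my $enc (qw(UTF-8 euc-jp shiftjis)) {
    printf "%-10s %d replacement character(s)\n",
        $enc, replacement_count($bytes, $enc);
}
```

A count of zero does not prove the guess is right (a single-byte encoding such as iso-8859-1 will accept any bytes at all), so you still need to look at the decoded characters and see whether they make sense.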
The Encode man page tells how to get a listing of available character sets (or you can look at yet another tool I posted -- grepp -- Perl version of grep -- to see how to list the encodings).
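
For reference, the listing described in the Encode man page comes down to one call. A short sketch (the variable name is mine):

```perl
use strict;
use warnings;
use Encode;

# ":all" asks for every encoding Encode can find,
# including those in modules that are not loaded yet.
my @names = Encode->encodings(':all');
print "$_\n" for @names;
```

Any name printed here should be usable as the encoding argument to Encode's decode().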
