PerlMonks  

Is utf8, ascii ?

by rootcho (Pilgrim)
on Aug 07, 2007 at 17:22 UTC (#631099=perlquestion)

rootcho has asked for the wisdom of the Perl Monks concerning the following question:

hi,
I had a problem with PostgreSQL while trying to insert some non-standard characters (I think some sort of Japanese).
DBD::Pg just died :( complaining that the data was not UTF8. (It is a script that parses files and inserts into the DB.)

So what I've decided to do is check in some way for utf8-ness of the string before inserting. I tried with Test::utf8 and later with Encode::Guess::guess_encoding($str).

My question is: is this the best and fastest way to check that the string will not break the DB insert? Will I miss some other charsets this way that would otherwise work fine?

The DB is set to UTF8 of course.

Replies are listed 'Best First'.
Re: Is utf8, ascii ?
by atemon (Chaplain) on Aug 07, 2007 at 18:00 UTC

    Hi,

    Did you try Encode? DBD::Pg itself recommends Encode in its documentation. Encode explains how to encode and decode strings and set flags for use with DBD::Pg, and it also supports turning the UTF8 flag on and off on an encoded string. Encode::Guess is available too, but I think you have more control over the string when you encode it explicitly with Encode. Hope this helps.

    Cheers !

    --VC



    There are three sides to any argument.....
    your side, my side and the right side.

Re: Is utf8, ascii ?
by clinton (Priest) on Aug 07, 2007 at 19:01 UTC
    From the core utf8 module, you can use:

    utf8::valid($string)

    But presumably you don't want to just discard data. Instead, you want to convert it to UTF8 and insert it safely. If you know what character set it is in, then use Encode to convert it. Otherwise, as you have done, you can use Encode::Guess to try to figure out what character set it is first.
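
    A caveat on the check itself: as I read the utf8 documentation, utf8::valid tests whether a string is internally consistent, and it returns true for any plain byte string, so it may not catch bad bytes read straight from a file. A minimal sketch of a stricter check, assuming the data arrives as raw bytes, is to attempt a strict decode with Encode and see whether it dies. The helper name is_valid_utf8 is hypothetical, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return 1 if $bytes is a well-formed UTF-8 byte sequence, 0 otherwise.
# is_valid_utf8 is a hypothetical helper name used only for this sketch.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # "UTF-8" (with the hyphen) selects Encode's strict decoder.
    # FB_CROAK makes decode die on the first bad sequence;
    # LEAVE_SRC keeps $bytes unmodified.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Plain ASCII, UTF-8 encoded "café", and Latin-1 encoded "café".
for my $candidate ("plain ascii", "caf\xC3\xA9", "caf\xE9") {
    print is_valid_utf8($candidate) ? "ok to insert\n" : "discard\n";
}
```

    Strings that fail the check can be discarded for now, or handed to Encode::Guess later to find out what they really are.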

    Clint

      I see.
      I'm new to this encoding stuff, but now I understand: check, guess, try to encode, and if that fails, discard.
      At the moment I just want to discard; later, when I have time, I will do more tests.
      But my next question was: if I check for a valid utf8 string and discard the rest, will this discard the string if it is ascii?
        No. U+0000 to U+007F (the first 128 Unicode characters) are represented in UTF8 by one byte, the same byte that is used in ASCII. So ASCII (7-bit ASCII, not e.g. ISO-8859-* or WINDOWS-1252) is a subset of UTF8.
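
        To see this for yourself, here is a small sketch that prints the UTF-8 bytes for a few characters. The helper name utf8_hex is made up for the illustration. Characters up to U+007F come out as a single byte identical to their ASCII value, while anything above takes two or more bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 encoding of a single character as space-separated hex bytes.
# utf8_hex is a hypothetical helper name used only for this sketch.
sub utf8_hex {
    my ($char) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, encode('UTF-8', $char);
}

# "A", DEL, Latin small e with acute, Hiragana letter A
for my $char ("A", "\x{7F}", "\x{E9}", "\x{3042}") {
    printf "U+%04X -> %s\n", ord($char), utf8_hex($char);
}
```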
Re: Is utf8, ascii ?
by graff (Chancellor) on Aug 07, 2007 at 22:22 UTC
    I've posted a couple of unicode-related utilities here at the monastery: unichist -- count/summarize characters in data and tlu -- TransLiterate Unicode. The first one might be enough for you to figure out what sort of data you have in your files.

    If the file data is already in utf8, you should be able to do

    unichist -x file.name
    and that would show you all the distinct unicode characters in the file, one per line (with frequency of occurrence and hex code-point value for each character).

    But if you see lots of "Malformed UTF-8" messages, the data is encoded in some other (non-unicode) character set. You can use a command line option to try different encodings on input until you hit on the one that works for your data (the script uses Encode to apply input decoding if the "-r enc" option is given):

    unichist -x -r euc-jp file.name
    ...
    # if you see errors or lots of "FFFD" characters, you guessed wrong
    unichist -x -r shiftjis file.name
    ...
    The Encode man page tells how to get a listing of available character sets (or you can look at yet another tool I posted -- grepp -- Perl version of grep -- to see how to list the encodings).
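
    The same trial-and-error idea can be sketched directly with Encode, without the unichist script. This is only an illustration: the helper name decodable_as and the candidate list are assumptions, and the sample bytes are a hand-made Shift_JIS example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return the candidate encodings under which $bytes decodes without error.
# decodable_as is a hypothetical helper name; pass whatever encodings you suspect.
sub decodable_as {
    my ($bytes, @candidates) = @_;
    return grep {
        my $enc = $_;
        # A failed decode (or an unknown encoding name) dies inside the eval,
        # so the candidate is dropped from the result.
        eval { decode($enc, $bytes, FB_CROAK | LEAVE_SRC); 1 }
    } @candidates;
}

my $sample = "\x82\xA0";    # Hiragana letter A in Shift_JIS
print join(', ', decodable_as($sample, qw(UTF-8 euc-jp shiftjis))), "\n";
```

    Keep in mind that a clean decode only says the bytes are legal in that encoding, not that the guess is right. Single-byte encodings in particular accept almost anything, so it is still worth eyeballing the decoded text.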
Re: Is utf8, ascii ?
by tbone1 (Monsignor) on Aug 07, 2007 at 17:53 UTC
    Maybe you could just do a search for any character that isn't in the range 0x00 to 0x7f? If you find one, it's not ASCII. Just a thought (and possibly a bad one).
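
    A rough sketch of that search with a regex, keeping in mind that 7-bit ASCII only covers 0x00 to 0x7F. The helper name is_ascii is made up for the example.

```perl
use strict;
use warnings;

# Return 1 if every character in $str is in the 7-bit ASCII range, 0 otherwise.
# is_ascii is a hypothetical helper name used only for this sketch.
sub is_ascii {
    my ($str) = @_;
    return $str !~ /[^\x00-\x7F]/ ? 1 : 0;
}

print is_ascii("hello")   ? "ascii\n" : "not ascii\n";
print is_ascii("caf\xE9") ? "ascii\n" : "not ascii\n";
```

    This only tells you the string is safe as-is. A string that fails the test may still be perfectly good UTF-8, so it would need a UTF-8 check before being thrown away.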

    --
    tbone1, YAPS (Yet Another Perl Schlub)
    And remember, if he succeeds, so what.
    - Chick McGee
