Re^2: UTF8 Validity

by menolly (Hermit)
on Feb 22, 2008 at 00:47 UTC (#669440)


in reply to Re: UTF8 Validity
in thread UTF8 Validity

Thanks; that's the kind of pointer I need. Most of my non-ASCII/non-UTF8 data is either in contact data or easily connected to contact data, so I've been trying to guess the charset based on the geographic origin, with mixed results. I definitely have multiple encodings present -- so far, there's cp1251 (Cyrillic), latin1, some form of Japanese, and something I can't identify but have scrubbed out in the source DB.
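A minimal sketch of the approach described above, assuming only the core Encode module: test whether a field is already well-formed UTF-8, and if it is not, decode it with a charset picked from the contact's country. The `%charset_for` table, the country codes, and both function names are hypothetical, made up for illustration; they are not taken from the actual data or schema.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical mapping from a contact's country code to a likely legacy charset.
my %charset_for = (
    RU => 'cp1251',
    DE => 'iso-8859-1',
    JP => 'shiftjis',
);

# Return 1 if the octets are well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Decode octets to a character string, falling back to a guess by country.
sub decode_contact_field {
    my ($octets, $country) = @_;
    return decode('UTF-8', $octets) if is_valid_utf8($octets);
    my $charset = (defined $country && $charset_for{$country}) || 'iso-8859-1';
    return decode($charset, $octets);
}
```

Passing the validity check does not prove the data was meant as UTF-8, and the country lookup is only as good as the contact data, which fits the mixed results mentioned above.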

Replies are listed 'Best First'.
Re^3: UTF8 Validity
by graff (Chancellor) on Feb 22, 2008 at 02:18 UTC
    Encode::Guess is likely to be helpful for figuring out the source encodings for many of the Asian (multi-byte-char) strings, though it might not help much for distinguishing among single-byte encodings. Worth a try.
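For reference, a hedged sketch of how Encode::Guess is typically called with a list of Japanese suspects. `guess_encoding` returns an encoding object on success and a plain error string otherwise (for example when no suspect fits or several do). The wrapper name `guess_and_decode` and the sample data are assumptions for illustration only.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding by default

# Try to guess the encoding of $octets among the default suspects plus @suspects.
# Returns (decoded_text, encoding_name) on success, (undef, error_message) on failure.
sub guess_and_decode {
    my ($octets, @suspects) = @_;
    my $decoder = guess_encoding($octets, @suspects);
    return (undef, $decoder) unless ref $decoder;    # a string means the guess failed
    return ($decoder->decode($octets), $decoder->name);
}

# Usage with placeholder data; real input would be the raw bytes from the database.
my $octets = "plain ASCII sample";
my ($text, $info) = guess_and_decode($octets, qw/euc-jp shiftjis 7bit-jis/);
print defined $text ? "decoded as $info\n" : "could not guess: $info\n";
```

Listing several single-byte encodings as suspects tends to end in an ambiguous result, since almost any byte string is valid in each of them, which is the limitation noted above.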

      Encode::Guess is lame because the user has to tell it which encodings the data might be in.

      Use Encode::Detect instead. This is the same detector used in Mozilla browsers.

        I've been using Encode::Guess, but have had trouble building a suspects list for some data. However, Firefox hasn't been able to appropriately handle the problem data, either, so if Encode::Detect is the same method, I doubt it would've done any better on this data.
