Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
How do you open/read a text , without knowing its encoding, and remove any BOM if its utf, what do you use?

I use trial-and-error. I first try to treat it as utf8; if that doesn't throw an error, I'm done. (Also, utf8 might be the most likely outcome anyway.) If the text is not uft8, trying to read it as utf8 will definitely fail, and I'll know for certain that it's some other encoding.

In the latter case, I hope I have some idea of what (human) language the text is supposed to contain, because that will guide how I check for other encodings.

For example, if the language is not Chinese, Japanese or Korean (CJK), the writing system will be one or another alphabet set, usually requiring less than 128 distinct code points; in this case, a UTF-16 encoding will have a rather lopsided byte histogram, because half the bytes (the ones for the upper 8 bits of each character) will have a very limited distribution of values: lots of nulls, and (depending on the language), lots of, say, 0x06 (if it's Arabic) or 0x04 (if it's Cyrillic), etc. Seeing whether these values occur at even or odd byte offsets will reveal whether the UTF-16 is BE or LE.

If the text is supposed to be CJK (and it isn't utf8), I'll go right to Encode::Guess. Likewise if the text is clearly not a 16-bit encoding (i.e. it's not CJK, not UTF-16, and not utf8).

You could probably rely more heavily on Encode::Guess for more of the scenarios, in order to reduce the manual effort. But there are bound to be cases where you really just need to have a human involved (ideally one who knows the language being used in the text).

Bigram statistics for each "language/encoding" tuple serves well as a discriminator, but this depends on having reliable training data for each tuple. If you happen to be dealing with a closed set of possible input types, and just need an automatic way to differentiate between them, you only need a few hundred KB of text per language/encoding tuple to get fairly distinctive bigram statistics.

In effect, in languages that use single-byte encodings, pair-wise byte sequences fall into fairly predictable rankings in terms of frequency of occurrence, and the rankings are distinct from one language to the next. Extending this to CJK would involve a larger quantity of training data, and/or doing statistics on 4-byte sequences (i.e. pairings of 16-bit characters).


In reply to Re: How do you open/read a text , without knowing its encoding, and remove any BOM if its utf, what do you use? by graff
in thread How do you open/read a text , without knowing its encoding, and remove any BOM if its utf, what do you use? by Anonymous Monk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (8)
As of 2024-03-28 19:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found