PerlMonks  

How do you open/read a text, without knowing its encoding, and remove any BOM if it's UTF? What do you use?

by Anonymous Monk
on Apr 21, 2013 at 08:11 UTC ( #1029735=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

How do you open/read a text, without knowing its encoding, and remove any BOM if it's UTF? What do you use?

use ...; my $text = ...($file);

There are lots of half-options (Encode::Guess, File::BOM, Encode::Detective...) out there, but I don't know of a single function like this. Do you know of one?

Re: How do you open/read a text, without knowing its encoding, and remove any BOM if it's UTF? What do you use?
by Khen1950fx (Canon) on Apr 21, 2013 at 08:55 UTC
    To remove a BOM from a file, use String::BOM.
    #!/usr/bin/perl -l
    use strict;
    use warnings;
    use String::BOM qw(strip_bom_from_file);

    my $file = '/path/to/file';
    print strip_bom_from_file($file);
    Prints 1 on success. Uses $! on failure.
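For readers who want to see what BOM stripping involves without an extra module, here is a minimal core-Perl sketch. It is not how String::BOM is implemented; `split_bom` and the `@BOMS` table are hypothetical names made up for this illustration. It works on a string of raw bytes (for example, a file read through the `:raw` layer) and returns the encoding implied by the BOM, if any, plus the bytes with the BOM removed.

```perl
use strict;
use warnings;

# Known byte order marks. Longer signatures come first, because the
# UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
my @BOMS = (
    [ 'UTF-32BE' => "\x00\x00\xFE\xFF" ],
    [ 'UTF-32LE' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-8'    => "\xEF\xBB\xBF"     ],
    [ 'UTF-16BE' => "\xFE\xFF"         ],
    [ 'UTF-16LE' => "\xFF\xFE"         ],
);

# Hypothetical helper: takes raw bytes, returns (encoding name, bytes
# without the BOM). The name is undef when no BOM is present.
sub split_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($name, $bom) = @$entry;
        return ($name, substr($bytes, length($bom)))
            if index($bytes, $bom) == 0;
    }
    return (undef, $bytes);
}
```

Something like `my ($enc, $rest) = split_bom($raw);` followed by `Encode::decode($enc, $rest)` would then give the text whenever `$enc` is defined. One caveat of this sketch: UTF-16LE text whose first character is U+0000 is indistinguishable from a UTF-32LE BOM.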
      Typical Khen1950fx: ignores the answer in the question, ignores the question, posts broken links.
        Khen's link may be broken, but String::BOM is a good solution.

        Your criticism is easy, but not helpful.

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        My blog: Imperial Deltronics

      Hi,

      Encode::Guess does a fine job of detecting the encoding. Read its documentation on CPAN carefully. To detect the encoding you can use something like:

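      use strict;
      use warnings;
      use Encode::Guess;

      my $file = 'yourfile';
      open(my $in, '<:raw', $file) or die "Cannot open $file: $!";
      my $bigstring = do { local $/; <$in> };    # slurp the raw bytes
      close $in;

      # guess() returns an encoding object on success, an error string on failure
      my $guess = Encode::Guess->guess($bigstring);
      die "Cannot guess encoding: $guess\n" unless ref $guess;
      print "My file content encoding is: ", $guess->name, "\n";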

      Now you can decode your data and re-encode it in the encoding you want. You need a consistent strategy for this. I recommend keeping it in UTF-8 or UTF-16, depending on the case. If you face BOM issues, String::BOM is a good solution.
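The decode-then-re-encode step could look roughly like the sketch below. It assumes Encode::Guess can settle on a single candidate; `bytes_to_utf8` is a hypothetical name, and the optional suspect list is passed straight through to `guess_encoding`. The UTF-16/32 decoders consume a leading BOM themselves, while the UTF-8 decoder leaves it in the text as U+FEFF, so the sketch strips that character explicitly.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: takes raw bytes plus optional extra suspect
# encodings and returns the same text as UTF-8 bytes without a BOM.
sub bytes_to_utf8 {
    my ($bytes, @suspects) = @_;

    # On failure guess_encoding() returns an error string, not an object.
    my $guess = guess_encoding($bytes, @suspects);
    die "Cannot guess encoding: $guess\n" unless ref $guess;

    my $text = $guess->decode($bytes);    # now a Perl character string
    $text =~ s/\A\x{FEFF}//;              # drop a BOM the decoder kept
    return encode('UTF-8', $text);
}
```

Without any suspects, Encode::Guess only considers ASCII, UTF-8 and BOM-marked UTF-16/32, so legacy single-byte or CJK encodings have to be named explicitly by the caller.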

      The following might help further: http://perldoc.perl.org/perluniintro.html

      K

      The best medicine against depression is a cold beer!
Re: How do you open/read a text, without knowing its encoding, and remove any BOM if it's UTF? What do you use?
by graff (Chancellor) on Apr 21, 2013 at 23:31 UTC
    How do you open/read a text, without knowing its encoding, and remove any BOM if it's UTF? What do you use?

    I use trial-and-error. I first try to treat it as utf8; if that doesn't throw an error, I'm done. (Also, utf8 might be the most likely outcome anyway.) If the text is not utf8 and contains any non-ASCII bytes, trying to read it as utf8 will almost always fail, and I'll know that it's some other encoding.
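A minimal sketch of that first trial, using the core Encode module. `try_utf8` is a hypothetical name; it assumes the file has already been read as raw bytes, and it returns the decoded text or undef when the bytes are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strict UTF-8 trial decode.
# Returns the decoded character string, or undef if the bytes are
# not well-formed UTF-8.
sub try_utf8 {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # it from modifying $bytes while it works.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return unless defined $text;

    $text =~ s/\A\x{FEFF}//;    # remove a leading BOM, if there is one
    return $text;
}
```

Pure ASCII input passes this check too, which is harmless because ASCII is a subset of UTF-8. A rare legacy-encoded file can also happen to form valid UTF-8 sequences, so a success here is strong evidence rather than proof.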

    In the latter case, I hope I have some idea of what (human) language the text is supposed to contain, because that will guide how I check for other encodings.

    For example, if the language is not Chinese, Japanese or Korean (CJK), the writing system will be one or another alphabet set, usually requiring fewer than 128 distinct code points; in this case, a UTF-16 encoding will have a rather lopsided byte histogram, because half the bytes (the ones for the upper 8 bits of each character) will have a very limited distribution of values: lots of nulls, and (depending on the language) lots of, say, 0x06 (if it's Arabic) or 0x04 (if it's Cyrillic), etc. Seeing whether these values occur at even or odd byte offsets will reveal whether the UTF-16 is BE or LE.
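The even/odd histogram check might be sketched as follows. The function names and the 0.6 threshold are assumptions made for this illustration, not values from the post: for each byte position parity, the code measures how much of that half is taken up by its single most common byte value, and treats the clearly lopsided half as the high-order bytes.

```perl
use strict;
use warnings;

# Share (0..1) of the most frequent byte value among the bytes at
# offsets $start, $start + 2, $start + 4, ...
sub top_byte_share {
    my ($bytes, $start) = @_;
    my %count;
    my $total = 0;
    for (my $i = $start; $i < length($bytes); $i += 2) {
        $count{ substr($bytes, $i, 1) }++;
        $total++;
    }
    return 0 unless $total;
    my ($max) = sort { $b <=> $a } values %count;
    return $max / $total;
}

# Hypothetical helper: returns 'UTF-16BE', 'UTF-16LE', or undef when
# neither half of the byte stream looks lopsided enough.
sub guess_utf16_byte_order {
    my ($bytes, $threshold) = @_;
    $threshold //= 0.6;    # assumed cut-off; tune on real data
    my $even = top_byte_share($bytes, 0);
    my $odd  = top_byte_share($bytes, 1);
    return 'UTF-16BE' if $even >= $threshold && $even > $odd;
    return 'UTF-16LE' if $odd  >= $threshold && $odd  > $even;
    return;
}
```

This only makes sense for alphabetic scripts, as described above; CJK text in UTF-16 spreads its high-order bytes over many values and will usually come back undef.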

    If the text is supposed to be CJK (and it isn't utf8), I'll go right to Encode::Guess. Likewise if the text is clearly not a 16-bit encoding (i.e. it's not CJK, not UTF-16, and not utf8).

    You could probably rely more heavily on Encode::Guess for more of the scenarios, in order to reduce the manual effort. But there are bound to be cases where you really just need to have a human involved (ideally one who knows the language being used in the text).
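When the language is known, Encode::Guess can be given a short list of suspect encodings, which is the pattern its documentation shows for Japanese. The sketch below wraps that call; `guess_japanese` is a hypothetical name, and the suspect list would need to change for Chinese or Korean text.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: returns the name of the one suspect encoding
# that fits the bytes, or undef if none fits or several still do.
sub guess_japanese {
    my ($bytes) = @_;
    my $guess = guess_encoding($bytes, qw(euc-jp shiftjis 7bit-jis));

    # A non-reference result is an error message such as a list of
    # candidates that could not be told apart.
    return ref($guess) ? $guess->name : undef;
}
```

Short inputs are often ambiguous between euc-jp and shiftjis, so an undef result is the point where the human check mentioned above comes in.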

    Bigram statistics for each "language/encoding" tuple serve well as a discriminator, but this depends on having reliable training data for each tuple. If you happen to be dealing with a closed set of possible input types, and just need an automatic way to differentiate between them, you only need a few hundred KB of text per language/encoding tuple to get fairly distinctive bigram statistics.

    In effect, in languages that use single-byte encodings, pair-wise byte sequences fall into fairly predictable rankings in terms of frequency of occurrence, and the rankings are distinct from one language to the next. Extending this to CJK would involve a larger quantity of training data, and/or doing statistics on 4-byte sequences (i.e. pairings of 16-bit characters).
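A rough sketch of the byte-bigram idea, under the assumption of a simple log-likelihood score; the post does not say which scoring method is used, and all function names and the 1e-6 floor for unseen bigrams are made up for this illustration. Each training sample yields a table of relative bigram frequencies, and an unknown sample is assigned to the profile under which its bigrams are most probable.

```perl
use strict;
use warnings;

# Build a profile: relative frequency of every two-byte sequence.
sub bigram_profile {
    my ($bytes) = @_;
    my %count;
    my $total = 0;
    for my $i (0 .. length($bytes) - 2) {
        $count{ substr($bytes, $i, 2) }++;
        $total++;
    }
    return { map { $_ => $count{$_} / $total } keys %count };
}

# Sum of log-probabilities of the sample's bigrams under a profile.
# Unseen bigrams get a small floor probability (an assumed value).
sub bigram_score {
    my ($profile, $bytes, $floor) = @_;
    $floor //= 1e-6;
    my $score = 0;
    for my $i (0 .. length($bytes) - 2) {
        my $p = $profile->{ substr($bytes, $i, 2) } // $floor;
        $score += log($p);
    }
    return $score;
}

# Return the label of the best-scoring profile.
# $profiles is a hashref: "language/encoding" label => profile.
sub classify {
    my ($profiles, $bytes) = @_;
    my %score = map { $_ => bigram_score($profiles->{$_}, $bytes) }
                keys %$profiles;
    my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
    return $best;
}
```

In practice each profile would be built from the few hundred KB of training text per tuple mentioned above, with labels such as `ru/koi8-r` or `ru/cp1251`.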
