Re^3: Detect the Charset of an file

by graff (Chancellor)
on Oct 23, 2013 at 05:21 UTC

in reply to Re^2: Detect the Charset of an file
in thread Detect the Charset of an file

You said:

I only need some code to detect if the file is already utf8, then we don't do the recode.

The easiest way to check whether your data is utf8 is to read it as "raw" and try decoding it from utf8. If that succeeds, the data is valid utf8 (plain ASCII included). This works well because non-ASCII data in any other encoding will virtually ALWAYS throw an error when you try to interpret it as utf8.

use Encode;

open( my $fh, "<:raw", $filename ) or die "Cannot open $filename: $!";
local $/;    # slurp mode: read the whole file in one go
$_ = <$fh>;
eval { $_ = decode( 'utf8', $_, Encode::FB_CROAK ) };
if ( $@ ) {
    print "$filename is NOT UTF8\n";
}
else {
    print "$filename IS UTF8\n";
}

Note that when given an ASCII file, the above will say "$filename IS UTF8", which of course is true, since ASCII is a subset of utf8.
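To see why the decode test is reliable, here is a minimal sketch that runs the same check on three in-memory byte strings instead of a file. The helper name looks_like_utf8 and the sample strings are made up for illustration; they are not part of the snippet above. It also swaps in the strict 'UTF-8' encoding name rather than the lax 'utf8', on the assumption that you want surrogates and out-of-range code points rejected as well.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original post): returns 1 if the
# byte string decodes cleanly as strict UTF-8, 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        # FB_CROAK makes decode() die on the first malformed sequence;
        # LEAVE_SRC stops decode() from modifying $bytes in place.
        decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
        1;
    };
    return $ok ? 1 : 0;
}

# Illustrative samples: the same phrase as ASCII, UTF-8, and Latin-1 bytes.
my %samples = (
    'plain ASCII'     => "cafe au lait",
    'UTF-8 e-acute'   => "caf\xC3\xA9 au lait",
    'Latin-1 e-acute' => "caf\xE9 au lait",
);

for my $name ( sort keys %samples ) {
    printf "%-16s %s\n", $name,
        looks_like_utf8( $samples{$name} ) ? 'IS UTF8' : 'is NOT UTF8';
}
```

The Latin-1 sample fails because the byte 0xE9 announces a three-byte UTF-8 sequence, and the space that follows is not a valid continuation byte. That is the typical failure mode for single-byte encodings.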

UPDATE: Just noticed a missing semi-colon at the end of the eval block -- fixed it.

Re^4: Detect the Charset of an file
by endymion (Acolyte) on Oct 23, 2013 at 06:48 UTC
    Hello graff, I tried your great stuff, but I get another bug. I'll now try calling file -i via system.
      Sorry about the problem. I just noticed that I had left out a semi-colon when I first posted that snippet -- that's fixed now, in case you want to try again.
Re^4: Detect the Charset of an file
by Anonymous Monk on Oct 24, 2013 at 13:47 UTC
    No problem. I had spotted it myself and fixed it in the script. Maybe your great script will help others with the same problem. I solved it by calling file -i via system; it works great, and I have no more problems with the XML parser. Thanks for your great help.

