PerlMonks  

Re^3: Detect the Charset of an file

by graff (Chancellor)
on Oct 23, 2013 at 05:21 UTC


in reply to Re^2: Detect the Charset of an file
in thread Detect the Charset of an file

You said:

I only need some code to detect if the file is already utf8, then we don't do the recode.

The easiest way to check whether your data is utf8 is to read it as "raw" bytes and try decoding it as utf8. If that succeeds, the data is utf8. This works well because non-ASCII, non-utf8 data will virtually ALWAYS throw an error if you try to interpret it as utf8.

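    use Encode;

    # Read the whole file as raw bytes, with no decoding layer.
    open( my $fh, "<:raw", $filename ) or die "Cannot open $filename: $!";
    local $/;    # undefined record separator: slurp the file in one read
    $_ = <$fh>;

    # With FB_CROAK, decode() dies on the first malformed byte sequence.
    eval { $_ = decode( 'utf8', $_, Encode::FB_CROAK ) };
    if ( $@ ) {
        print "$filename is NOT UTF8\n";
    }
    else {
        print "$filename IS UTF8\n";
    }

Note that when given a pure ASCII file, the above will say "$filename IS UTF8", which of course is true, since ASCII is a subset of utf8.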
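
To see the idea on known inputs, here is a minimal sketch that wraps the same decode-and-catch test in a function and runs it on three byte strings. The function name looks_like_utf8 and the sample strings are made up for illustration. It also uses the encoding name 'UTF-8', which Encode documents as the strict variant of 'utf8'; that swap is an assumption of this sketch, not something from the snippet above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from the original post): returns 1 when the
# byte string is well-formed UTF-8, 0 otherwise.
sub looks_like_utf8 {
    my ($bytes) = @_;

    # Work on a copy: when a CHECK argument is given, decode() may
    # modify the string passed to it.
    my $copy = $bytes;

    # 'UTF-8' is Encode's strict name; the lax 'utf8' also works here.
    return eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK ); 1 } ? 1 : 0;
}

# Illustrative samples: pure ASCII, UTF-8 encoded text, Latin-1 encoded text.
my @samples = (
    [ 'ASCII',   "plain text" ],
    [ 'UTF-8',   "caf\xC3\xA9 au lait" ],
    [ 'Latin-1', "caf\xE9 au lait" ],
);

for my $sample (@samples) {
    my ( $label, $bytes ) = @$sample;
    printf "%-8s %s\n", $label,
        looks_like_utf8($bytes) ? "decodes as utf8" : "does NOT decode as utf8";
}
```

The Latin-1 sample fails because the byte 0xE9 announces a multi-byte sequence but is followed by a space rather than a continuation byte, which is the typical way legacy single-byte encodings break the utf8 rules.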

UPDATE: Just noticed a missing semi-colon at the end of the eval block -- fixed it.


Re^4: Detect the Charset of an file
by endymion (Acolyte) on Oct 23, 2013 at 06:48 UTC
    Hello graff, I tried your great code, but I got another error. I'll now try running file -i via system.
      Sorry about the problem. I just noticed that I had left out a semi-colon when I first posted that snippet -- that's fixed now, in case you want to try again.
Re^4: Detect the Charset of an file
by Anonymous Monk on Oct 24, 2013 at 13:47 UTC
    No problem. I noticed it myself and fixed it in the script. Maybe your great script will help others with the same problem. I solved it by running file -i via system; it works great, and I no longer have any problem with the XML parser. Thanks for your great help.
