Re^3: Detect the Charset of an file

by graff (Chancellor)
on Oct 23, 2013 at 05:21 UTC (#1059281)

in reply to Re^2: Detect the Charset of an file
in thread Detect the Charset of an file

You said:

I only need some code to detect if the file is already utf8, then we don't do the recode.

The easiest way to check whether your data is utf8 is to read it as "raw" and try decoding it from utf8. If that succeeds, the data is valid utf8. This works well because non-ASCII data in any other encoding will virtually ALWAYS throw an error when you try to interpret it as utf8.

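    use Encode;

    # Read the whole file as raw bytes.
    open( my $fh, "<:raw", $filename ) or die "Cannot open $filename: $!";
    local $/;
    $_ = <$fh>;

    # FB_CROAK makes decode() die on the first malformed sequence.
    eval { $_ = decode( 'utf8', $_, Encode::FB_CROAK ) };
    if ( $@ ) {
        print "$filename is NOT UTF8\n";
    }
    else {
        print "$filename IS UTF8\n";
    }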
Note that when given an ASCII file, the above will say "$filename IS UTF8", which of course is true.
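
If you want to tell plain ASCII apart from "real" multi-byte utf8, the same check can be wrapped in a small sub. This is only a sketch: the names classify_bytes and classify_file are made up for illustration, and it assumes the strict 'UTF-8' encoding name rather than the laxer 'utf8'.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original post): classify a byte string.
# Returns 'ascii', 'utf8', or 'other'.
sub classify_bytes {
    my ($bytes) = @_;

    # No byte above 0x7F means plain ASCII, which is also valid UTF-8.
    return 'ascii' unless $bytes =~ /[\x80-\xFF]/;

    # Work on a copy: with a CHECK argument, decode() may modify its input.
    my $copy = $bytes;
    my $ok = eval { decode( 'UTF-8', $copy, Encode::FB_CROAK ); 1 };
    return $ok ? 'utf8' : 'other';
}

# Hypothetical wrapper: slurp a file as raw bytes and classify it.
sub classify_file {
    my ($filename) = @_;
    open( my $fh, '<:raw', $filename ) or die "Cannot open $filename: $!";
    local $/;    # slurp mode
    my $bytes = <$fh>;
    close $fh;
    return classify_bytes( defined $bytes ? $bytes : '' );
}
```

With that, the recode step would only need to run when classify_file($filename) returns 'other'.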

UPDATE: Just noticed a missing semi-colon at the end of the eval block -- fixed it.

Replies are listed 'Best First'.
Re^4: Detect the Charset of an file
by endymion (Acolyte) on Oct 23, 2013 at 06:48 UTC
    Hello graff, I tried your great stuff, but I get another bug. I'll try now with file -i with system.
      Sorry about the problem. I just noticed that I had left out a semi-colon when I first posted that snippet -- that's fixed now, in case you want to try again.
Re^4: Detect the Charset of an file
by Anonymous Monk on Oct 24, 2013 at 13:47 UTC
    No problem. I spotted that myself and fixed it in the script. Maybe your great script will help others with the same problem. I solved it with system file -i, which works great, and I have no problems with the XML parser. Thanks for your great help.
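
    For anyone who wants to go the file -i route, here is a rough sketch of how it might look. It assumes a Linux-style file command whose -i option prints the MIME type and charset (on some other systems that option is spelled differently), and the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset out of one line of `file -i` output,
# e.g. "notes.xml: application/xml; charset=utf-8".
sub charset_from_file_output {
    my ($line) = @_;
    return $line =~ /charset=([\w.-]+)/ ? lc $1 : undef;
}

# Hypothetical wrapper: run `file -i` on a path.
# The list form of the pipe open avoids passing the name through a shell.
sub charset_of {
    my ($filename) = @_;
    open( my $pipe, '-|', 'file', '-i', $filename )
        or die "Cannot run file: $!";
    my $line = <$pipe>;
    close $pipe;
    return defined $line ? charset_from_file_output($line) : undef;
}
```

    Keep in mind that file guesses from a sample of the content and typically reports plain ASCII as us-ascii, so both utf-8 and us-ascii would count as "no recode needed".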
