Re: Dealing with non-ascii characters when reading file.

If I think there's something hinky about a file because it contains "unexpected" byte values, I would check its inventory of byte values, with something like this:

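#!/usr/bin/perl

use strict;
use warnings;

die "Usage: $0 file.name\n" unless ( @ARGV == 1 and -f $ARGV[0] );
my $file = shift;
open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
binmode $fh;

# slurp the whole file as raw bytes
my $data = do { local $/; <$fh> };
close $fh;

my %char_hist;

# count each byte, keyed by its two-digit hex value
for my $c ( split //, $data ) {
    $char_hist{ sprintf( "%02x", ord( $c )) }++;
}
# two-digit hex keys sort in byte-value order
for my $c ( sort keys %char_hist ) {
    printf "%s\t%d\n", $c, $char_hist{$c};
}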

(That's just a toy version to try it out on files that aren't seriously large. I'd do it differently for general use.)
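A sketch of one way to handle larger files, not necessarily the approach meant above: read the file in fixed-size blocks and count into a 256-element array, so memory use doesn't grow with file size. The sub name byte_histogram and the 64 KB default block size are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count byte values in a file, reading fixed-size blocks so that memory
# use stays flat. Returns a reference to a 256-element array of counts.
# (byte_histogram and the default block size are illustrative choices.)
sub byte_histogram {
    my ( $path, $block_size ) = @_;
    $block_size ||= 65536;

    open( my $fh, '<:raw', $path ) or die "Can't open $path: $!\n";

    my @count = (0) x 256;
    my $buf;
    while (1) {
        my $n = read( $fh, $buf, $block_size );
        die "Read error on $path: $!\n" unless defined $n;
        last if $n == 0;
        # unpack 'C*' yields one unsigned byte value per byte in the buffer
        $count[$_]++ for unpack( 'C*', $buf );
    }
    close $fh;
    return \@count;
}

# When given a file name on the command line, print the byte values that occur.
if ( @ARGV == 1 and -f $ARGV[0] ) {
    my $count = byte_histogram( $ARGV[0] );
    for my $byte ( 0 .. 255 ) {
        printf "%02x\t%d\n", $byte, $count->[$byte] if $count->[$byte];
    }
}
```

The output format is the same as the toy version: hex byte value, tab, count, with values that never occur left out.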

It's sometimes surprising what you can learn about a file just by looking at a histogram of its byte values - seeing which values occur, and which ones don't.

(If you happen to know that a file contains utf8-encoded text, you can learn a lot by looking at a histogram of its Unicode characters. I posted a script for that too: unichist -- count/summarize characters in data.)
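For a rough idea of what a character-level count involves, here is a minimal sketch. It is not the unichist script, just an illustration using the core Encode module; the sub name char_histogram is made up for this example. It decodes the bytes strictly as UTF-8 (dying on malformed input) and counts by code point.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a byte string as strict UTF-8 and count its characters.
# Returns a hash reference mapping each character to its count.
# (char_histogram is an illustrative name, not taken from unichist.)
sub char_histogram {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed UTF-8 instead of
    # substituting replacement characters.
    my $text = decode( 'UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC );
    my %count;
    $count{$_}++ for split //, $text;
    return \%count;
}

# When given a file name on the command line, print code point and count.
if ( @ARGV == 1 and -f $ARGV[0] ) {
    open( my $fh, '<:raw', $ARGV[0] ) or die "Can't open $ARGV[0]: $!\n";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $count = char_histogram($bytes);
    for my $c ( sort { ord($a) <=> ord($b) } keys %$count ) {
        printf "U+%04X\t%d\n", ord($c), $count->{$c};
    }
}
```

If the decode step dies, that is itself useful information: the file is not clean UTF-8, and the byte-level histogram is the better tool for it.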
