
Re: Dealing with non-ascii characters when reading file.

by graff (Chancellor)
on Sep 26, 2014 at 02:46 UTC


in reply to Dealing with non-ascii characters when reading file.

If I think there's something hinky about a file because it contains "unexpected" byte values, I would check its inventory of byte values, with something like this:
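    #!/usr/bin/perl
    use strict;
    use warnings;

    die "Usage: $0 file.name\n" unless ( @ARGV == 1 and -f $ARGV[0] );

    # Read the whole file as raw bytes (no encoding layer).
    open( my $fh, '<', $ARGV[0] ) or die "Can't open $ARGV[0]: $!\n";
    binmode $fh;
    local $/ = undef;    # slurp mode
    $_ = <$fh>;
    close $fh;

    # Count each byte value, keyed by its two-digit hex code.
    my %char_hist;
    for my $c ( split // ) {
        $char_hist{ sprintf( "%02x", ord( $c )) }++;
    }

    # Print the histogram in ascending byte order.
    for my $c ( sort keys %char_hist ) {
        printf "%s\t%d\n", $c, $char_hist{$c};
    }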
(That's just a toy version to try it out on files that aren't seriously large. I'd do it differently for general use.)
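The post doesn't say what the "general use" version would look like, so the following is only one possible sketch, not the author's actual approach. It assumes the main problem with the toy version is slurping the whole file into memory, and reads fixed-size chunks instead. The sub name byte_histogram and the 64 KB default chunk size are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count byte values
# by reading fixed-size chunks, so memory use stays flat for big files.
sub byte_histogram {
    my ( $path, $chunk_size ) = @_;
    $chunk_size ||= 65536;    # assumed default; tune as needed

    open( my $fh, '<:raw', $path ) or die "Can't open $path: $!\n";

    my @count = (0) x 256;    # one slot per possible byte value
    my $buf;
    while (1) {
        my $n = read( $fh, $buf, $chunk_size );
        die "Read error on $path: $!\n" unless defined $n;
        last if $n == 0;      # end of file
        # unpack 'C*' yields each byte of the chunk as an integer 0..255
        $count[$_]++ for unpack( 'C*', $buf );
    }
    close $fh;
    return \@count;
}

# Print a histogram for every file named on the command line,
# listing only the byte values that actually occur.
for my $path (@ARGV) {
    my $count = byte_histogram($path);
    print "$path\n" if @ARGV > 1;
    for my $byte ( 0 .. 255 ) {
        printf "%02x\t%d\n", $byte, $count->[$byte] if $count->[$byte];
    }
}
```

Using a 256-element array instead of a hash keyed by hex strings keeps the inner loop cheap and makes the output order fall out of the index, at the cost of formatting the hex code only when printing.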

It's sometimes surprising what you can learn about a file just by looking at a histogram of its byte values - seeing which values occur, and which ones don't.
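As a rough illustration of the kind of thing a byte histogram can tell you, here is a sketch that turns 256 byte counts into a few observations. The sub name describe_bytes and the particular checks are my own assumptions, not something from the post, and apart from the UTF-8 check (the bytes c0, c1 and f5 through ff never appear in valid UTF-8) they are hints rather than proof.

```perl
use strict;
use warnings;

# Hypothetical helper: turn 256 byte counts into a few rough observations.
# $count is an array ref indexed by byte value (0..255).
sub describe_bytes {
    my ($count) = @_;
    my @notes;

    # Total count over a list of byte values; missing slots count as 0.
    my $sum = sub { my $t = 0; $t += $count->[$_] // 0 for @_; $t };

    push @notes, 'only 7-bit bytes: plain ASCII'
        if $sum->( 0x80 .. 0xff ) == 0;

    push @notes, 'NUL bytes present: binary data, or maybe UTF-16/32'
        if $sum->(0x00);

    my ( $cr, $lf ) = ( $sum->(0x0d), $sum->(0x0a) );
    push @notes, 'CR and LF counts match: probably CRLF line endings'
        if $cr && $cr == $lf;

    push @notes, 'bytes that never occur in UTF-8 (c0, c1, f5..ff): not UTF-8'
        if $sum->( 0xc0, 0xc1, 0xf5 .. 0xff );

    return @notes;
}

# Small demo on an in-memory string instead of a file.
my @count = (0) x 256;
$count[ ord($_) ]++ for split //, "caf\xe9\r\n";
print "$_\n" for describe_bytes( \@count );
```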

(If you happen to know that a file contains UTF-8-encoded text, you can learn a lot by looking at a histogram of its Unicode characters - I posted a script for that too: unichist -- count/summarize characters in data.)
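For a quick idea of what a character-level histogram involves, here is a minimal sketch. It is not the unichist script, just an illustration that assumes the input really is UTF-8; the sub name codepoint_histogram is invented for the example. The only difference from the byte version is the decoding layer on the file handle, so each element counted is a code point rather than a byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use charnames ();    # for charnames::viacode()

# Hypothetical helper: count Unicode code points in a UTF-8 text file.
sub codepoint_histogram {
    my ($path) = @_;

    # The :encoding(UTF-8) layer decodes bytes into characters as we read.
    open( my $fh, '<:encoding(UTF-8)', $path )
        or die "Can't open $path: $!\n";

    my %count;    # code point (integer) => number of occurrences
    while ( my $line = <$fh> ) {
        $count{ ord($_) }++ for split //, $line;
    }
    close $fh;
    return \%count;
}

# Print code point, count, and character name for each file given.
for my $path (@ARGV) {
    my $count = codepoint_histogram($path);
    for my $cp ( sort { $a <=> $b } keys %$count ) {
        printf "U+%04X\t%d\t%s\n", $cp, $count->{$cp},
            charnames::viacode($cp) // '(unnamed)';
    }
}
```

If the file is not valid UTF-8 after all, the decoding layer will complain about malformed input, which is itself a useful hint to go back to the byte histogram.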
