Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

unichist -- count/summarize characters in data

by graff (Chancellor)
on Sep 19, 2006 at 07:02 UTC ( #573646=sourcecode: print w/ replies, xml ) Need Help??

Category: Text Processing
Author/Contact Info David Graff, Linguistic Data Consortium
Description: A nice tool for getting to know your unicode or other non-ASCII text data. For all files named in @ARGV, or all data piped to STDIN, we count and report how many times each character occurs, and optionally summarize the quantities according to the current standard set of Unicode "charts" (groupings of characters according to language or function). Also lets you convert legacy encoded data to utf8 on input. Check the POD for details.

(updated to fix a minor glitch involving code points gt "\x{2FA1F}")
(second update: added logic to check for malformed utf8 and adjust output if needed) (third update: major revision (thanks to Jim) to use unicore/Blocks.txt instead of __DATA__; also added "--names" option, changed "--chart" option to "--blocks", and altered spacing of output table columns)

#!/usr/bin/perl

=head1 NAME

unichist

=head1 SYNOPSIS

unichist [--blocks] [--names] [--hex[only]] [--top=N] [--enc=ENC] [fil
+e ...]

 --blocks  : print summary of character counts per Unicode "block"
 --names   : print full Unicode character name for each code point
 --hex     : include the Unicode hexadecimal code point value
 --hexonly : show the hex code point instead of the character itself
 --top=N   : print only the N most frequent characters
 --enc=ENC : convert input from ENC to utf8 (def: input is utf8)

=head1 DESCRIPTION

For text data provided on STDIN or in one or more files named on the
command line, this program will print the list of characters occurring
in the data, together with the frequency of occurrence for each
character.

By default, input is assumed to be in utf8, and all characters present
in the input are counted and listed on STDOUT, one character per line,
in their "standard" order (i.e. the numeric ordering determined by
their Unicode code point values), with the number of occurrences
following each character.

If the input uses some known encoding other than utf8, simply name the
encoding with the "--enc=..." option (e.g. "--enc=cp1252" or
"--enc=UTF-16LE").  If the value of this option is not recognized as a
known encoding, the program exits with an error message listing the
known encodings.  (Note that ASCII is a proper subset of utf8; the
--enc option is only needed when data is neither ASCII nor utf8.)

Output is always in utf8 (but with the "--hexonly" option, it will
always be just plain ASCII).

The --hex option will put the hexadecimal code point value at the
beginning of each output line, followed by the utf8 character and then
its frequency.  Use --hexonly to output just the hexadecimal value and
frequency without the actual utf8 character itself (useful when your
display window is unable to handle utf8 data correctly).

The "--top=N" option will cause only the N most frequent characters to
be listed, in descending order of frequency.  If N is 0, no characters
will be listed (useful in combination with the "--blocks" option,
described below).  If N is -1, all characters will be listed in
descending order of frequency (instead of the default code-point
order).

The "--blocks" option will produce a supplemental set of output lines,
breaking the character counts into groups according the the "block"
pages defined by the Unicode Standard "Blocks.txt" file that comes
with your version of Perl.  Usually, a given data file would contain
characters spanning just a few distinct charts, and this form of
listing is useful when checking for "outlier characters".  To see only
this summary of character counts by chart, use this option in
combination with "--top=0", to turn off the listing of individual
characters with their frequencies.

The "--names" option will include the full Unicode name (if any) for
each code point listed in the one-character-per-line output.  Note
that some blocks of characters (e.g. the Unified CJK set) do not have
individual names for each code point.

=head1 LIMITATIONS/BUGS

Be aware that some byte sequences that are parsable as utf8 do not
correspond to defined characters -- there are gaps in the code point
sequence.  When Unicode explicitly omits some ranges of code points
as "unassigned" in "Blocks.txt", characters that fall within these
unassigned ranges will always be listed by their hex value only in
the default output, and grouped together into a single "unassigned"
class when the "--blocks" option is used.  But there are some
"unassigned" code points that we are not able to identify as such.
When in doubt, use the "--hexonly" option, and check the output
against the code charts as published at www.unicode.org/charts/.

It's also possible that unicode input might contain characters in a
region called the "Private Use Area" (UE000-UF8FF), which means that
the correct label or interpretation of the character depends on the
whim of whoever created the data.  Another possible anomaly for utf8
input is the presence of "characters" in the so-called "Surrogate Area
+"
(UD800-UDFFF), which indicates an encoding error by whoever created
the data.  In both cases, the default output will show only the hex
values of the observed code points in these ranges, and the "--blocks"
output will indicate these regions as "Private Use" or "Surrogate".

=head1 OTHER NOTES

The default output listing uses the symbolic names NUL, TAB, LFD, RTN,
SPC, NBSP and DEL for the null-byte, the five common whitespace
characters and \x7F, respectively; other ASCII and Latin-1 control
characters are presented in "\xHH" notation.

=head1 AUTHOR

David Graff ( graff (at) ldc (dot) upenn (dot) edu )

=cut

use strict;
use Encode;
use Getopt::Long;

my $Usage = "Usage: $0 [--blocks] [--names] [--hex[only]] [--top=N] [-
+-enc=ENC] [file ...]\n";

my %opt;
GetOptions( \%opt, 'blocks', 'names', 'hex', 'hexonly', 'enc=s', 'top=
+i' )
    or die $Usage;

die $Usage if ( @ARGV == 0 and -t );

my $inmode = ':utf8';
if ( $opt{enc} ) {
    my @enclist = Encode->encodings(":all");
    listEncodings( $Usage, @enclist ) unless ( grep /$opt{enc}/, @encl
+ist );
    $inmode = ":encoding($opt{enc})";
}

binmode STDOUT, ':utf8';
my ( %names, %char_hist, %class_hist, %class_def, %unassigned );

for my $name ( split /^/, do 'unicore/Name.pl' ) {
    chomp $name;
    my ( $h, $n ) = split /\t/, $name, 2;
    $names{chr(hex($h))} = $n;
}

# load the definitions of "chart" character classes
#  and unassigned ranges:

my $last_end = -1;
( my $blocks_path = $INC{'unicore/Name.pl'} ) =~ s/Name.pl/Blocks.txt/
+;
open( BLKS, "<", $blocks_path );
while (<BLKS>) {
    next unless ( /^([0-9A-F]+)\.\.([0-9A-F]+); (.*)/ );
    my ( $bgn, $end, $name ) = ( $1, $2, $3 );
    $bgn = chr( hex( $bgn ));
    $class_def{$bgn}{limit} = chr( hex( $end ));
    $class_def{$bgn}{title} = $name;
    if ( $last_end+1 != ord( $bgn )) {
        for my $val ( $last_end+1 .. ord($bgn)-1 ) {
            $unassigned{chr($val)} = undef;
        }
    }
    if ( $name =~ /Surrogate|Privat Use/ ) {
        for my $chr ( $bgn .. $class_def{$bgn}{limit} ) {
            $unassigned{$chr} = $name;
        }
    }
    $last_end = hex( $end );
}

my $max_bin = 0;
if ( @ARGV == 0 ) {
    binmode STDIN, $inmode;
    $max_bin = count_chars( \%char_hist );
} else {
    my $lcl_max;
    for my $file ( @ARGV ) {
        $lcl_max = count_chars( \%char_hist, $file, $inmode );
        $max_bin = $lcl_max if ( $max_bin < $lcl_max );
    }
}

my $nwidth = 1 + int( log( $max_bin ) / log( 10 ));
my $nonchrformat = "%04X %${nwidth}d  %s\n";
my $lblformat = ( $opt{names} ) ? "\t%${nwidth}d  %s" : "\t%${nwidth}d
+";
my $chrformat = ( $opt{hexonly} ) ? "%04X$lblformat\n" :
                ( $opt{hex} ) ? "%04X %4s$lblformat\n" : "%4s$lblforma
+t\n";

# check for malformed characters and adjust histogram if needed
for my $char ( keys %char_hist ) {
    my $charnum = ( $char eq "\x00" ) ? 0 : ord( $char ) || 0xFFFD;
    if ( $charnum == 0xFFFD ) {
        $char_hist{chr($charnum)} += $char_hist{$char};
        delete $char_hist{$char};
    }
}

my @outorder;
if ( not exists( $opt{top} )) {
    @outorder = sort keys %char_hist;
}
elsif ( $opt{top} != 0 ) {
    $opt{top} = scalar keys %char_hist if ( $opt{top} < 0 );
    @outorder = ( sort {$char_hist{$b} <=> $char_hist{$a}}
                        keys %char_hist )[0 .. $opt{top}-1];
}
my %symbols = ( " " => 'SPC',
                 "\t" => 'TAB',
                 "\n" => 'LFD',
                 "\r" => 'RTN',
                 "\x00" => 'NUL',
                 "\x7F" => 'DEL',
                 "\xA0" => 'NBSP',
              );

for my $char ( @outorder ) {
    my $chrnum = ord( $char );
    if ( exists( $unassigned{$char} ) or $chrnum > $last_end ) {
        my $status = $unassigned{$char} || 'unassigned';
        printf( $nonchrformat, $chrnum, $char_hist{$char}, $status );
        next;
    }
    my @args = ();
    push @args, $chrnum if ( $chrformat =~ /X/ );
    if ( ! $opt{hexonly} ) {
        push @args, (( exists( $symbols{$char} )) ? $symbols{$char} :
                      ( $char lt ' ' || $char =~ /\x80-\x9F/ ) ? 
                       sprintf( "\\x%02x", $chrnum ) : $char );
    }
    push @args, $char_hist{$char};
    push @args, $names{$char} if ( $opt{names} );
    printf( $chrformat, @args );
}
if ( $opt{blocks} ) {
    count_classes( \%char_hist, \%class_def, \%unassigned, \%class_his
+t );
    print "\n";
    for my $class ( sort keys %class_def ) {
        printf( "%04x-%04x %d\t%s\n",
                ord($class), ord($class_def{$class}{limit}),
                $class_hist{$class}, $class_def{$class}{title} )
            if ( $class_hist{$class} );
    }
    printf( "xxxx-xxxx %d\t unassigned\n", $class_hist{unassigned} )
        if ( $class_hist{unassigned} );
}

sub count_chars
{
    my ( $hist, $file, $mode ) = @_;
    my $fh;
    if ( defined $file ) {
        open( $fh, "<$mode", $file ) or die "$file: $!";
    } else {
        $fh = \*STDIN;
    }
    while ( <$fh> ) {
        for my $ch ( split // ) {
            $$hist{$ch}++;
        }
    }
    my $max = 0;
    for my $c ( keys %$hist ) {
        $max = $$hist{$c} if ( $max < $$hist{$c} );
    }
    return $max;
}

sub count_classes
{
    my ( $ch_hist, $cl_def, $non_chr, $cl_hist ) = @_;

    my @start = sort keys %$cl_def;
    my $bgn = shift @start;
    for my $chr ( sort keys %$ch_hist ) {
        if ( exists( $$non_chr{$chr} )) {
            my $class = $$non_chr{$chr} || 'unassigned';
            $$cl_hist{$class}++;
            next;
        }
        while ( @start and $chr gt $$cl_def{$bgn}{limit} ) {
            $bgn = shift @start;
        }
        if ( $chr gt $$cl_def{$bgn}{limit} ) {
            $$cl_hist{unassigned} += $$ch_hist{$chr};
        } else {
            $$cl_hist{$bgn} += $$ch_hist{$chr};
        }
    }
}

sub listEncodings
{   # user is asking for help: list all available encodings
    my ( $Usage, @enclist ) = @_;
    my $colwidth = length( (sort {length($b) <=> length($a)} @enclist)
+[0] ) + 2;
    my $ncol = int( 80/$colwidth );
    my $nrow = int( scalar(@enclist)/$ncol );
    $nrow++ if ( scalar(@enclist) % $ncol );
    my $fmt = "%-${colwidth}s";

    print $Usage, "\n  Acceptable values for ENC are:\n";
    foreach my $r ( 0 .. $nrow ) {
        foreach my $c ( 0 .. $ncol ) {
            my $i = $c * $nrow + $r;
            printf( $fmt, $enclist[$i] );
        }
        print "\n";
    }
    exit( 0 );
}

Comment on unichist -- count/summarize characters in data
Download Code

Back to Code Catacombs

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://573646]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (15)
As of 2014-07-31 13:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (248 votes), past polls