Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

dict-compare: a dictionary evaluation script

by allolex (Curate)
on Sep 03, 2003 at 18:07 UTC ( #288692=sourcecode: print w/ replies, xml ) Need Help??

Category: Text Processing
Author/Contact Info /msg allolex
Description:

As promised in Constructive criticism of a dictionary / text comparison script, here is a cleaned-up version of the dictionary comparison code that the monks helped me with.

What follows I just copied out of the POD in the script itself. It might make this code easier to find.

A generic script for building dictionaries by comparing them to real-world texts.

This program compares the words in a given text file to a list of words from a dictionary file. It is capable of outputting lists of words that occur or do not occur in a given dictionary file, along with their frequency in the text. Debugging output using token tag marks is also available.

#!/usr/bin/perl

#  POD can be found at the bottom of this script

use strict;
use warnings;
use Compress::Zlib;
use Getopt::Long;
use Pod::Usage;

my $VERSION = 0.81;
my $dictfile = 'dict.gz';

#  Process command-line options

my %cl_options = (
    help             =>     '',
    version         =>     '',
    token_debug         =>     '',
    glossary_output     =>     '',
    dictionary_output     =>     ''
);

GetOptions(
        'help|?'      => \$cl_options{help},
        'version'     => \$cl_options{version},
        'man'         => \$cl_options{man},
        'token-debug' => \$cl_options{token_debug}, 
        'glossary'    => \$cl_options{glossary_output},
        'dictionary'  => \$cl_options{dictionary_output}    
);

print "This is version $VERSION of $0.\n" if $cl_options{version};
exit(0) if ($cl_options{version});
pod2usage(-exitstatus => 0, -verbose => 1, -msg => "Help for $0") if $
+cl_options{help};
pod2usage(-exitstatus => 0, -verbose => 2, -msg => "Man page for $0") 
+if $cl_options{man};

my $file = shift;
my %dictionary = readdict(\$dictfile);
my %glossary;

findwords();

printlexicon(\%dictionary) if $cl_options{dictionary_output};
printlexicon(\%glossary) if $cl_options{glossary_output};


#  Readdict reads in the dictionary file defined above using
#  the Compress:Zlib CPAN module.  It returns a hash that is
#  used for all further dictionary operations.
#
sub readdict {
    my $dict = shift;
    my %dicthash;

    my $gz = gzopen($$dict, "rb") or die "Cannot open $$dict: $gzerrno
+\n" ;
    while ($gz->gzreadline($_) > 0) {
        chomp;
        $dicthash{lc($_)} = 0;
    }
    die "Error reading from $$dict: $gzerrno\n" if $gzerrno != Z_STREA
+M_END ;
    return %dicthash;
}

#  findwords() reads in a file and compares words found in the file
#  with the contents of the dictionary read in by the readdict
#  function.  It assigns counts to the elements of %dictionary and
#  creates %glossary elements and increases its values according to
#  the number of matches.

sub findwords {
    open my $if, "<", $file || die "Could not open $file: $!";
    while (<$if>) {
        chomp;
    my @elements = split(/[ '-]/,$_); # split on hyphens, too
        foreach my $element (@elements) {
            next if $element =~ /\d/; #  Don't need digits
            print "[$element]->" if $cl_options{token_debug};
            $element = lc($element);
            $element =~ s/[\s,!?._;)("'-]//g; 
            print "[$element]\n" if $cl_options{token_debug};
            next if $element eq '';
            if ( exists $dictionary{$element} ) {
                $dictionary{$element}++;
            } else {
                $glossary{$element}++;
            }
        }
    }
}

#  Showmatches reads in a lexicon hash via a reference and prints all 
+words out 
#  that have been seen in the findwords() function along with a freque
+ncy count.
#
sub  printlexicon {
    my $lexicon = shift;
    my $counter = 0;
    foreach my $key (sort keys %$lexicon) {
        if ( $$lexicon{$key} > 0 ) {
            print $key . " : " . $$lexicon{$key} . "\n";
            $counter++;
        }
    }
    print "\n$counter entries total\n";
}

__END__

=pod

=head1 dict-compare

A generic script for building dictionaries by comparing them to real-w
+orld texts.

=head1 DESCRIPTION

This program compares the words in a given text file to a list of word
+s from
a dictionary file.  It is capable of outputting lists of words that oc
+cur or
do not occur in a given dictionary file, along with their frequency in
+ the
text.  Debugging output using token tag marks is also available.

=head1 SYNOPSIS

C<dict-compare [--glossary --dictionary] [--token-debug] file > output
+_file>

=head2 OPTIONS 

=over 12

=item C<--help,-h,-?>

Prints a usage help screen.

=item C<--man,-m>

Prints out the manual entry for $0

=item C<--version,-v>

Prints out the program version.

=item C<--glossary>

Prints a glossary of words not found in the dictionary file and the nu
+mber of
times they occur.

=item C<--dictionary>

Prints out the words from the text that had a dictionary match, along 
+with
their respective frequencies.

=item C<--token-debug>

Prints tags around each token in the text to help sound out strange to
+kens.
The tokens themselves are printed side-by-side to show how the script 
+cleans
up the results.

=back

=head1 EXAMPLE

C<dict-compare --glossary myfile.txt>

This command reads in the text contained in myfile.txt and prints out 
+a list
of words not found in the dictionary and their frequencies.

=back

=head1 DICTIONARY FORMAT

The dictionary is a one-word-per-line file that has been gzipped.  You
+r dictionary can be anything.  Think of the possibilities.

=head1 THANKS

The following people have reviewed and offered inprovements to this co
+de:

=over 12

=item B<Sauoq> L<http://www.perlmonks.org/index.pl?node_id=182681>

=item B<adjelore> L<http://www.perlmonks.org/index.pl?node_id=131479>

=item B<Hutta> L<http://www.perlmonks.org/index.pl?node_id=117788>

=item B<TomDLux> L<http://www.perlmonks.org/index.pl?node_id=144696>

=item B<Not_A_Number> L<http://www.perlmonks.org/index.pl?node_id=2587
+24>

=back

And of course all of the others at the Monastery, Cologne.pm whose hel
+p
can only be seen in its cumulative effect.

=head1 AUTHOR

Damon "allolex" Davison - <allolex@sdf.freeshell.org>

=head1 LICENSE

This code is released under the same terms as Perl itself.

=cut

Comment on dict-compare: a dictionary evaluation script
Download Code

Back to Code Catacombs

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: sourcecode [id://288692]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (8)
As of 2015-07-06 05:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (70 votes), past polls