allolex has asked for the wisdom of the Perl Monks concerning the following question:
I originally wrote this dictionary comparison tool as part of an
ongoing linguistics project. The script compares a text file with
a compressed dictionary file (one word per line) and spits out various
bits of information. You can use it to get a list of the words in your
text that match the dictionary, the words that do not match the
dictionary, and to print out debugging information if strange tokens
are printing out in your word lists. For the word list options, it also
prints out the number of matches for a particular token.
The script is useful because it is not possible for any single dictionary
to serve all needs. This script can quickly show how well a dictionary matches
the texts it is used on. (For the linguists out there, think about the possibilities
of a lexicon that only covers a particular word field or word set and allows you to
compare that with any given text.)
Basically, what I am looking for is a critique of my code and style,
turning this code (which does work, BTW) into a learning experience for
me. So here is the whole thing (including POD) in <readmore>
tags. Thanks in advance.
PS: I plan to put this code in the Catacombs once it has undergone
sufficient peer review... :)
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;
use Getopt::Long;
use Pod::Usage;
my $VERSION = 0.7;
my $dictfile = 'dict.gz';
# Process command-line options
my $help = '';
my $man = '';
my $version = '';
my $token_debug = '';
my $glossary_output = '';
my $dictionary_output = '';
GetOptions( 'help|?' => \$help, 'version' => \$version, 'man' => \$man, 'token-debug' => \$token_debug, 'glossary' => \$glossary_output, 'dictionary' => \$dictionary_output );
print "This is version $VERSION of $0.\n" if $version;
exit(0) if ($version);
pod2usage(1) if $help;
pod2usage(-exitstatus => 0, -verbose => 2) if $man;
my $file = shift;
my %dictionary = readdict(\$dictfile);
my %glossary;
findwords();
printlexicon(\%dictionary) if $dictionary_output;
printlexicon(\%glossary) if $glossary_output;
# Readdict reads in the dictionary file defined above using
# the Compress::Zlib CPAN module. It returns a hash that is
# used for all further dictionary operations.
#
sub readdict {
    my $dict = shift;
    my %dicthash;
    my $gz = gzopen($$dict, "rb") or die "Cannot open $$dict: $gzerrno\n";
    while ($gz->gzreadline($_) > 0) {
        chomp;
        $dicthash{lc($_)} = 0;
    }
    die "Error reading from $$dict: $gzerrno\n" if $gzerrno != Z_STREAM_END;
    return %dicthash;
}
# findwords() reads in a file and compares words found in the file
# with the contents of the dictionary read in by the readdict
# function. It assigns counts to the elements of %dictionary and
# creates %glossary elements and increases its values according to
# the number of matches.
#
sub findwords {
    open my $if, "<", $file or die "Could not open $file: $!";
    while (<$if>) {
        chomp;
        my @elements = split(/[ ']/,$_);
        foreach my $element (@elements) {
            next if $element =~ /[^A-Za-zÀ-ÿ]/; # Don't need digits
            $element = lc($element);
            $element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;
            next if $element eq '';
            print "[$element]\n" if $token_debug;
            # If the word matches a word in the dictionary, increase
            # the match count by one, otherwise assign it to the
            # glossary of words not found in the dictionary and up
            # the glossary count.
            if ( exists $dictionary{$element} ) {
                $dictionary{$element}++;
            } else {
                $glossary{$element}++;
            }
        }
    }
}
# printlexicon reads in a lexicon hash via a reference and prints out all words
# that have been seen in the findwords() function along with a frequency count.
#
sub printlexicon {
    my $lexicon = shift;
    my $counter = 0;
    foreach my $key (sort keys %$lexicon) {
        if ( $$lexicon{$key} > 0 ) {
            print $key . " : " . $$lexicon{$key} . "\n";
            $counter++;
        }
    }
    print "\n$counter entries total\n";
}
__END__
=pod
=head1 dict-compare
A generic script for building dictionaries by comparing them to real-world texts.
=head1 SYNOPSIS
C<< dict-compare [--glossary --dictionary] [--token-debug] file > output_file >>
=head1 DESCRIPTION
This program compares the words in a given text file to a list of words from
a dictionary file. It is capable of outputting lists of words that occur or
do not occur in a given dictionary file, along with their frequency in the
text. Debugging output using token tag marks is also available.
=head2 Command-Line Options
=over 12
=item C<--help,-h,-?>
Prints a usage help screen.
=item C<--man,-m>
Prints out the manual entry for $0
=item C<--version,-v>
Prints out the program version.
=item C<--glossary>
Prints a glossary of words not found in the dictionary file and the number of
times they occur.
=item C<--dictionary>
Prints out the words from the text that had a dictionary match, along with
their respective frequencies.
=item C<--token-debug>
Prints tags around each token in the text to help sound out strange tokens.
=back
=head1 EXAMPLE
C<dict-compare --glossary myfile.txt>
This command reads in the text contained in myfile.txt and prints out a list
of words not found in the dictionary and their frequencies.
=head1 AUTHOR
Damon "allolex" Davison - <allolex@sdf.freeshell.org>
=head1 LICENSE
This code is released under the same terms as Perl itself.
=cut
NB: If you want to reproduce the dictionary so you can actually run this
script as-is, *nix users can take the words file (/usr/share/dict/)
and compress it as dict.gz using gzip. Alternatively, you could just write five lines/five words in a text editor and compress it...
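If gzip is not to hand, a five-word test dictionary can also be produced from Perl with the Compress::Zlib module the script already loads. This is only a minimal sketch: the word list, the temporary directory and the write_dict name are placeholders invented for this example.

```perl
use strict;
use warnings;
use Compress::Zlib;
use File::Temp qw(tempdir);

# Hypothetical helper: write words to a gzip-compressed file,
# one word per line, which is the layout readdict() expects.
sub write_dict {
    my ($path, @words) = @_;
    my $gz = gzopen($path, "wb") or die "Cannot open $path: $gzerrno\n";
    for my $word (@words) {
        $gz->gzwrite("$word\n");
    }
    $gz->gzclose();
    return scalar @words;
}

# Assumed location: a throwaway directory rather than the real dict.gz.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/dict.gz";
my $n    = write_dict($path, qw(apple banana cherry date elder));
print "wrote $n words to $path\n";
```

Pointing $dictfile at the resulting file should be enough to run the script against it.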
--
Allolex
Re: Constructive criticism of a dictionary / text comparison script
by sauoq (Abbot) on Aug 29, 2003 at 23:21 UTC
All in all, it looks fine. And the fact that it works is a big point in its favor. :-) The one thing that immediately stood out to me was your backwhack happiness in the character class in this line:
$element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;
There is nothing wrong with writing that as
[\s,!?._;)("'-]
instead. (Note that the dash ('-') should be last.) Many regex metachars are just fine inside a character class. You really only need to be careful with '\', ']', '-', and '^'. (I.e. the character class: []^\\-])
Also, on a different line you used a literal space inside the character class. That's fine but sometimes it is easier to read if you use \x20 instead.
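A quick way to convince yourself that the two forms are equivalent is to run both over the same string. This is just a sketch: the sample string and the strip_escaped / strip_plain names are made up for the comparison, and the plain class keeps the ')' that the original line had.

```perl
use strict;
use warnings;

# The heavily escaped class from the script.
sub strip_escaped {
    my ($s) = @_;
    $s =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;
    return $s;
}

# Same set of characters without the unnecessary backslashes;
# '-' goes last so it cannot be read as a range.
sub strip_plain {
    my ($s) = @_;
    $s =~ s/[\s,!?._;)("'-]//g;
    return $s;
}

my $sample = qq{("well-known", isn't_it?!) ok; done.};
print strip_escaped($sample), "\n";
print strip_plain($sample),   "\n";

# \x20 instead of a literal space in the split class.
my @parts = split /[\x20']/, "l'uomo che";
print join('|', @parts), "\n";
```

Both subs should print the same stripped string, and the split behaves exactly as it does with a literal space.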
-sauoq
"My two cents aren't worth a dime.";
Yes. Definitely backslash-happy. This is something that I have wondered about, but never really remembered to look up or ask. It's much easier to read your way. No fear, I will not go defining character class ranges with the dash. :) And using \x20 instead of a literal space is something that never occurred to me before, but seems like such an obviously good idea, that I'll now probably write a bunch of scripts that totally overuse it. ;)
--
Allolex
Re: Constructive criticism of a dictionary / text comparison script
by ajdelore (Pilgrim) on Aug 29, 2003 at 22:50 UTC
This is more a suggestion on functionality than a critique of code. One thing that I ran into with my boggle script is that the unix dict file doesn't have variants of words. For example, it has huge but not hugely, fish but not fishes or fishing, etc.
Ideally, you would have some kind of functionality to address this. One possibility is to stem words before you check them. I know that Lingua::Stem implements one popular algorithm to do this. I didn't look into it closely enough to see if it would do the trick for me.
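To make the idea concrete, here is a deliberately naive sketch of stem-before-lookup. It is not the algorithm Lingua::Stem implements: the suffix list, the toy dictionary and the naive_stem name are all assumptions made up for this example, and a real stemmer handles far more cases.

```perl
use strict;
use warnings;

# Toy dictionary holding base forms only, like the unix words file.
my %dictionary = map { $_ => 0 } qw(huge fish walk);

# Hypothetical helper: try the word as-is, then with a few common
# English suffixes removed (optionally restoring a dropped final 'e').
sub naive_stem {
    my ($word, $dict) = @_;
    $word = lc $word;
    return $word if exists $dict->{$word};
    for my $suffix (qw(ing ly es ed s)) {
        next unless $word =~ /^(.+)\Q$suffix\E$/;
        my $stem = $1;
        for my $candidate ($stem, $stem . 'e') {
            return $candidate if exists $dict->{$candidate};
        }
    }
    return;    # no known stem
}

for my $word (qw(hugely fishes fishing walked stemage)) {
    my $stem = naive_stem($word, \%dictionary);
    printf "%-8s => %s\n", $word, defined $stem ? $stem : '(not found)';
}
```

The lookup only succeeds when the stripped form is already in the dictionary, so novel words still fall through to the "not found" side.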
</ajdelore>
I really like your idea and it would work very well if I were dealing with texts in languages that all had a stemming module. I am seriously considering writing one for French. Currently, I am working with Italian, which does have Lingua::Stem::It, but my dictionary has word forms as well. The huge advantage of working with a stemmer is that it is also capable of stemming novel constructions (like stemage), which the dictionary does not account for. It would be a very interesting modification to create a dictionary of stem forms, but it would also be a lot more work checking its accuracy.
What would really be cool is a stemming module that defined all affixes via a hash of some kind, so that tense, mode/mood, plural, person, etc. could be looked up like
my %hash_of_verb_suffixes = (
    # each tense maps to an array ref so the lists do not flatten into the hash
    future      => [ qw([ei]rò [ei]rai [ei]rà [ei]remo [ei]rete [ei]ranno) ],
    conditional => [ qw([ei]rei [ei]resti [ei]rebbe [ei]remmo [ei]reste [ei]rebbero) ],
);
and so on.
Oh, wait. That's a POS tagger;)
In any case, I can see we think along similar lines. Thanks!
--
Allolex
Re: Constructive criticism of a dictionary / text comparison script
by Hutta (Scribe) on Aug 29, 2003 at 23:41 UTC
I normally wouldn't be this picky, but you were asking for comments on coding style, so I'd have this to offer:
GetOptions(
'help|?' => \$help,
'version' => \$version,
'man' => \$man,
'token-debug' => \$token_debug,
'glossary' => \$glossary_output,
'dictionary' => \$dictionary_output
);
This tends to be an easier-to-read way to format hash-like structures.
I also tend to put all of my command-line options into a single hash (%opt, or %arg, or something) when they're going to be global. This saves variable namespace, but may be a petty concern.
Yes, that is a lot easier to read than the separate option declarations. Actually, I thought I reformatted that before I posted it here... Go figure :) I also like the idea of declaring the option init values in a hash, since it would make the code more legible (not to mention namespace economy). I'll definitely make both of those changes.
--
Allolex
Well, you bring up a good point. The problem with passing hashes to, say, functions is that when you misspell variables/keys, you wind up with a maintenance issue. So when you write a function call like
add(-number1=>10,-nuber2=>0),
you get a right result, but not the right way. Granted, this is the easiest bug to bring out; try adding 5 and 3 to get 0. But when you do it with more complex scenarios, you can get really weird software bugs. Also, it makes it harder to refactor code when you wish to remove parameters, add them or make different requirements, 'cuz when they get called, they may not break... unless you check for every old parameter and new one in your functions. Yuck. Just a rant :)
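One way to take some of the sting out of this is to have the function reject keys it does not know. The sketch below assumes a made-up add() with -number1 and -number2 parameters, as in the call above; nothing like it exists in the script under review.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical add() taking named parameters; unknown or missing
# keys are fatal instead of being silently ignored.
sub add {
    my %args  = @_;
    my %known = map { $_ => 1 } qw(-number1 -number2);
    for my $key (keys %args) {
        croak "add(): unknown parameter '$key'" unless $known{$key};
    }
    for my $key (sort keys %known) {
        croak "add(): missing parameter '$key'" unless defined $args{$key};
    }
    return $args{-number1} + $args{-number2};
}

print add(-number1 => 10, -number2 => 0), "\n";

# The misspelled key is now caught at the call site.
my $ok = eval { add(-number1 => 10, -nuber2 => 0); 1 };
print "caught: $@" unless $ok;
```

It does cost a check on every call, but a removed or renamed parameter then breaks loudly instead of quietly.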
Re: Constructive criticism of a dictionary / text comparison script
by Not_a_Number (Prior) on Aug 30, 2003 at 08:49 UTC
Hi allolex. There is a problem that nobody has yet mentioned. It concerns this line:
next if $element =~ /[^A-Za-zÀ-ÿ]/;
This is doing a lot more than you want it to, I think. Basically, it means "ignore any $element containing a character not in the set defined between square brackets". It is therefore stripping out, for example, any 'word' with attached punctuation. For example, in a sentence such as:
"Shut up!" he said.
you are throwing away three quarters of your 'words'! And you are also, of course, ignoring hyphenated words.
It also means that the line:
$element =~ s/[\s\,\!\?\.\-\_\;\)\(\"\']//g;
never actually does anything, with or without surplus backslashes...
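Here is a small sketch that shows the effect on that sentence. It copies the split and the filter from findwords(); the surviving_tokens name is invented for the demonstration, and \xC0-\xFF is my assumption for what the accented range means in a Latin-1 source file.

```perl
use strict;
use warnings;

# Tokenise one line the way the original findwords() does and
# return the tokens that survive the "non-letter" filter.
sub surviving_tokens {
    my ($line) = @_;
    my @kept;
    for my $element (split /[ ']/, $line) {
        # \xC0-\xFF stands in for the accented range in the original class.
        next if $element =~ /[^A-Za-z\xC0-\xFF]/;
        next if $element eq '';
        push @kept, lc $element;
    }
    return @kept;
}

my @kept = surviving_tokens(q{"Shut up!" he said.});
print scalar(@kept), " of 4 tokens kept: @kept\n";
```

Only the one token without attached punctuation gets through, so the substitution that follows in the script never sees anything it could strip.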
hth
dave
sub findwords {
    open my $if, "<", $file or die "Could not open $file: $!";
    while (<$if>) {
        chomp;
        my @elements = split(/[ '-]/,$_); # split on hyphens, too
        foreach my $element (@elements) {
            next if $element =~ /\d/; # Don't need digits
            $element = lc($element);
            $element =~ s/[\s,!?._;)("'-]//g; # thanks sauoq
            next if $element eq '';
            print "[$element]\n" if $token_debug;
            if ( exists $dictionary{$element} ) {
                $dictionary{$element}++;
            } else {
                $glossary{$element}++;
            }
        }
    }
}
Thanks a lot! I think that was another relic from a previous version. I'm glad you caught it.
--
Allolex
Re: Constructive criticism of a dictionary / text comparison script
by TomDLux (Vicar) on Aug 30, 2003 at 02:20 UTC
Initializing variables to an empty string is generally no advantage. Why not collapse those declarations into one line?
my ( $help, $man, $version, $token_debug, $glossary_output, $dictionary_output );
# or
my ( $help, $man, $version, $token_debug, $glossary_output, $dictionary_output ) = ( '', '', '', '', '', '' );
GetOptions will also take a reference to a hash as its first argument, instead of all the variables. Much more concise and you don't have to worry three pages down what that variable was called.
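Roughly, the hash form could look like the sketch below. It uses GetOptionsFromArray (also exported by Getopt::Long) only so the example can run on a fixed argument list instead of @ARGV; the parse_options wrapper and the sample arguments are made up for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse the script's options into one hash; keys are the option names.
sub parse_options {
    my @args = @_;
    my %opt;
    GetOptionsFromArray(
        \@args, \%opt,
        'help|?',
        'version',
        'man',
        'token-debug',
        'glossary',
        'dictionary',
    ) or die "Bad command-line options\n";
    return ( \%opt, @args );    # options plus leftover arguments
}

my ( $opt, @rest ) = parse_options(qw(--glossary --token-debug myfile.txt));
print "glossary requested\n" if $opt->{glossary};
print "token debugging on\n" if $opt->{'token-debug'};
print "input file: $rest[0]\n";
```

In the script itself the equivalent would be a plain GetOptions(\%opt, ...) call, with tests like $opt{glossary} replacing the six separate variables.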
--
TTTATCGGTCGTTATATAGATGTTTGCA
If you're gonna initialize all those vars to the same value, you might as well do it using x, as in
my( $help,
$man,
$version,
$token_debug,
$glossary_output, $dictionary_output ) = ('') x 6;
MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!" | I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README). | ** The third rule of perl club is a statement of fact: pod is sexy. |
Re: Constructive criticism of a dictionary / text comparison script
by halley (Prior) on Aug 30, 2003 at 15:44 UTC
Thanks. I didn't know about the project, but I don't work with English much at all. What might be interesting is to compile a list of similar resources for other languages as well. I wish there were a compendium linguisticae for precisely this sort of thing, but I think there are too many people working on very particular projects for this to happen. Maybe me... someday.
Before I completely forget and go off on another tangent (they kind of happen to me a lot), have you seen Kevin's Word List Page? There are some really interesting specialty lists there.
--
Allolex
Re: Constructive criticism of a dictionary / text comparison script
by allolex (Curate) on Sep 03, 2003 at 18:15 UTC