Estimating Vocabulary

by YuckFoo (Abbot)
on Mar 27, 2002 at 02:59 UTC [id://154571]

The little woman was concerned about the size of our youngster's vocabulary. I told her, 'Relax, he knows thousands of words.' But how many does he really know? Of course a Perl program can help estimate.

This program prints a random sample of the dictionary. You count how many words of the sample are known and multiply by the multiplier the program prints to estimate the size of the vocabulary. I found it easiest to redirect the output to a file, use vi to delete all the unknown words, and count what's left over.

On four runs of 64-word samples, I gave my boy credit for knowing 19, 18, 19, and 17. The 46239 words in my dictionary gave a multiplier of 722.5, and 18 * 722.5 is about 13000 words.
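For the record, the arithmetic checks out; a quick sketch using only the numbers quoted above:

my $total = 46239;               # words in the dictionary
my $n     = 64;                  # sample size per run
my @known = (19, 18, 19, 17);    # words credited per run

my $sum = 0;
$sum += $_ for @known;

my $mult = $total / $n;          # 722.48...
my $avg  = $sum / @known;        # 18.25

printf "multiplier %.1f, estimate %.0f words\n", $mult, $avg * $mult;

That prints a multiplier of 722.5 and an estimate of 13185 words, in line with the rounder figures above.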

Now I think I'll go give wifey the test, see if she's satisfied with her score. :)

The code is nothing spectacular; anybody here could write it, and I'm sure many could make it a one-liner, but I think it's a CUFP nonetheless.

YuckFoo

#!/usr/bin/perl
use strict;

my $DICT = '/usr/dict/words';

my ($num) = @ARGV;
my ($i, $total, $mult, @words);

$num ||= 100;                        # default sample size

if (!open(IN, $DICT)) {
    print "\n$! - $DICT\n\n";
    exit;
}

$total = @words = <IN>;              # slurp the dictionary, count the words
$mult  = $total / $num;

# Print $num words drawn at random, without replacement.
for $i (1..$num) {
    print splice(@words, rand(@words), 1);
}

print "$total words in dictionary.\n";
print "Multiply number of words known in this list by $mult.\n";
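A typical session might look like this (the file name vocab.pl is just an assumption; save the script as whatever you like):

$ perl vocab.pl 64 > sample.txt    # print a 64-word sample
$ vi sample.txt                    # delete the words he doesn't know
$ wc -l sample.txt                 # count what's left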

Replies are listed 'Best First'.
Re: Estimating Vocabulary
by belg4mit (Prior) on Mar 27, 2002 at 03:36 UTC
    Well, I suppose that depends on your definition of 'word': am, are, is, was - are these each words? Also, IIRC, the English language is purported to have a lexicon on the order of 320,000 words*. The average American vocabulary has been in steady decline since the early twentieth century, at which point I believe it was on the order of several thousand words*. A few things to consider:
  • dictionaries may contain archaic forms
  • does your dictionary contain proper nouns? do you care?
  • the content of the language is not evenly distributed across the lexicon, e.g. a single word (sans modifiers) for "love" and a plethora for shades of blue.
  • * I shall attempt to find evidence to support this. An enlightening thread, but then again it is usenet... Apparently this is a pretty hotly contested topic.

    --
    perl -pe "s/\b;([st])/'\1/mg"

      Good points all, belg4mit.

      * If the sample is large enough, the correct percentage of archaic words will be in the sample; it'll work itself out.

      * I had already removed proper nouns (any word containing an uppercase letter); see the sketch after this list. I should have noted that, but again I'm not sure it matters with a large enough sample.

      * I'm not sure how words should really be counted; I'm still looking for a reference myself. For my purpose, I am counting run, runs, ran, and running as distinct words.
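      For reference, that proper-noun filtering amounts to a single grep over the word list; a sketch of the idea, not the exact code I used:

      # Drop anything containing an uppercase letter, i.e. proper nouns.
      @words = grep { !/[A-Z]/ } @words;    # after slurping <IN>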

      I'm just looking for a ballpark number. It seems like a fair assumption that if the boy consistently knows 20-25% of the words in the sample, he knows about 20-25% of the words in $DICT.
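      To put a rough error bar on that ballpark, here is a sketch that treats each sampled word as an independent known/unknown trial; the binomial model is my assumption, nothing rigorous:

      my $total = 46239;           # dictionary size
      my $n     = 64;              # words per sample
      my $p     = 18.25 / $n;      # average fraction known over the four runs

      my $se = sqrt($p * (1 - $p) / $n);   # standard error of the fraction

      printf "estimate %.0f words, give or take %.0f\n",
             $p * $total, $se * $total;

      That prints an estimate of 13185 words, give or take about 2610, so the four runs are consistent with 'about 13000'.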

      If anyone has pointers to real vocabulary development numbers and counting methods, I'd like to get 'em.

      YuckFoo

Re: Estimating Vocabulary
by belg4mit (Prior) on Mar 27, 2002 at 07:21 UTC
    Well, here's an alternative. It's a complete waste of cycles, as it scales linearly with the number of words returned; OTOH, it is not bounded by the size of the dictionary. (As is) it can also return duplicates, yada yada yada.
    my (@lines, $line);
    open(FILE, shift) || die;
    until (scalar @lines == $ARGV[0]) {
        seek(FILE, 0, $. = 0);    # rewind; the third arg also resets $.
        # perlfaq5 trick: the Nth line read replaces $line with odds 1/N,
        # so when the file is exhausted $line is a uniformly random line
        rand($.) < 1 && ($line = $_) while <FILE>;
        push(@lines, $line);
    }
    print @lines, "wc -l could have told you this is $. words\n";
    It's based on "How do I select a random line from a file?" in perlfaq5. I'd be interested in seeing if anybody else has a better means of extending this algorithm to report multiple entries.

    --
    perl -pe "s/\b;([st])/'\1/mg"

      my (@lines, $line);
      open(FILE, shift) || die;
      1 while <FILE>;              # first pass: read to EOF so $. holds the line count
      $line = $.;                  # total number of lines
      seek(FILE, 0, $. = 0);       # rewind and reset $.
      # second pass: keep each line with probability
      # (lines still wanted) / (lines not yet read)
      rand($line - $.) < $ARGV[0] - @lines && push(@lines, $_) while <FILE>;
      print @lines, "wc -l could have told you this is $. words\n";
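      This is essentially selection sampling (Knuth's Algorithm S): each line is kept with probability (lines still wanted) / (lines left to read), so the loop always ends with exactly $ARGV[0] lines and no duplicates. A standalone sketch of the same idea over an array; the sub name is made up:

      # Pick $want items from @pool in one pass, no duplicates.
      sub sample {
          my ($want, @pool) = @_;
          my @picked;
          my $left = @pool;        # items not yet considered
          for my $item (@pool) {
              # keep with probability (still wanted) / (still unconsidered)
              push(@picked, $item) if rand($left) < $want - @picked;
              $left--;
          }
          return @picked;
      }

      print join(' ', sample(3, 'a' .. 'z')), "\n";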
        UPDATE: Excellent!

        WAS: That does not appear to work; I ask for one line and get 13-18 lines... It is also heavily weighted towards the Zs.

        --
        perl -pe "s/\b;([st])/'\1/mg"
