http://www.perlmonks.org?node_id=556117

sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys.

Typically I code using the easiest-to-read-and-understand method but I was just playing with an idea earlier and was curious on the "better" ways you can come up with to solve this problem.

You have a dictionary file of, say, 1 million words. You want to write a script to take RANDOMLY 100 words of each letter (a .. z). You want 100 words beginning with A, 100 words beginning with B, etc.

This is what I did. It's untested and doesn't need to work since I'm looking for how you guys would solve this.

open(DICT, "dictionary.txt" or die "Error: $!"; my @words = <DICT>; close(DICT) or die "Error: $!"; @words = shuffle(@words); # assume shuffle is defined my @new_words; my @letters = (a .. z); my @cnt = 0; foreach my $letter (@letters) { $cnt = 0; foreach my $word (@words) { if ($cnt > 100) { last; # break loop, our 100 word is found } if ($word =~ m/^$letter/i) { $cnt++; push (@new_words, $word); } } }


"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

sulfericacid

Replies are listed 'Best First'.
Re: improving the efficiency of a script
by Zaxo (Archbishop) on Jun 18, 2006 at 16:48 UTC

    If this is going to be used a lot, I'd either stuff the dictionary file into a database, or else construct an index to the offsets and sizes of initial letter sections of the dictionary file.

    Assuming the dictionary file is alphabetically sorted, you don't need to slurp the whole file into an array. That is a large chunk of memory for a million words. Allocations that size will slow you painfully if you are driven into swap.

    Try just building an array with the a's, shuffling, and taking the first hundred elements. Then discard the a's and replace with the b's, all in a while loop that only reads one line at a time.

    You don't need a loop to pick the first hundred elements of an array. A slice will do,

    @array[0..99]
    and is much faster.

    After Compline,
    Zaxo

      Zaxo,
      This is very similar to the idea I had. I discovered that compiling the offsets using DBM::Deep was extremely slow, but was fast for subsequent runs. This also has the advantage of not requiring the dictionary file to be sorted.
      #!/usr/bin/perl use strict; use warnings; use DBM::Deep; open(my $dict, '<', 'words.raw') or die "Unable to open 'words.raw' fo +r reading: $!"; my $db = DBM::Deep->new("offsets.db"); build_db($db, $dict) if ! scalar keys %$db; for my $char ('a' .. 'z') { for (1 .. 100) { print get_rand_word($db, $char, $dict); } } sub build_db { my ($db, $dict) = @_; my $pos = tell $dict; while ( <$dict> ) { my $char = substr($_, 0, 1); push @{$db->{$char}}, $pos; $pos = tell $dict; } } sub get_rand_word { my ($db, $char, $dict) = @_; my $offset = $db->{$char}[rand @{$db->{$char}}]; seek $dict, $offset, 0; my $word = <$dict>; return $word; }
      Other options include Storable and DBD::SQLite if a real RDBMS isn't available.

      Cheers - L~R

Re: improving the efficiency of a script
by lima1 (Curate) on Jun 18, 2006 at 16:32 UTC
    if your dictionary is sorted, you could create a "map" of your dictionary with binary search: e.g. search for "b", "c" etc (most implementations return the position of the query in the dict., even when query is not found). it is log(n). so m * log(n) for the map and m * 100 for the word selection.

    your solution would be m * n.

      I like your solution, lima1.

      Here's an example of how one might implement it:

      #!/usr/bin/perl -w # Strict use strict; use warnings; # User-defined my $wordfile = "words.txt"; # Libraries use FileHandle; use File::Basename; use Data::Dumper; # Main program my $iam = basename $0; # Read words, saving file offsets my $fh = new FileHandle; open($fh, '<', $wordfile) or die "$iam: can't read $wordfile ($!)\n"; my ($word, %word_offsets_by_letter); while (1) { my $offset = tell($fh); defined($word = <$fh>) or last; chomp $word; if ($word =~ /^(.)/) { my $first_letter = lc $1; $word_offsets_by_letter{$first_letter} ||= [ ]; push @{$word_offsets_by_letter{$first_letter}}, $offset; } } # Test (this will give 100 random words beginning with 'a') foreach (1..100) { my $next = random_word_starting_with("a", $fh); print "Next word => $next\n"; } # Subroutines # # random_word_starting_with # # In: $1 ... the first letter of the word (eg. 'a', 'b', 'c', etc.) # $2 ... the open filehandle of the word file # sub random_word_starting_with { my ($first_letter, $fh) = @_; my $p = $word_offsets_by_letter{lc $first_letter}; my $offset = $p->[int rand @$p]; seek($fh, $offset, 0) or die "$iam: failed to seek ($!)\n"; my $word = <$fh>; chomp $word; return $word; }

      And if this is something you need to do a lot of, I like Zaxo's suggestion of using a database.


      s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/

      As for binary searching for first letters in a sorted dictionary, I've done that in my submission to the word ladder problem which was a problem in perl quiz of the week expert edition week 22.

      However, I don't quite see how that would help in this word selection problem. Unless you somehow preprocess the dictionary, you cannot select a random word in constant time even if you know the offsets of the first letters, as just choosing a random offset favors long words. So, IMO, you have to read the whole or most of the dictionary file for this problem unless it's preprocessed in some way. Random access to the file may still slightly help after reading it completely, but not very much unless there are very long words in the "dictionary".

Re: improving the efficiency of a script (random sample)
by ambrus (Abbot) on Jun 18, 2006 at 20:46 UTC

    I show a one-pass solution to this problem using the combinatorical algorithm. Here one-pass means that you need only O(gm) memory if you want to print g words and the maximal word length is m, you have to read the file only once and don't even know the number of words in the dictionary in advance. Apart from this, I don't take emphasis on that the algorithm doesn't take too much computation time. That could also be easily done (while still keeping the previous efficency conditions true). For that, see algorithm R in chapter 3.4.2 in Knuth, but I leave the implementation as an exercise to the reader.

    You didn't say if there's any requirement on the order of the words printed, so I assume it can be anything (whatever is simplest to implement). I'll also assume that if there's fewer than 100 words starting with a certain letter, we have to print all of them. And naturally assume the usual disclaimer for the code: I put this together fast and it may have errors.

    As a simpler example, I first show how to just select 100 words uniformly randomly from a dictionary, independently of first letters.

    use warnings; use strict; my $g = 100; my @c; my $n = 0; while(<>) { i +f (rand() < $g / ++$n) { splice @c, int(rand(@c)), $g <= @c, $_; } } +print for @c;
    Now doing this for every letter we get this:
    use warnings; use strict; my $g = 100; my %c; my %n; while(<>) { my $l + = /(.)/ && lc($1); my $c = \@{$c{$l}}; if (rand() < $g / ++$n{$l}) { + splice @$c, int(rand(@$c)), $g <= @$c, $_; } } print @$_ for values( +%c);

    Update. Another one-pass solution would be to use heaps. You create a heap for each letter, add words as you read them to the corresponding heap using a random number as priority, and popping an element if the heap is larger than 100. I guess that this would be less CPU-efficent as the above mentioned good algorithm in Knuth if well implemented.

    Update 2008 oct 9: see also Randomly select N lines from a file, on the fly.

    Update 2009-12-26: see also Random sampling a variable record-length file. which – by the time you look there – should have some good solutions as well.

Re: improving the efficiency of a script
by McDarren (Abbot) on Jun 18, 2006 at 16:48 UTC
    My dictionary (/usr/share/dict/words) only has 96274 words :)

    But anyway, here is my go at this - I'm not sure how efficient (or otherwise) it is - but it seems to work :)

    Update 1: blah - I just realised that it doesn't work at all - the shuffle isn't doing anything. I'll update it again later when I get it working.

    Update 2: - okay, working now :) (I think)

    #!/usr/bin/perl -w use strict; use List::Util qw(shuffle); my $dict = '/usr/share/dict/words'; open DICT, "<", $dict or die "Cannot open $dict:$!\n"; my %words; WORD: while (my $word = <DICT>) { chomp $word; my ($letter) = $word =~ /^(\w)/; next WORD if !$letter; push @{$words{$letter}}, $word; } close DICT; foreach my $letter (sort keys %words) { @{$words{$letter}} = shuffle(@{$words{$letter}}); for my $number (1 .. 100) { print "$words{$letter}[$number]\n"; } }

    Cheers,
    Darren :)

      I'm not sure how efficient (or otherwise) it is

      Much better than OPs solution ;) but time and space requirements for the data preparation is still O(n). much more than 0(m * log(n)) and O(m), respectively (once again, for sorted dictionaries only).

      So if dictionary is unsorted, the memory requirements are not a problem and if this task is not often repeated, yours is the best solution IMHO.

Re: improving the efficiency of a script
by sh1tn (Priest) on Jun 18, 2006 at 16:48 UTC
    my @words = <DICT>; my %data; /^(\w)/ and push @{$data{$1}}, $_ for @words; for my $letter (sort keys %data){ for my $number (1..100){ my $rand = int rand @{$data{$letter}}; print "Letter: $letter\tnumber: $number\trand word: ", $data{$letter}[$rand]; } }


Re: improving the efficiency of a script
by TedPride (Priest) on Jun 19, 2006 at 04:42 UTC
    This really takes two passes unless you want to load the whole thing into memory. The first pass counts how many words there are corresponding to each letter, and the second pass retrieves the chosen words from the file.
    use strict; use warnings; my ($words, $file, $handle, %range, %c, %n, $c, @words); $words = 100; $file = 'dictionary.txt'; $range{$_} = () for 'a'..'z'; open ($handle, $file); while (<$handle>) { $c = substr($_, 0, 1); next if !exists $range{$c}; $c{$c}++; } close ($handle); for (values %c) { choose($_); } open ($handle, $file); while (<$handle>) { $c = substr($_, 0, 1); next if !exists $c{$c} || $n{$c}++ < $c{$c}[-1]; chomp; push @words, $_; pop @{$c{$c}}; delete $c{$c} if $#{$c{$c}} == -1; } close ($handle); print join "\n", sort @words; ### Pick random numbers in range sub choose { my @c = 0..($_[0]-1); for (0..($words-1)) { swap(\@c, $_, rand ($_[0] - $_) + $_); } $_[0] = [sort {$b <=> $a} @c[0..($words-1)]]; } ### Swap two array items sub swap { my ($r, $x, $y, $t) = @_; $t = $r->[$x]; $r->[$x] = $r->[$y]; $r->[$y] = $t; }

      For a modest size file (1 M words x about 6 characters = 6 MB) loading it all into memory is likely to be much faster than going out to the file system twice. On the other hand the file system may cache it for you - ya just gotta suck it and see. Even then it may be different tomorrow than it is today.


      DWIM is Perl's answer to Gödel
      It doesn't necessarily require two passes.

      It seems reasonable to assume that a dictionary file is sorted case-insensitively (and simple to make it so if it isn't already). Then you can read the file building up a list of words beginning with letter 'a'. As soon as you come across the first word beginning with 'b', make your random selection of 100 words beginning with 'a'and then discard the word list and re-initialise it with the 'b' word just read. Repeat until you've gone through the alphabet.

      That way the job is done with a single pass but you don't have to keep the entire dictionary in memory, just one letter's worth.

      use strict; use warnings; my $howMany = shift || 100; my $done = 0; my $rxValidWord = qr{^([A-Za-z])[-a-z]+}; my $totalWordCt = 0; my $validWordCt = 0; my %letterCts = (); my @letters = (q{a} .. q{z}); my $currentLetter = q{}; my $rxCurrent = getNextLetter(); my @wordCache = (); my $dictFile = q{../Web2}; open my $dictFH, q{<}, $dictFile or die qq{open: $dictFile: $!\n}; while(<$dictFH>) { $totalWordCt ++; next if $done; next unless /$rxValidWord/; my $initLetter = $1; $validWordCt ++; chomp; if($initLetter =~ $rxCurrent) { push @wordCache, $_; } else { reportForLetter(); } } reportForLetter() if $currentLetter; close $dictFH or die qq{close: $dictFile: $!\n}; print qq{\n}, qq{Total words in dictionary - $totalWordCt\n}, qq{Valid words processed - $validWordCt\n}; sub generateSlice { my ($total, $howMany) = @_; my $rlPool = [0 .. ($total - 1)]; return $rlPool if $total <= $howMany; my $rlSlice = []; for (1 .. $howMany) { push @$rlSlice, splice @$rlPool, int rand scalar @$rlPool, 1; } return [sort {$a <=> $b} @$rlSlice]; } sub getNextLetter { if(@letters) { $currentLetter = shift @letters; my $charClass = q{[} . $currentLetter . uc $currentLetter . q{]}; return qr{$charClass}; } else { $currentLetter = q{}; $done ++; return 0; } } sub reportForLetter { my $savedWord = $_; print qq{\n}, qq{Found @{[scalar @wordCache]} words }, qq{for letter $currentLetter\n}; my $rlSlice = generateSlice(scalar @wordCache, $howMany); my @randomWords = @wordCache[@$rlSlice]; print qq{$howMany words selected at random\n}; print qq{ $_\n} for @randomWords; @wordCache = ($savedWord); $rxCurrent = getNextLetter(); }

      When run to look for 5 random word it produces this

      Cheers,

      JohnGG

Re: improving the efficiency of a script
by sulfericacid (Deacon) on Jun 21, 2006 at 03:36 UTC
    I appreciate the help and insight, everyone. I have a lot of information to read through to get a better idea of how BETTER to do things of this nature.

    I don't know why it did not dawn on me that slurping a huge file (my @blah = <DICT>) would be suicide as opposed to reading it line by line directly and pulling off only things I need.

    I probably should have figured that one out, but I do appreciate all the valuable advice everyone has put here.



    "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

    sulfericacid