Beefy Boxes and Bandwidth Generously Provided by pair Networks Joe
Keep It Simple, Stupid
 
PerlMonks  

Re: Challenge: 8 Letters, Most Words

by davido (Archbishop)
on Oct 07, 2013 at 20:41 UTC ( #1057309=note: print w/ replies, xml ) Need Help??


in reply to Challenge: 8 Letters, Most Words

I believe that this will always produce a correct result. So here is a solution that gets it right in under 1 minute 40 seconds, using SQLite:

Update: Here's a newer version that has better reporting:

use strict; use warnings; use DBI; # Helpers for indexing into data structures; use constant DECOMP_KEY => 0; use constant DECOMP_LETTER_COUNTS => 1; use constant DECOMP_BUCKET_COUNT => 2; use constant BUCKETS_LETTER_COUNTS => 0; use constant BUCKETS_COUNT => 1; use constant DB_ALPHAS => 0; use constant DB_COUNT => 1; # The dictionary is.... use constant DICTIONARY => '2of12inf.txt'; my( $bags_aref, $shorter_aref, %alpha_freq ) = ( [], [], () ); open my $dict_fh, '<:encoding(utf8)', DICTIONARY or die $!; while( my $word = <$dict_fh> ) { next unless $word =~ m/^([a-z]{1,8})\b/; my $wanted_word = $1; push( ( length $wanted_word == 8 ? $bags_aref : $shorter_aref ), $wa +nted_word ); $alpha_freq{$_}++ for split //, $wanted_word; } # Find those buckets that have the least common letter in target word. # Use this as a comparison key for fastest rejection of letters from b +ags. my @ordered_letters = sort { $alpha_freq{$a} <=> $alpha_freq{$b} } keys %alpha_freq; # Open the db, and create the empty 'buckets' table. my $dbh = open_db( \@ordered_letters ); { # Create the buckets. These will become database rows, so we only n +eed them # for a short time. my %buckets; foreach my $word ( @$bags_aref ) { my $decomposed = decompose_word($word); if( exists $buckets{$decomposed->[DECOMP_KEY]} ) { $buckets{$decomposed->[DECOMP_KEY]}[BUCKETS_COUNT]++; } else { $buckets{$decomposed->[DECOMP_KEY]} = [ $decomposed->[DECOMP_LETTER_COUNTS], $decomposed->[DECOMP_BUCKET_COUNT] ]; } } # Build the database. Doing it in a transaction gives better perfor +mance. # This creates a bucket for each unique letter-set from words of len +gth == 8. # If there are several of the same letter-set, the bucket count inse +rted is # some value greater than 1. Otherwise, 1. # Column letter indices (eg, a, b, c) are a lie; we use frequency-as +cending # column order in reality, which is determined at runtime. # \ /--- letters + count ----------\ / # L a b c d e f +g h i j k l m n o p q r s t u v w x y z C my $insert = q{ INSERT INTO 'buckets' VALUES ( ?, ?, ?, ?, ?, ?, ?, +?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? ); }; $dbh->do( 'BEGIN TRANSACTION;' ); my $sth = $dbh->prepare($insert); print "Building database...\n"; foreach my $bucket_key ( keys %buckets ) { print "Creating bucket: $bucket_key\n"; my( $letters, $letter_counts, $bucket_count ) = ( $bucket_key, $buckets{$bucket_key}[DB_ALPHAS], $buckets{$bucket_key}[DB_COUNT] ); $sth->execute( $letters, @{$letter_counts}{@ordered_letters}, $bucket_count ); } $dbh->do( 'COMMIT TRANSACTION;' ); } # Database is built. We're done with %buckets. print "All 8-letter words processed. Database built.\n"; print "After " . scalar @$bags_aref . " 8-letter words:\n"; gather_stats($dbh); print "\n"; # Now, tally each bucket by accumulating all the words of size < 8. print "Adding ", scalar @$shorter_aref, " words less than eight letter +s...\n"; $dbh->do( 'BEGIN TRANSACTION' ); my $entry = 0; # An iteration counter for outputting periodic progres +s report. # Iterate over every word of length < 8. foreach my $word ( @$shorter_aref ) { # Get the word's composition. my( $letters, $letter_counts, $bucket_count ) = @{ decompose_word($w +ord) }; # We only care about the letters that exist in $word. # We use least frequent first order, as an optimization. my @criteria_keys = grep { $letter_counts->{$_} != 0 } @ordered_lett +ers; my @criteria_values = @{$letter_counts}{@criteria_keys}; # Example: UPDATE buckets SET count = count + 1 WHERE b >= 1 and o > += 2; ('boo') # Any buckets where these criteria are met could spell "boo" my $query = q{UPDATE 'buckets' SET count = count + 1 WHERE }; my $where_clause = join ' AND ', map { "$_ >= ?" } @criteria_keys; $query .= $where_clause . ';'; my $sth = $dbh->prepare( $query ); $sth->execute(@criteria_values); # Values come in the same order as + @ordered_letters. if( ++$entry % 2500 == 0 ) { print "Cumulative:\n"; print "After " . scalar @$bags_aref + $entry . " words:\n"; gather_stats($dbh); print "\n\n"; } } $dbh->do( 'COMMIT TRANSACTION' ); print "\nTotals\n------\n"; print "Processed ", ( scalar @$bags_aref + scalar @$shorter_aref ), " +words.\n"; gather_stats($dbh); # Pass in a database handle. Outputs current best letters and how man +y words # they spell. sub gather_stats { my $dbh = shift; # Now we find the maximum value in the "count" column among all buck +ets. my $sth = $dbh->prepare( 'SELECT MAX(count) FROM buckets' ); $sth->execute; my( $count ) = $sth->fetchrow_array; # Now we find any rows that have this maximum count. $sth = $dbh->prepare( q{SELECT letters FROM buckets WHERE count = ? +} ); $sth->execute($count); while( my @letters = $sth->fetchrow_array ) { print "(@letters)"; } print " spells $count words.\n"; } # Open up a database, create an empty "buckets" table that looks like: # letters, a, b, c, d, e, f, ... z, count. # "letters" is all the letters for a bucket together as a string. # a, b, c... will be in frequency-ascending order. Each value represen +ts number # of times a letter appears in this bucket. # "count" is a count of how many times the bucket can be used. sub open_db { my $ordered_letters_aref = shift; my $create_sql = q{CREATE TABLE 'buckets' ( letters VARCHAR(8), }; $create_sql .= "$_ INTEGER, " for @$ordered_letters_aref; $create_sql .= "count INTEGER )"; my $dbh = DBI->connect( "dbi:SQLite:dbname=eightletters", "", "", { RaiseError => 1 } ); $dbh->do( q{DROP TABLE IF EXISTS 'buckets';} ); my $sth = $dbh->prepare($create_sql); $sth->execute(); return $dbh; } # Pass in a word, get back a structure: # word == 'boo': # [ # 'boo', # In reverse letter frequency order # { a => 0, b => 1, c => 0, ... o => 2 ... z => 0, }, # 1, # Initial state of bucket; we found one occurrence. # ] sub decompose_word { my $word = shift; my @letters = sort { $alpha_freq{$a} <=> $alpha_freq{$b} } split //, + $word; my $letter_counts_href; $letter_counts_href->{$_} = 0 for 'a' .. 'z'; $letter_counts_href->{$_}++ for @letters; return [ join( '', @letters ), $letter_counts_href, 1 ]; }

The result I get with $ time ./eightletters.pl is:

....... (some output deleted for brevity) ...... Totals ------ Processed 40933 words. (pntirase) spells 346 words. real 1m38.981s user 1m37.770s sys 0m0.116s

The logic works as follows:

  • Grab all words of eight characters or less from the dictionary, as that's all we care about.
  • Determine the letter frequency from the trimmed dictionary.
  • Split the list of words into those with eight characters, and those with fewer.
  • Create buckets for each unique combination of letters that represents eight-letter words. In cases where the same combination repeats, just increment the counter for that bucket.
  • Insert all of the buckets into a database. Each row is a bucket. Each column is a count of how many times a letter is used in that bucket. The final column is a counter for how many times the bucket is used.
  • Now run through all the words of length less than eight. Select from the database every bucket that a given word can fit into, and increment the counters for each of those buckets.
  • Find the maximum count; ie, the counter for the bucket where the most words fit.
  • Find the row this occurs in, and grab the letters that the bucket represents.

In the database, the letter columns are in reverse-frequency order, or frequency-ascending order. That way, 'q' is the first letter column, and so on, finishing at 'e'. The queries also use this reverse-frequency order, so that the "AND"s within the "WHERE" clause can narrow down the list of records to include as early as possible. ...at least that's my theory. ;)

The code is embarrassingly dirty, and there are several significant opportunities to further trim cycles. But I wanted to put the ideas out there in case someone else wanted to consider the methodology.


Dave


Comment on Re: Challenge: 8 Letters, Most Words
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1057309]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (6)
As of 2014-04-18 00:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (460 votes), past polls