Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
All,
Recently, I have been writing algorithms to play games on FaceBook. This is the real game for me and I often explore hunches to see if they teach me anything. The most recent game is called Word Twister and you are presented with 7 randomly selected characters (dups allowed). The object is to come up with as many words comprised of 3 or more of those characters within a certain time-frame.

A naive approach to this might be:

  1. Generate the powerset of the given characters
  2. For each subset, generate all the permutations
  3. For each permutation, check to see if it is in the dictionary
This is obviously not a very efficient approach but assuming you can fit the dictionary in memory as a hash, will work on a small number of characters such as this game (7). Update: If you are interested in the math and exclude duplicates - see this.

A smarter approach to this might be:

my $given = $ARGV[0]; # ... while (<$dict_fh>) { chomp; print "$_\n" if is_subset($_, $given); } sub is_subset { my ($str1, $str2) = @_; my (%h1, %h2); $h1{$_}++ for split //, $str1; $h2{$_}++ for split //, $str2; for my $chr (keys %h2) { return if ! $h1{$chr} || $h2{$chr} > $h1{$chr}; } return 1; }
There are obvious optimizations to be had such as pre-processing the dictionary and loading it in the hash form rather than creating it each time. If you don't look too closely, the algorithm scales based on the number of characters in the dictionary and the the number of characters in the given string.

I wanted to see if I could do better. My hunch told me there should be a way to

  1. Convert the strings to numbers
  2. Subtract the two numbers
  3. Apply a test to the difference to determine if it was a subset
This would turn the problem into:
my $input = $ARGV[0]; $input = str2val($input); # dictionary already pre-processed # create hash of possible subset values # ... while (<$dict_fh>) { chomp; my ($word, $val) = split /\t/; print "$word\n" if $is_subset{$input - $val}; }

I found that the answer was yes if you set a limit on the maximum length of input you would accept. Here is a brief explanation of the math:

a = 1 b = (a * max) + 1 c = (b * max) + 1 ... z = (y * max) + 1
What this does is count the total occurrences of each letter (order isn't important) and allow you to represent it as a single number. For instance, if max = 5 and your alphabet was only A .. F:
sub str2val { my ($str) = @_; my %convert = (A => 1, B => 6, C => 31, D => 156, E => 781, F => 3 +906); my $val = 0; for my $char (split //, $str) { $val += $convert{$char}; } return $val; } sub val2str { my ($val) = @_; my %convert = (A => 1, B => 6, C => 31, D => 156, E => 781, F => 3 +906); my $str = ''; return $str if $val < 1; for my $char (sort {$convert{$b} <=> $convert{$a}} keys %convert) +{ my $count = int($val / $convert{$char}); $str .= ($char x $count); $val %= $convert{$char}; } return $str; }
Ok, but how is the magic %is_subset constructed? It is simply the values of the powerset. Wait a minute I said. If I am generating the powerset why do I go through the hoops of converting to a number - why don't I just look up strings. Then I remember that in strings the order of letters is important (the naive brute force above). Ok, if I don't have to generate the permutations, this might be a practical approach.

I start looking at the math and I realize the approach without modification is not practical. To support a max string length of 7, the value of Z would be 1_564_580_056_274_625_717_608. The reason the numbers get so high so quickly is because you are summing maxN for N = 0 .. 25. Even if I am only doing subtraction, these numbers are too big to be working with and I want to be able to support strings larger than 7.

I started to think about a couple of ways that still may make this practical. The first was to not consider the general case (entire alphabet) but to work with only the characters in the given string. This was doomed to failure after I ran the numbers. First, you would have to generate all combinations of the various different string lengths you wanted to support. Remember the formula for N choose K is C = K! / N!(K - N)! So for N = 7, there are 657,800 unique combinations. You have to split the dictionary into that many files and then when you get your input string you can open the dictionary that only contains those letters (a smaller alphabet). You can see this solves one problem by creating a new one because you would need 1,562,275 additional files to support an input string of length 8. Of course, these needn't be actual files (database) but the problem is still not scalable and pre-processing will take forever.

The second approach was to go with a hybrid option. First, perform frequency analysis on the database to order the letters from most to least frequent. This means that instead of Z having the highest numerical value, Q would. We could then shave off the bottom X letters from the alphabet. The idea was to have two dictionary files (1 with words containing uncommon letters and 1 with only common letters). Once the input is examined, we dispatch to the straight forward brute forward approach with the alternate dictionary if an uncommon letter is observed and the weird math approach otherwise. After checking the numbers I realized it was still untenable.

What am I doing again?

Suddenly it dawns on me that a hybrid approach is what I want but not the one I was working on. The reason the naive brute force didn't work is because after generating the powerset, generating the permutations was necessary because order of letters in hash lookups. The weird math approach solved the order problem but created a new problem (numbers to big to work with). Could I eliminate the order of letters issue without using the weird math approach. Of course, just pre-process the dictionary by sorting the letters.

my $input = $ARGV[0]; my %is_subset = map {$_ => 1} powerset(split //, $input); while (<$dict_fh>) { chomp; my ($word, $normalized) = split /\t/; print $word if $is_subset{$normalized}; } sub powerset { # ... return map {join '', sort @$_} @subsets; }

This is what I call turning a problem on its (?:head|side|ear). I look at it in a completely different way to see if I learn anything. In this case, I learned several ways not to solve the problem but that's ok - I had fun and this was the real game for me anyway. I just wonder if other people do this? If so, what approaches do you take and do you have any examples to share?

Cheers - L~R


In reply to Turning A Problem Upside Down by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (3)
As of 2024-04-19 19:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found