Limbic~Region has asked for the
wisdom of the Perl Monks concerning the following question:
All,
In Hangman Assistant, Lawliet posted an interesting problem. The unstated objective is to maximize the number of games won while minimizing the number of incorrect guesses. To summarize my participation in that thread, I found that with a comprehensive pruning algorithm, you could achieve great results (better than binary search) by choosing the letter that had the highest probability of being in the mystery word. Out of approximately 60K words, this approach won 96.28% of the time and only averaged 1.74 wrong guesses when winning. A nominal improvement can be made by breaking ties by looking at which letter has the highest overall density amongst candidate words (96.31% and 1.73 average wrong guesses).
I also postulated that it should be possible to do even better by taking into consideration other factors. I started to do this by examining just the words that the algorithm failed to win against. I am finding little time to work on this but since this may appeal to more people, I am presenting it as a challenge. Rather than starting from scratch, here is the data that you can assume has already been calculated for you to render your decision for "best letter to choose".
$num_of_incorrect_guesses_remaining;
$num_of_candidate_words;
%list_of_candidate_words;
$current_state_of_mystery_word; # 'four' may look like **u*, meaning only u is known
%hash_of_letters_already_guessed;
# For each possible letter to choose from
# A hash with the following keys
num_of_words_present_in
total_count_of_letter_across_candidate_words
probability_of_being_right
probability_of_being_wrong
num_of_candidates_remaining_if_wrong
maximum_num_of_candidates_remaining_if_right
# Optionally for each possible letter to choose from
# A hash describing the count and position possibilities
# For instance:
e => {
    '03' => 47,
    '1'  => 99,
},
# Means that among the remaining candidate words, the letter 'e'
# comes in two flavors: either in positions 0 and 3, or only in
# position 1. So if 'e' were chosen and correct, either 47 or 99
# words would remain, the maximum being 99.
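As a concrete sketch, the per-letter data above could be computed from the candidate list along these lines. The hash keys mirror the ones listed, but the helper itself and its interface are illustrative, not from the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: build the per-letter statistics described above
# (word counts, overall density, position-pattern counts, probabilities)
# from the current candidate list.
sub letter_stats {
    my @candidates = @_;
    my %stat;
    for my $word (@candidates) {
        my %pos;    # letter => list of positions in this word
        my @chars = split //, $word;
        for my $i (0 .. $#chars) {
            push @{ $pos{ $chars[$i] } }, $i;
        }
        for my $letter (keys %pos) {
            $stat{$letter}{num_of_words_present_in}++;
            $stat{$letter}{total_count_of_letter_across_candidate_words}
                += scalar @{ $pos{$letter} };
            # Position pattern, e.g. 'e' at positions 0 and 3 => '03'
            $stat{$letter}{patterns}{ join '', @{ $pos{$letter} } }++;
        }
    }
    # Probability of being right = fraction of candidates containing the letter
    my $n = scalar @candidates;
    for my $letter (keys %stat) {
        my $p = $stat{$letter}{num_of_words_present_in} / $n;
        $stat{$letter}{probability_of_being_right} = $p;
        $stat{$letter}{probability_of_being_wrong} = 1 - $p;
    }
    return \%stat;
}
```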
From this, you can easily duplicate my algorithm by determining which letter to choose based on which letter appears in the most candidate words (highest probability of being right), breaking ties by using the highest overall density. You should, however, be able to do so much more. There are a couple of main approaches that I can think of. First, a one-size-fits-all algorithm that weighs various pieces of data to produce an overall score. The advantage here is that you should be able to tune the weights without touching code to produce the best results for a given "dictionary". The second approach would be to have multiple algorithms, each fine-tuned for a given set of conditions, and dispatch to one of them based on initial conditions. For instance, it may be ok to risk guessing wrong while the number of remaining wrong guesses is still high, or perhaps to use the "simple" algorithm already presented unless the probabilities and the candidates remaining (if right or wrong) are beyond some threshold.
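A minimal sketch of the "simple" strategy described above: pick the unguessed letter present in the most candidate words, breaking ties by overall density. The function name and interface are my own:

```perl
use strict;
use warnings;

# Pick the best next guess from a candidate list (array ref) and a hash
# ref of letters already guessed. Illustrative, not the original code.
sub best_letter {
    my ($candidates, $guessed) = @_;
    my (%in_words, %density);
    for my $word (@$candidates) {
        my %seen;
        for my $c (split //, $word) {
            next if $guessed->{$c};
            $density{$c}++;                      # every occurrence counts
            $in_words{$c}++ unless $seen{$c}++;  # each word counts once
        }
    }
    my ($best) = sort {
        $in_words{$b} <=> $in_words{$a}    # most candidate words first
            or
        $density{$b} <=> $density{$a}      # then highest overall density
            or
        $a cmp $b                          # deterministic fallback
    } keys %in_words;
    return $best;
}
```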
Even though both of those approaches leave implementations wide open, don't assume those are the only two ideas to be had. Feel free to come up with whatever you think is best. Heck, you may even want to have a preprocess fine tuning algorithm based on the dictionary that allows weights and/or algorithms to be chosen by the data.
Re: Challenge: Design A Better Hangman Algorithm by NiJo (Friar) on Jul 15, 2009 at 20:40 UTC 
A higher-order approach would be to look at probabilities for chains of letters. E.g., in English the chance of a 'u' should be higher than average after a 'q'.
This is based on the concept of apes with a typewriter producing texts (Sir Arthur Eddington, 1927). I own a 1988 special edition of Scientific American on computer problems (translated into German), in which a BASIC program is discussed that collects statistics from longer input texts and produces a random text, weighting the probabilities for the next letter with the collected statistics for the N preceding letters. Punctuation and space characters are included as letters in this randomly produced string. At N=4 one entry might be ('Per', 'l', 50%). At N=5 most of the words are correct and resemble the input text, but the sentences are nonsense.
Perl really shines at these applications. The author, Brian Hayes, hints at large memory requirements for the N-dimensional array, in which many fields remain empty. A perfect application for a hash...
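As a sketch of the hash-based approach hinted at here, one could count how often each letter follows each (N-1)-character context and read the conditional frequencies off a hash of hashes; the value of N and the tokenization are assumptions:

```perl
use strict;
use warnings;

# Count N-gram statistics over a text: for each (N-1)-character context,
# tally which letter follows it. Spaces and punctuation count as letters,
# as in the article described above.
sub ngram_counts {
    my ($text, $n) = @_;
    my %count;    # context => { next_letter => occurrences }
    for my $i (0 .. length($text) - $n) {
        my $context = substr $text, $i, $n - 1;
        my $next    = substr $text, $i + $n - 1, 1;
        $count{$context}{$next}++;
    }
    return \%count;
}
```

Only the contexts that actually occur take up memory, which is exactly the sparsity advantage a hash has over the N-dimensional array.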

NiJo,
There is value in what you are suggesting  in fact, I suggested the very same thing to Lawliet before working out my solution. Let me explain why it isn't as great an idea as you might think. After each guess (right or wrong), the candidate list is pruned to only the possible remaining words. If 'q' is guessed and is correct, then the next time around, 'u' will already jump to the top of the list of letters with the most probability to be correct. You don't need to know any special knowledge about letter frequency (alone or in chains) to get this benefit.
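The pruning step described here might be sketched as follows, assuming the revealed pattern uses '*' for unknown letters as in the original post; the helper name is illustrative:

```perl
use strict;
use warnings;

# Prune a candidate list (array ref) after guessing $letter. $pattern is
# the revealed state, e.g. '**u*'; an undef pattern means the guess was
# wrong, so every word containing the letter is dropped.
sub prune {
    my ($candidates, $letter, $pattern) = @_;
    if (!defined $pattern) {
        return [ grep { index($_, $letter) < 0 } @$candidates ];
    }
    # Right guess: keep words whose letter positions match exactly.
    # A '*' position must NOT hold the guessed letter, or the reveal
    # would have shown it there too.
    (my $re = $pattern) =~ s/\*/[^$letter]/g;
    return [ grep { /^$re\z/ } @$candidates ];
}
```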
In other words, the data you are presented with already takes into consideration only the words that can possibly fit given the letters provided, and frequency analysis has already been done on them. If you have something else in mind, though, I am both open to and interested in hearing it.

Re: Challenge: Design A Better Hangman Algorithm by QM (Parson) on Jul 15, 2009 at 21:18 UTC 
As an aside, or perhaps as Part II of the challenge: consider what happens when the "opponent" chooses words based on their degree of guessing difficulty (whether it's known that a certain guessing algorithm is employed, or just assuming generic heuristics about guessing).
For instance, when I've played casually with someone the first time, I usually pick "rhythm", since it has no conventional vowels. (This only works once though). After that, I would try a word from a set of words of the same length with many letters/positions in common, but not just a simple substitution. (This is harder as word length goes up.)
Can the algorithm have a special mode to counter biased word choice?
QM

Quantum Mechanics: The dreams stuff is made of

QM,
For the record, my algorithm(s) wins on the word 'rhythm' but with only 1 wrong guess to spare (6 used and 7 is default). I did indicate that you could preprocess a dictionary to optimize your algorithm. In my case, I know that with my 'simple' solution, I lose on 2254 of the 60K words tested.
These 2254 words then become what I believe you are considering biased, and they are the very ones I am hoping people come up with a solution for. If I use an alternate algorithm against just these words, I can win 787 but still lose 1467. As JavaFan pointed out elsewhere in that thread, for certain words, guessing is all you can do. I have absolutely no problem if you can figure out that a word is biased early on and use an alternate algorithm for it. My issue is that the algorithm that works better on these biased words doesn't work as well on normal words.
In short, the answer is yes if you can figure out how.

I've implemented two algorithms. 1) Look at which letter(s) appear in the most words that are still possible, and pick the one that appears in the most; break ties randomly. 2) Look at all the letters that appear in the words that are still possible, and pick one weighted by frequency (so, if there are 15 words left, 'l' appears in 10, and 'm' in 5, then the chance of picking 'l' is twice the chance of picking 'm').
Using this with a dictionary of 355k words, algorithm 1) always finds 'rhythm' with no guesses to spare (7 failures). Algorithm 2) sometimes fails to find 'rhythm' in 7 failures or fewer, but its average number of failures is lower (6.60). This surprised me.
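Algorithm 2) amounts to a weighted random pick. A minimal sketch, assuming the per-letter word counts have already been gathered; the helper name is mine:

```perl
use strict;
use warnings;

# Pick a letter at random, weighted by how many remaining candidate
# words contain it (hash ref of letter => word count).
sub weighted_pick {
    my ($counts) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    my $roll = rand($total);
    for my $letter (sort keys %$counts) {   # sorted for a stable walk order
        return $letter if ($roll -= $counts->{$letter}) < 0;
    }
}
```

With 'l' in 10 words and 'm' in 5, 'l' comes back roughly two-thirds of the time, matching the 2:1 odds described above.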
Re: Challenge: Design A Better Hangman Algorithm by Kache (Initiate) on Dec 12, 2012 at 14:29 UTC 
Stumbled on this (and the previous) old thread. Thought I could provide some input even if the thread is dead.
Strategy in this thread up to this point:
 Find all possible candidate words, given correct guesses so far. e.g.: If the word is "PerlMonks" and we've got "Pe___o_ks", candidate words will match the regex "Pe...o.ks" perfectly.
 Determine which unguessed letter appears in the most candidate words. Break ties in favor of the letter that occurs the most times overall (i.e., density).
 Discussion of using probabilistic/heuristic analyses, with undetermined efficacies.
I would suggest one more deterministic strategy before getting to (3): entropy and distribution.
Suppose we have the following four candidate words: aacd, aaef, bbgh, bibj. Using the rules listed in (2), letters 'a' and 'b' are still tied.
However, it's clear that guessing 'b' will give us more information than 'a': if correct, guessing 'b' would reveal the solution, while guessing 'a' would not. This is because the positions of 'b' vary from word to word, allowing step (1) in the next guessing round to narrow down the candidate words even faster.
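This preference can be made deterministic by partitioning the candidates by the position pattern a correct guess would reveal, and comparing worst cases, essentially the maximum_num_of_candidates_remaining_if_right figure from the original post. A sketch, with a helper name of my own:

```perl
use strict;
use warnings;

# For a given letter, group the candidate words by the position pattern
# a correct guess would reveal, and return the size of the largest
# group: the worst-case number of candidates left if the guess is right.
sub max_remaining_if_right {
    my ($candidates, $letter) = @_;
    my %bucket;    # position pattern => count of words sharing it
    for my $word (@$candidates) {
        my @chars = split //, $word;
        my $pattern = join '', grep { $chars[$_] eq $letter } 0 .. $#chars;
        $bucket{$pattern}++ if $pattern ne '';   # skip words lacking the letter
    }
    my ($max) = sort { $b <=> $a } values %bucket;
    return $max // 0;
}
```

On the example above, a correct 'a' always leaves both a-words (pattern '01' twice), while a correct 'b' pins the word down uniquely ('01' vs '02'), so the tie breaks toward 'b'.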

