http://www.perlmonks.org?node_id=1031775

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
I have this really weird idea in my head that refuses to fully form so I can't explain it. As I was falling asleep last night, I realized that I could ask my question without explaining my idea by relating it to something I do understand.

Background

Two months after I started learning perl, I wrote AI Animals. The idea is simple. You have a tree where nodes can either be yes/no questions or guesses. If a guess is determined to be incorrect, the node is replaced with a yes/no question that tells the difference between the guessed animal and the correct animal. If you need more background/explanation, please go visit that node.

Challenge

You have been invited to particpate in a contest to devise the optimal starting tree for the game. Each participant is given a file that contains a million previous played games from an unknown number of different starting trees. You have 1 week to analyze the data and build your initial tree. The day of the contest 50,000 previous games will be randomly selected and the winner will be the one that asks the fewest number of questions to correctly guess all the animals.

You immediately recognize a few flaws that you can exploit to guarantee yourself the win. First, no unknown animals will be introduced. Second, while the file contains a million previous games, there is only a small number of unique animals ever chosen (let's say 100 for our purposes). Finally, you realize that you can tell the frequency that each animal will be chosen by how frequently it was previously chosen. All you need to do is structure the tree in a way that the number of questions to guess a given animal is inversly proportional to how frequently it has been chosen. Once you know where the animals must appear in the tree, you can come up with the questions by hand.

Setting aside the "questions" portion since my problem really has nothing to do with the game: How can I figure out the optimal tree assuming I have a small fixed number of items to identify and the frequency that I will need to identify them?

Cheers - L~R

  • Comment on Challenge: Optimal Animals/Pangolins Strategy

Replies are listed 'Best First'.
Re: Challenge: Optimal Animals/Pangolins Strategy
by roboticus (Chancellor) on May 02, 2013 at 16:03 UTC
      roboticus,
      Maybe - I am not sure. The idea that I am not able to explain has to do with compression. I understand Huffman coding which is where my idea came from. In my head it is different so if the example I came up with is equivalent to Huffman coding I have failed.

      Update: I am now pretty sure you are right. I still can't explain why what I am thinking about is different but this may help: In addition to coding for individual characters, I want to use any nodes that only have 1 leaf entry to code for character tuples. This is done AFTER the Huffman coding.

      Cheers - L~R

        Limbic~Region:

        Here's kind of what I was thinking. The Huffman tree keeps the most common animals (fish, dog, cat) at the top of the tree with few questions, and the unusual critters (unicorn, pegasus, axolotl) at the bottom of the tree.

        Is this something along the lines of what you're after?

        #!/usr/bin/perl use strict; use warnings; use Data::Dumper; # Read list of animals/occurrences into @T my @T; my $ttl=0; while (<DATA>) { next if /^\s*$/; my ($animal, $occurrences) = split /\s+/, $_; $ttl += $occurrences; my $t = bless [ $animal, $occurrences ], "Animal"; insert( $t ); } print "Total occurrences: $ttl\n"; # Turn it into a huffman tree while (@T > 1) { my $L = shift @T; my $R = shift @T; my $t = bless [ [$L, $R], $L->[1]+$R->[1] ], "Question"; insert( $t ); } #print Dumper($T[0]); my $avg = compute_avg_num_guesses($T[0]) / $ttl; print "Avg: $avg\n"; sub compute_avg_num_guesses { my $t = shift; my $question_depth = shift // 0; if (ref $t eq "Animal") { my $num_questions = $t->[1] * $question_depth; print "Animal: $t->[0], depth $question_depth * occurrences $t +->[1] ==> $num_questions\n"; return $num_questions; } else { my $sum_questions = compute_avg_num_guesses($t->[0][0], $quest +ion_depth+1); $sum_questions += compute_avg_num_guesses($t->[0][1], $questio +n_depth+1); return $sum_questions; } } sub insert { # Inserting in order simplifies the conversion into a huffman tree +... my $t = shift; push @T, $t; my $idx = $#T; while ($idx) { ($T[$idx],$T[$idx-1]) = ($T[$idx-1],$T[$idx]) if $T[$idx][1] < + $T[$idx-1][1]; --$idx; } } __DATA__ fish 150 horse 62 dog 85 cat 91 frog 28 axolotl 1 gerbil 3 hamster 5 ocelot 5 platypus 3 seal 18 badger 17 walrus 15 wolf 55 wolverine 22 sheep 60 cow 75 unicorn 1 pegasus 1

        Running it gives me:

        $ perl huffman.pl Total occurrences: 697 Animal: walrus, depth 5 * occurrences 15 ==> 75 Animal: badger, depth 5 * occurrences 17 ==> 85 Animal: seal, depth 5 * occurrences 18 ==> 90 Animal: pegasus, depth 8 * occurrences 1 ==> 8 Animal: axolotl, depth 9 * occurrences 1 ==> 9 Animal: unicorn, depth 9 * occurrences 1 ==> 9 Animal: hamster, depth 7 * occurrences 5 ==> 35 Animal: ocelot, depth 7 * occurrences 5 ==> 35 Animal: gerbil, depth 8 * occurrences 3 ==> 24 Animal: platypus, depth 8 * occurrences 3 ==> 24 Animal: cow, depth 3 * occurrences 75 ==> 225 Animal: fish, depth 2 * occurrences 150 ==> 300 Animal: dog, depth 3 * occurrences 85 ==> 255 Animal: cat, depth 3 * occurrences 91 ==> 273 Animal: wolverine, depth 5 * occurrences 22 ==> 110 Animal: frog, depth 5 * occurrences 28 ==> 140 Animal: wolf, depth 4 * occurrences 55 ==> 220 Animal: sheep, depth 4 * occurrences 60 ==> 240 Animal: horse, depth 4 * occurrences 62 ==> 248 Avg: 3.45050215208034

        So if I haven't screwed up, it looks like given the stated distributions and guesses, that you'd need 3.45 guesses on average.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Challenge: Optimal Animals/Pangolins Strategy
by BrowserUk (Patriarch) on May 02, 2013 at 16:34 UTC

    You just end up with a really lopsided tree:

    #! perl -slw use strict; use Data::Dump qw[ pp ]; use List::Util qw[ reduce ]; our $N //= 1; my %byFreq = map{ int( rand 1000 ) => $_ } 'A'x$N .. 'Z'x$N; pp \%byFreq; my @tree; reduce{ $a->[0] = $byFreq{ $b }; $a->[1] = []; } \@tree, sort{ $a <=> $b } keys %byFreq; pp \@tree;

    Output:


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      You just end up with a really lopsided tree

      Then I must have made a mistake in my explanation because the solution should be closer to Huffman coding as roboticus astutely concluded above. Since I can't explain why what I am trying to accomplish is slightly different, I guess I am stuck.

      Cheers - L~R

        Well, you mentioned proportional which I interpreted to mean that higher frequencies should take longer to reach than lower frequencies, which barring the possibility of equal frequencies, a lop-sided tree achieves.

        (Albeit you said inversely proportional which would mean reversing the order of the sort from what I posted.)

        The only other sense I can get from the information provided -- brought on by the mention of Huffman -- is that you are perhaps looking to minimise the depth of the tree. This does that by building a heap and then converting it to a tree rather clumbsily. Though that could be fixed if the idea is right:

        #! perl -slw use strict; use Data::Dump qw[ pp ]; use List::Util qw[ reduce ]; use enum qw[ NAME FREQ LEFT RIGHT ]; our $N //= 1; my$n = 0; my @heap = map{ $_->[LEFT] = ++$n; $_->[RIGHT] = ++$n; $_; } sort { $a->[FREQ] <=> $b->[FREQ] } map[ $_ , int( rand 1000 ) ], 'A'x$N .. 'Z'x$N;; my @tree = map { $_->[LEFT] = $heap[ $_->[LEFT] ], $_->[RIGHT] = $heap[ $_->[RIGHT] ] } @heap; pp \@tree;

        Output:

        C:\test>1031775.pl do { my $a = [ [ "Y", 1, [ "G", 4, [ "D", 166, ["X", 245, ["I", 516, undef, undef], ["O", 563, undef, undef +]], ["K", 315, ["M", 628, undef, undef], ["R", 710, undef, undef +]], ], [ "A", 218, ["P", 324, ["T", 731, undef, undef], ["Q", 732, undef, undef +]], ["J", 374, ["V", 735, undef, undef], ["C", 835, undef, undef +]], ], ], [ "E", 33, [ "U", 220, ["L", 393, ["S", 845, undef, undef], ["F", 930, undef, undef +]], ["W", 471, ["Z", 944, undef, undef], undef], ], ["H", 228, ["B", 507, undef, undef], ["N", 515, undef, undef]] +, ], ], 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', 'fix', ]; $a->[1] = $a->[0][2]; $a->[2] = $a->[0][3]; $a->[3] = $a->[0][2][2]; $a->[4] = $a->[0][2][3]; $a->[5] = $a->[0][3][2]; $a->[6] = $a->[0][3][3]; $a->[7] = $a->[0][2][2][2]; $a->[8] = $a->[0][2][2][3]; $a->[9] = $a->[0][2][3][2]; $a->[10] = $a->[0][2][3][3]; $a->[11] = $a->[0][3][2][2]; $a->[12] = $a->[0][3][2][3]; $a->[13] = $a->[0][3][3][2]; $a->[14] = $a->[0][3][3][3]; $a->[15] = $a->[0][2][2][2][2]; $a->[16] = $a->[0][2][2][2][3]; $a->[17] = $a->[0][2][2][3][2]; $a->[18] = $a->[0][2][2][3][3]; $a->[19] = $a->[0][2][3][2][2]; $a->[20] = $a->[0][2][3][2][3]; $a->[21] = $a->[0][2][3][3][2]; $a->[22] = $a->[0][2][3][3][3]; $a->[23] = $a->[0][3][2][2][2]; $a->[24] = $a->[0][3][2][2][3]; $a->[25] = $a->[0][3][2][3][2]; $a; }

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        /div

        Aren't Huffman trees generally expected to be lop-sided?

Re: Challenge: Optimal Animals/Pangolins Strategy
by QM (Parson) on May 02, 2013 at 18:04 UTC
    I think the previous commenters overlooked the questions!

    While the previous games will certainly have named animals (and thus frequencies), where do the yes/no questions get involved?

    All of the previous games would also record the questions. It is possible that the questions are not optimal (or even relevant), so a better approach might be a list of attributes for each possible animal, independent of its history in previous games.

    For practical purposes, besides yes/no, these games usually allow "doesn't matter" and "I don't know".

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      QM,
      Well, I did say:

      Setting aside the "questions" portion since my problem really has nothing to do with the game

      The point is that once you have an optimal tree, you can find the questions that work.

      Cheers - L~R

        "The point is that once you have an optimal tree, you can find the questions that work"

        I am not sure this assumption is true, unless you allow questions like:

        Is it in the group 'cow, fish, chicken, rhinoceros, amoeba, phoenix, suckling pigs, Those that tremble as if they were mad?
        If we are limited to more normal questions such as 'Does it live in water', 'Is it a mammal' Then will we always be able to find a question to divide each set of candidates? It looks like you are going to need to analyse all questions asked before and find the ones that split the list of all animals most optimally for your tree, probably adapting the tree to fit the possible questions as you go. Or interface to Wikipedia and get real AI like.

        Cheers,
        R.

        Pereant, qui ante nos nostra dixerunt!
        Yep, my bad.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      @QM:
      «I think the previous commenters overlooked the questions!...»

      May be this is true. And i'm repeating myself and this isn't my very best day. But i don't give up. Else i won't sleep tonight.

      So please tell us: what is your solution?

      Karl

      «The Crux of the Biscuit is the Apostrophe»

        I didn't have a solution in mind. There's the simplistic AI that asks you for a distinguishing question and records the answer every time it guesses wrong. To get beyond that you would need a matrix of questions, and answers for each animal; or otherwise generate a semantic database and minimize the categories and question tree. I think it ends up being something like generating a minimal cover set (but I'm fuzzy on this).

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Re: Challenge: Optimal Animals/Pangolins Strategy
by Anonymous Monk on May 03, 2013 at 12:40 UTC
    This is the Huffman problem, but of a taxonomy tree. "Simple Huffman" will tell you what is the shortest possible tree, but taxonomy rules demand that Orangutans and Goldfish can't be in the same branch. The problem is to develop a taxonomy which will produce the shortest possible Huffman tree that takes those taxonomy rules into account. The exact nature of the allowed questions – a yes-or-no question is binary, others are not – is also essential to know.