Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Re^4: Challenge: 8 Letters, Most Words

by Limbic~Region (Chancellor)
on Oct 04, 2013 at 21:11 UTC ( #1056938=note: print w/ replies, xml ) Need Help??


in reply to Re^3: Challenge: 8 Letters, Most Words
in thread Challenge: 8 Letters, Most Words

CountZero,
I am going to write a brute force solution this weekend but your math isn't quite right. First, you don't need all 26^8th since order doesn't matter. The upper bound is less than 5.2 million. We can reduce it further if we consider the maximum number of times a given letter can repeat. We do not have to consider a worst-case scenario dictionary so I think I can get that 5.2 million down to around 3 million. We still need to compare each of those 3 million against tens of thousands of words so I will be using C to do it but I think run time will be much less than 24 hours.

Update: My math was off somewhere. The upper bound was 13_884_156 and I was only able to reduce it down to 12_461_993. Since my word list contains 40,933 words (filtered down from 81_536), that means I am up to 510_106_759_469 checks. In order to finish in under 24 hours, I will need to be able to do 5.9 million checks per second. I hope my C skills are up for the challenge (stay tuned).

Cheers - L~R


Comment on Re^4: Challenge: 8 Letters, Most Words
Re^5: Challenge: 8 Letters, Most Words
by CountZero (Bishop) on Oct 05, 2013 at 08:27 UTC
    I did some more analysing the 2of12inf.txt dictionary and for words of maximum 8 characters, this is the maximum number of the same characters that can exist in any one word:
    a => 4, b => 3, c => 3, d => 4, e => 4, f => 4, g => 4, h => 3, i => 3, j => 2, k => 3, l => 4, m => 3, n => 4, o => 4, p => 3, q => 1, r => 4, s => 5, t => 3, u => 4, v => 2, w => 3, x => 2, y => 3, z => 4,
    In other words, choose all combinations of any 8 chars out of the following string: "aaaabbbcccddddeeeeffffgggghhhiiijjkkkllllmmmnnnnoooopppqrrrrssssstttuuuuvvwwwxxyyyzzzz" and run your dictionary against each of these and your search is exhaustive.

    Of course that is still about 2.14 E15 combinations to check, but many of these will be the same. So, perhaps someone with more mathematical skills than I can write an efficient generator for these combinations that does not have to go through them all to weed out the duplicates.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      CountZero,
      I did some more analysing the 2of12inf.txt dictionary and for words of maximum 8 characters, this is the maximum number of the same characters that can exist in any one word:

      I had already done that which is where my math was coming from.

      Of course that is still about 2.14 E15 combinations to check, but many of these will be the same. So, perhaps someone with more mathematical skills than I can write an efficient generator for these combinations that does not have to go through them all to weed out the duplicates.

      Have you looked at my code? I already indicated that doing this reduces the number of 8 letter words that needs to be checked down to 12,461,993. The reason it grows back up to 510,106,759,469 loops/checks is because you still have to check each one of the 12.5 million against the 40K actual words. I was able to do this using C in 12 minutes.

      Cheers - L~R

        Your C-skills are legendary.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        My blog: Imperial Deltronics
      > In other words, choose all combinations of any 8 chars out of the following string: "aaaabbbcccddddeeeeffffgggghhhiiijjkkkllllmmmnnnnoooopppqrrrrssssstttuuuuvvwwwxxyyyzzzz"

      The point is to handle duplications, if all letters are unique the answer is simply (26 over 8)

      DB<288> sub fac {my $x=1; $x*=$_ for 2..$_[0]; $x} DB<289> sub binom { my ($n,$k)=@_; fac($n)/(fac($n-$k)*fac($k)) } DB<290> binom 26,8 => 1562275

      Reasoning is simple, it calculates all binary vectors of length 26 with exactly 8 1-bits.

      But with duplications its more complicated, e.g. 4 out of "aabcd" is not (5 over 4)=5

      a bcd abcd aa cd aab d aabc

      cause the first 2 solutions are identical.

      Generating all combinations and filtering the unique once is normally not very clever, cause the overhead can be enormous.

      DB<292> length "aaaabbbcccddddeeeeffffgggghhhiiijjkkkllllmmmnnnnoooo +pppqrrrrssssstttuuuuvvwwwxxyyyzzzz" => 86 DB<293> binom 86,8 => "53060358690"

      And I'm stuck finding a formula which calculates all unique solutions, but generating is easier, just don't allow bitvectors with "gaps" between identical letters:

      so

      "aaaabbbc..." # pattern "1000100...." # ok => ab... "1100100...." # ok => aab... "1001100.... # not ok => aab...

      I will update this post with a loop generating all possibilities soon.

      update

      indeed L~R's number of 12461993 possibilities is correct

      The following (non-optimized) code took 3 minutes to calculate them:

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        LanX,
        The C code I wrote to calculate it is below. It finished nearly instantly. I was trying to determine how to write the rest of the code (could it fit in memory). It also ended up being part of my final solution since I realized it was unnecessary to keep them in memory using a high water mark algorithm.

        Cheers - L~R

        LanX,
        And I'm stuck finding a formula which calculates all unique solutions

        If a letter was allowed to repeat up to the maximum 8 times, the formula is:

        (n + (k - 1))! -------------- k! (n - 1)! 33! -------- = 13_884_156 8! * 25!

        I am not sure how to get from 13_884_156 to 12_461_993 strictly through calculation but it seems like it should be possible.

        Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1056938]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (9)
As of 2014-09-02 19:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (29 votes), past polls