Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re^3: Challenge: 8 Letters, Most Words

by CountZero (Bishop)
on Oct 04, 2013 at 20:52 UTC ( #1056932=note: print w/ replies, xml ) Need Help??


in reply to Re^2: Challenge: 8 Letters, Most Words
in thread Challenge: 8 Letters, Most Words

You are right, because I check against all 8 character words and 'aaaaaa' only has six and therefore it is not considered. To be absolutely certain one would have to check against all combinations of 8 characters, i.e. checking your whole dictionary against each of 208 827 064 576 combinations.

So my solution answers the question with one extra condition: that the 8 characters together form a word too.

And I seem to have an "off by one error" somewhere.

CountZero

A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics


Comment on Re^3: Challenge: 8 Letters, Most Words
Re^4: Challenge: 8 Letters, Most Words
by McA (Deacon) on Oct 04, 2013 at 21:06 UTC

    A ++ for admitting that there is missing something. That shows IMO real greatness.

    Best regards
    McA

      Update: Amongst other possible errors, this rejects words that have more than 8 letters; whereas it should only reject words with more than 8 unique letters!

      The corrected version is much slower. (And still running!)

      BTW: Where is your (attempt at) a solution? Or are you one of those that can only get their jollies critiquing others honest attempts?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Hi BrowserUK,

        my approach is still running. I'll present my "solution" as soon as I have the feeling I did it right. My solution is a real brute force solution. I will last some time as I have to loop over 13,884,156 permutations iterating over 35016 unique words.

        I hope I'll be right. ;-)

        McA

Re^4: Challenge: 8 Letters, Most Words
by Limbic~Region (Chancellor) on Oct 04, 2013 at 21:11 UTC
    CountZero,
    I am going to write a brute force solution this weekend but your math isn't quite right. First, you don't need all 26^8th since order doesn't matter. The upper bound is less than 5.2 million. We can reduce it further if we consider the maximum number of times a given letter can repeat. We do not have to consider a worst-case scenario dictionary so I think I can get that 5.2 million down to around 3 million. We still need to compare each of those 3 million against tens of thousands of words so I will be using C to do it but I think run time will be much less than 24 hours.

    Update: My math was off somewhere. The upper bound was 13_884_156 and I was only able to reduce it down to 12_461_993. Since my word list contains 40,933 words (filtered down from 81_536), that means I am up to 510_106_759_469 checks. In order to finish in under 24 hours, I will need to be able to do 5.9 million checks per second. I hope my C skills are up for the challenge (stay tuned).

    Cheers - L~R

      I did some more analysing the 2of12inf.txt dictionary and for words of maximum 8 characters, this is the maximum number of the same characters that can exist in any one word:
      a => 4, b => 3, c => 3, d => 4, e => 4, f => 4, g => 4, h => 3, i => 3, j => 2, k => 3, l => 4, m => 3, n => 4, o => 4, p => 3, q => 1, r => 4, s => 5, t => 3, u => 4, v => 2, w => 3, x => 2, y => 3, z => 4,
      In other words, choose all combinations of any 8 chars out of the following string: "aaaabbbcccddddeeeeffffgggghhhiiijjkkkllllmmmnnnnoooopppqrrrrssssstttuuuuvvwwwxxyyyzzzz" and run your dictionary against each of these and your search is exhaustive.

      Of course that is still about 2.14 E15 combinations to check, but many of these will be the same. So, perhaps someone with more mathematical skills than I can write an efficient generator for these combinations that does not have to go through them all to weed out the duplicates.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
        CountZero,
        I did some more analysing the 2of12inf.txt dictionary and for words of maximum 8 characters, this is the maximum number of the same characters that can exist in any one word:

        I had already done that which is where my math was coming from.

        Of course that is still about 2.14 E15 combinations to check, but many of these will be the same. So, perhaps someone with more mathematical skills than I can write an efficient generator for these combinations that does not have to go through them all to weed out the duplicates.

        Have you looked at my code? I already indicated that doing this reduces the number of 8 letter words that needs to be checked down to 12,461,993. The reason it grows back up to 510,106,759,469 loops/checks is because you still have to check each one of the 12.5 million against the 40K actual words. I was able to do this using C in 12 minutes.

        Cheers - L~R

        > In other words, choose all combinations of any 8 chars out of the following string: "aaaabbbcccddddeeeeffffgggghhhiiijjkkkllllmmmnnnnoooopppqrrrrssssstttuuuuvvwwwxxyyyzzzz"

        The point is to handle duplications, if all letters are unique the answer is simply (26 over 8)

        DB<288> sub fac {my $x=1; $x*=$_ for 2..$_[0]; $x} DB<289> sub binom { my ($n,$k)=@_; fac($n)/(fac($n-$k)*fac($k)) } DB<290> binom 26,8 => 1562275

        Reasoning is simple, it calculates all binary vectors of length 26 with exactly 8 1-bits.

        But with duplications its more complicated, e.g. 4 out of "aabcd" is not (5 over 4)=5

        a bcd abcd aa cd aab d aabc

        cause the first 2 solutions are identical.

        Generating all combinations and filtering the unique once is normally not very clever, cause the overhead can be enormous.

        DB<292> length "aaaabbbcccddddeeeeffffgggghhhiiijjkkkllllmmmnnnnoooo +pppqrrrrssssstttuuuuvvwwwxxyyyzzzz" => 86 DB<293> binom 86,8 => "53060358690"

        And I'm stuck finding a formula which calculates all unique solutions, but generating is easier, just don't allow bitvectors with "gaps" between identical letters:

        so

        "aaaabbbc..." # pattern "1000100...." # ok => ab... "1100100...." # ok => aab... "1001100.... # not ok => aab...

        I will update this post with a loop generating all possibilities soon.

        update

        indeed L~R's number of 12461993 possibilities is correct

        The following (non-optimized) code took 3 minutes to calculate them:

        Cheers Rolf

        ( addicted to the Perl Programming Language)

Re^4: Challenge: 8 Letters, Most Words
by choroba (Abbot) on Oct 04, 2013 at 21:57 UTC
    I am getting somehow similar results. The difference is, I guess, that I initialize the count with the number of occurrences of the combination before adding the shorter words. The script runs for 3min26s, the output ends with
    adeimrst: 310 aeilnrst: 312 adeinrst: 313 aeloprst: 313 aeilnpst: 314 adeoprst: 316 adeilrst: 319 aeimprst: 328 adeiprst: 332 aeilprst: 344 aeinprst: 346

    The script:

    Update: Checking the result:

    $ grep -E '^[aeinprst]{1,8}$' 2of12inf.txt | grep -vE '(.).*\1' | wc +-l 346

    Update 2: Testing duplicate characters:

    $ grep -E '(.).*\1.*:' 1056884.out | tail -n1 aeiprsst: 279 $ grep -E '^[aeiprsst]{1,8}$' 2of12inf.txt | grep -Ev '([aeiprt]).*\1| +s.*s.*s' | wc -l 279
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      choroba,
      I am bleary eye tired but I wanted to review your code because you mentioned it didn't consider all combinations - only possible combinations.

      If I have understood what you have done, this is not guaranteed to produce the correct answer on all dictionary files. It happens to get the right answer for 2of12inf.txt however. You appear to only be considering the combinations of letters of individual words rather than considering taking some from one word and some from another. Consider the following dictionary:

      abcd acdb adbc dabc bcad efgh fgeh hegf egfh fegh abcdxy efghlm
      The obvious answer is 'abcdefgh' which will get 10 words but your code chooses both 'abcdxy' and 'efghlm' with 6 words each.

      Cheers - L~R

        One should probably postprocess the results if the winner is shorter than 8. I will look into it later.
        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1056932]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (12)
As of 2014-07-14 11:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (257 votes), past polls