http://www.perlmonks.org?node_id=1056929


in reply to Challenge: 8 Letters, Most Words

Ok, I started with 2of12full.txt, removed anything with a capital letter (I'm assuming those are abbreviations or dupes) or non-alphabet characters (hyphenations, abbreviations, etc.), and removed all words of less than 2 characters or more than 8. I then ran this:
use strict; use warnings; $| = 1; my (%wc1, %wc2, $c); open (IN, '2-8-words.txt') || die; while (<IN>) { chomp; $_ = join '', sort split //; $wc1{$_}++; } open (IN, '2-8-words.txt') || die; while (<IN>) { chomp; $_ = join '.*?', sort split //; for my $s (keys %wc1) { $wc2{$s}++ if $s =~ /$_/; } print '.' if ++$c % 1000 == 0; } print "\n\n"; $c = 0; for (sort { $wc2{$b} <=> $wc2{$a} } keys %wc2) { print "$_ : $wc2{$_}\n"; last if ++$c == 50; }
Output:
aeilprst : 239 aeilnpst : 225 aeloprst : 220 aeilnrst : 219 aceiprst : 216 aeilmnrt : 207 aelprsty : 207 acenorst : 202 aeiprsty : 201 acelprst : 201 adeinrst : 199 acehorst : 198 aegilnrt : 197 aeinorst : 197 aeinoprt : 195 aeilnort : 194 adeilort : 191 aceinrst : 190 aelmoprt : 188 adeinort : 187 aegimnrt : 185 aemprstu : 184 adeilmst : 184 aceloprt : 182 adeginrt : 182 aceilnrt : 180 aceilprt : 180 aceimrst : 180 aegilnst : 179 acehnrst : 176 abeinrst : 175 aceilmrt : 175 eimoprst : 175 aceinort : 173 adeimort : 172 aeilnrtv : 172 adelorst : 172 aehmoprt : 171 aceilprs : 171 aehloprt : 171 acdehort : 170 abelrstu : 170 adeiorst : 170 aeoprstt : 169 acehiprs : 169 abelopst : 168 ademnopr : 168 adeilmry : 168 aeiprstv : 168 aeghorst : 167
This should be 100% accurate with any given word list (of a-z only), though it does take a little while to run, and if anyone can think of a way to speed it up without compromising accuracy, by all means do so.

Replies are listed 'Best First'.
Re^2: Challenge: 8 Letters, Most Words
by Limbic~Region (Chancellor) on Oct 04, 2013 at 20:55 UTC
    TJPride,
    You filtered things out that shouldn't have been (capitalization/punctuation don't matter for instance). Here is the code that I am using to filter my list:
    #!/usr/bin/perl use strict; use warnings; my %seen; open(my $fh, '<', '2of12inf.txt') or die "Unable to open '2of12inf.txt +' for reading: $!\n"; while (<$fh>) { chomp; $_ = lc($_); tr/a-z//cd; next if ! $_ || length($_) > 8 || $seen{$_}++; print "$_\n"; }

    This should be 100% accurate with any given word list

    I will be using the same dictionary with my brute force solution so I will let you know though I won't be filtering out all the words you have.

    Cheers - L~R

      Capitalization may not matter, but dupes do, and there are often multiple versions of the same word. I suppose I could have just lowercased everything and then deduped. The word list is largely irrelevant, though - the important thing is the algorithm.

      What do you mean by brute force, exactly? There are far too many possible letter combinations to just check each one.

        TJPride,
        What do you mean by brute force, exactly? There are far too many possible letter combinations to just check each one.

        As I pointed out elsewhere in this thread, you don't need to check all 26^8 possibilities since order of letters doesn't matter. That leaves 13_884_156 possible 8 letter combinations. Since I am using a specific dictionary and no word allows a single letter to repeat more than 5 times, I can further reduce that down to 12_461_993. Since my word list contains 40,933 words, that means I am up to 510_106_759_469 checks. In order to finish in under 24 hours, I will need to be able to do 5.9 million checks per second. I hope my C skills are up for the challenge (stay tuned).

        Cheers - L~R

      The '2of12inf.txt' already has all these filtered out, but some words have a '%' added (I do not know why, but you will have to delete the '%').

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
Re^2: Challenge: 8 Letters, Most Words
by McA (Priest) on Oct 04, 2013 at 20:50 UTC

    Hi,

    you print a list, but what is the 8 character sequence which solves the problem? Is it the first one?

    McA

      Well yes, of course.

        Hi,

        I'm asking because the output of your program does IMHO not give a hint of the right solution for the following dict:

        aa aab baa aba ccc cdc dcc ccd rs sr qt tq uv vu wx xw

        What do you think?

        Best regards
        McA