Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
BrowserUk,
Seems to me that you've fallen into a pattern of looking at things in terms of combinations, permutations & powersets et al.

You are absolutely correct. I did look at the problem from many different sides but, with the subconscious objective of precomputing everything, they all turned out to be different variations on the same theme.

You could build a trie, but they are not efficient built in terms of perl's available data structures.

Assuming that the word list file will remain fairly static and assuming it transforms into a data structure (trie or trie like) small enough that can stay memory resident, this seems like a reasonable approach. Using the word list you linked to earlier (TWL06.txt) with 178,590 eligible words, I use just under 80MB with the following solution:

#!/usr/bin/perl use strict; use warnings; use Getopt::Std; use Storable; use Time::HiRes qw/gettimeofday tv_interval/; my %opt; get_args(\%opt); if (defined $opt{w}) { my %data; open(my $fh, '<', $opt{w}) or die "Unable to open '$opt{w}' for re +ading: $!"; while (<$fh>) { chomp; next if length($_) < 3 || /[^a-zA-Z]/; $_ = lc($_); eval join('', 'push @{$data', (map {"{$_}"} sort split //, $_) +, "{words}}, '$_';"); } store(\%data, $opt{d}) or die "Can't store '%data' in '$opt{d}'\n" +; } my $data = retrieve($opt{d}); die "Unable to retrieve from '$opt{d}'\n" if ! defined $data; my $str = join('', map {('a' .. 'z')[rand 26]} 1 .. $opt{n}); print "Working from $str\n"; # Start time my $beg = [gettimeofday]; $str = join('', sort split //, $str); my @work; for (0 .. length($str) - 3) { my $tree = $data->{substr($str, $_, 1)}; push @work, [$tree, substr($str, $_ + 1)] if $tree; } my %seen; while (@work) { my $item = pop @work; my ($data, $str) = @$item; for (@{$data->{words} || []}) { print "$_\n" if ! $seen{$_}++; } my $last_pos = length($str) - 1; for (0 .. $last_pos) { my $tree = $data->{substr($str, $_, 1)}; next if ! $tree; my $new_str = $_ < $last_pos ? substr($str, $_ + 1) : ''; push @work, [$tree, $new_str]; } } my $end = [gettimeofday]; print "Found ", scalar keys %seen, " words in ", tv_interval($beg, $en +d), " seconds\n"; sub get_args { my ($opt) = @_; my $Usage = qq{Usage: $0 <-d <datafile>> [-n <chars> -w <wordfile> + -h] -h : This help message -d : The (d)atastructure file Note: This will be where the wordlist file is stored aft +er conversion -w : The (w)ordlist file Note: Specify this option to build a new datastructure -n : The (n)umber of random characters to form words from Default: 7 } . "\n"; getopts('hd:w:n:', $opt) or die $Usage; die $Usage if $opt->{h} || ! defined $opt->{d}; $opt->{n} = 7 if ! defined $opt->{n} || $opt->{n} =~ /\D/ || $opt- +>{n} < 3; }

The alternative is to use a set of bitstrings to to index the words containing each of the letters.

I have had a note to go back and figure out what you were doing here for over a year now. Today, I sat down to do just that. Would you mind reviewing what I have and correcting anything I got wrong? Note: I rewrote it in my own style as a mechanism for understanding it.

#!/usr/bin/perl use strict; use warnings; my $file = $ARGV[0] || 'TWL06.txt'; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; my @word; while (<$fh>) { chomp; next if length($_) < 3 || /[^a-zA-Z]/; push @word, lc($_); } my %index; for ('a' .. 'z') { # Bitstring long enough to represent all words - all bits set to 0 $index{$_} = chr(0) x int((@word + 8) / 8); } # For each unique char in every word # Set the bit corresponding to the index of the word to 1 for my $idx (0 .. $#word) { my %seen; for my $chr (sort split //, $word[$idx]) { vec($index{$chr}, $idx, 1) = 1 if ! $seen{$chr}++; } } print "Please enter an input string: "; chomp(my $input = <STDIN>); my @include = split //, $input; my @exclude = grep {! (1 + index($input, $_))} 'a' .. 'z'; # list of l +etters not in input string my $mask = chr(0) x int((@word + 8) / 8); # Turn on bits for all words that have at least 1 letter in common wit +h input string $mask |= $_ for @index{@include}; # Turn off bits for any word that contains at least 1 letter not in th +e input string for (@exclude) { # Words not containing excluded letter (though they may contain le +tters not in input string) my $remain = ~ $index{$_}; # Words that do not have excluded letter but do have letter in com +mon with input string $mask &= $remain; } for my $idx (0 .. $#word) { next if ! vec($mask, $idx, 1); # Not even a candidate next if ! finalCheck($word[$idx], $input); print "$word[$idx]\n"; } sub finalCheck { my ($candidate, $allowed) = @_; for (split //, $candidate) { $allowed =~ s/$_// or return; # return if the letter does not +remain in the candidate list } return 1; }

Assuming I understood it correctly, there isn't a lot of room for optimizations. Instead of recreating the zeroed bitstring 27 times, just do it once. The finalCheck() could be inlined (or converted to Inline::C). It may be faster to skip candidate words that are longer than the input string. You could also use Storable the same way I did to reduce the constant time construction of the data structure. I feel silly that I didn't spend some time a year ago to try and properly understand this as it is quite beautiful.

Cheers - L~R


In reply to Re^2: Turning A Problem Upside Down by Limbic~Region
in thread Turning A Problem Upside Down by Limbic~Region

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (5)
As of 2024-03-29 08:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found