Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Fast wordlist lookup for game

by jerrygarciuh (Curate)
on Oct 29, 2002 at 22:41 UTC ( #208899=perlquestion: print w/ replies, xml ) Need Help??
jerrygarciuh has asked for the wisdom of the Perl Monks concerning the following question:

Brethren,
I am contemplating making a game which will involve using perl to verify the existance of words in a list of around 50K words. Currently I am assuming this will involve something like this: divide wordlist alphabetically and then use a regex to check for
/\b$word\b/ in the appropriate list. I am assuming that using mySQL to store lists will be faster than using text files but please correct me if that is in error.
Please advise me of your opinions of the fastest way to check for a word in such a large list.
TIA
jg
  _____________________________________________________
"The man who grasps principles can successfully select his own methods.
The man who tries methods, ignoring principles, is sure to have trouble.
~ Ralph Waldo Emerson

Comment on Fast wordlist lookup for game
Select or Download Code
Re: Fast wordlist lookup for game
by gjb (Vicar) on Oct 29, 2002 at 22:47 UTC

    Simply use a hash if memory is not an issue (you should test this.

    my %dict; foreach my $word (@wordlist) { # or read from file ;-) $dict{$word} = 1; } if (exists $dict{'someWord'}) {...}

    Feel free to use (or modify) this code

    .

    Hope this helps, -gjb-

      Thanks for the reply but I have to say that memory always seems to be an issue if one uses expensive methods and the script gets heavily used at all.
      Thanks though,
      jg
        ...oops, I guess I finished my post after you'd already seen a similar answer.

        You're right that it's expensive to keep the list in memory, but it will be the fastest, and we're not talking a lot of memory here (given a list with words on average the size of the words in the Unix dictionary).

        Reading things into the hash is not he most elegant solution, and someone will no doubt come along here in a moment with a suggestion for a module that does more closely what you desire, but I wouldn't discount the simplest approach, unless your program will have other memory intensive techniques that cannot be avoided.

        ...All the world looks like -well- all the world, when your hammer is Perl.
        ---v

        Use a hash to minimize the search time. Keep the hash on disk to minimize the memory usage (and start-up time if this is a CGI app.) I.e. Use a DBM. For speed, you'll probably find BerkeleyDB is best.

        -j
        -sauoq
        "My two cents aren't worth a dime.";
        
Re: Fast wordlist lookup for game
by agentv (Friar) on Oct 29, 2002 at 22:53 UTC
    ...for a list of that size, you could probably read the words into a hash at startup time and then the hash access would take care of the lookup for you without the need to supply significant program logic on your part.

    The reason I say this is that I just looked at the size of /usr/dict/words on a Solaris system and it's about 25000 words in length. It weighs in at around 200000 bytes. This says to me that your hash would be a reasonable size.

    I ran a little test that went like this from the command-line:

    $ perl -de 0 DB<1> open F, "/usr/dict/words" DB<2> @f = <F> - there was a momentary delay - DB<3> print @f + 0 - debugger printed: 25143 -

    I'll see if I can scrap together a little script that does the lookup. Back in a bit.

    ...All the world looks like -well- all the world, when your hammer is Perl.
    ---v

Re: Fast wordlist lookup for game
by Aristotle (Chancellor) on Oct 29, 2002 at 23:12 UTC

    A hash seems reasonable to use at that dictionary size. 50k is nothing.

    Nevertheless, if you're still concerned, I suggest you use grep with eq on the list of words, rather than trying to match a pattern against a long string. That is going to be a lot more efficient.

    Something like this:

    my %word; open my $fh, "<", "/usr/dict/words"; chomp, push @{$word{substr $_, 0, 1}}, $_ while <$fh>; close $fh;
    And then you check using something like:
    if(grep $_ eq $user_input, @{$word{substr $user_input, 0, 1} || []}) { # it's there } else { # no, that wasn't it }
    If you want to ignore case, you'd just use
    chomp, $_ = lc, push @{$word{substr($_, 0, 1)}}, $_ while <$fh>; # and grep lc eq $user_input, @{$word{substr($user_input, 0, 1)} || []}
    respectively.

    Makeshifts last the longest.

Re: Fast wordlist lookup for game
by FamousLongAgo (Friar) on Oct 29, 2002 at 23:15 UTC
    I've been able to use lookup hashes in memory much larger than what you propose without any trouble, and there is nothing faster ( in Perl ) than hash lookup. In fact, I think your idea of using a regex isn't necessary - the plain hash lookup will be your best bet.

    You mentioned being worried about heavy use -- you may want to look into the Memoize module on CPAN - it lets you build a cache of most frequently used words with minimum effort. But I would still try the hash solution before you write too much code - you may be surprised at how little of a memory hit it is.
Re: Fast wordlist lookup for game
by grantm (Parson) on Oct 29, 2002 at 23:27 UTC

    Another approach would be to use a DBM file. You could use say the first three letters of the words as the key, so $dbm{aba} might contain 'abash abate ...'. It would be a less complex solution than mysql and should be faster too. Look at DB_File, GDBM_File, NDBM_File etc.

Re: Fast wordlist lookup for game
by BrowserUk (Pope) on Oct 30, 2002 at 02:20 UTC

    You could take your original idea of splitting the list across files keyed by first letter one step further and split them into subdirs named for the first letter and then files by the second. This would vastly reduce the memory overhead for the look up.

    I did this with my own words file containing over 600,000 words and then wrote a small sub to read the appropriate file and a m// to check for its existance and in a thoroughly unscientific benchmark used it to look up a word in the largest subset of each of the 26 first letters and it came out approximately 20 ms/word which should be fast enough for almost any purpose.

    The program to split the words file

    And the program for doing the lookup.

    #! perl -sw use strict; use Benchmark::Timer; sub IsWord { return if not @_ or length == 1; my ($word, $first, $second) = $_[0] =~ /^((.)(.).*)$/; local $_ = do{ local (@ARGV,$/) = "./$first/$second"; <> }; return /^$word$/im; } my $timer = Benchmark::Timer->new(skip=>0); for (@ARGV) { $timer->start; my $IsWord = IsWord( $_ ); $timer->stop; print $_, $IsWord ? ' is ' : ' is not ', 'a word', $/; } $timer->report; print "@{[$timer->data('_default')]}", $/; __END__

    The results and benchmark

      it came out approximately 20 ms/word ... I seriously doubt that any DBM solution will beat that!

      Well that sounds like a challenge :-)

      I started with a dictionary file containing about 100,000 words; created a GDBM file keyed on first 3 letters of words; and checked each word from a document containing 10479 words.

      Program to create index...

      If I tie the GDBM file once and check all the words the script runs in under a second which is (a) well under one millisecond per word and (b) probably not how you'd do it unless you had a dedicated wordcheck daemon

      If I tie the GDBM file once before checking each word (and then untie it afterwards) then checking all the words takes 50 seconds. Which is less than 5 milliseconds per word.

      use strict; use Benchmark; use GDBM_File ; my $dbm_file = 'words.dbm'; my %hash; $/ = undef; my $data = <DATA>; my @words = map {lc($_)} ($data =~ /(\w+)/g); print scalar(@words), "\n"; timethis(1, 'one_tie()'); timethis(1, 'many_ties()'); sub one_tie { my($found, $not_found) = (0, 0); tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640; foreach (@words) { my $prefix = substr($_, 0, 3); if(exists($hash{$prefix}) and $hash{$prefix} =~ /\b$_\b/) { $found++; } else { $not_found++; } } untie %hash ; print "Found: $found\nNot found: $not_found\n"; } sub many_ties { my($found, $not_found) = (0, 0); foreach (@words) { tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640; my $prefix = substr($_, 0, 3); if(exists($hash{$prefix}) and $hash{$prefix} =~ /\b$_\b/) { $found++; } else { $not_found++; } untie %hash ; } print "Found: $found\nNot found: $not_found\n"; } __DATA__ insert output of 'perldoc perltoot' here :-)

      Update: Of course you didn't mention what hardware you're running on - mines a 800Mhz PIII

      Update 2: Added code for generating index in <readmore> tags.

        Hmmm. Mine's a whimpy 233Mhz PII :^(

        Erm? Something like 20*233/800 = 5.825 plus some factor for greater efficiency of PIII v PII. To close too call without proper benchmarking I think (he clutches at a straw).

        It would be interesting to

        1. see how GDBM scales, I've no concept.
        2. to add a third char to the indexing of mine (maybe tomorrow).
        3. get them both running on the same box

        I must admit, when I said dbm, I was thinking RDMB (MySQL) rather (what I assume) is a B-tree DBM or similar. Never having used GDBM, I am very surprised at that performance. I guess I should keep my thoughts inward:^).

        Nice one++.


        Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

        As I said I would last night, I went back and adapted my to original programs to use 3-character indexing instead of two, to bring it into line with yours and benefit from the smaller linear search.

        The result, using my original 26 "worst case" words, is a reduction in average search time by nearly an order of magnitude from 20ms/word to 2.7ms/word, which is kind of pleasing.

        If you can make your 10k test file available to me, or we can agree on a suitable publically available test file, we could try comparing apples with apples:^).

        I do love it when someone makes me work for my one of my boasts!

        Indexer code

        Wordcheck test program

        Results and 'benchmark'.


        Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://208899]
Approved by adrianh
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (8)
As of 2014-09-20 03:12 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (152 votes), past polls