"Well BrowserUk I'm afraid the numbers just don't agree with you at all."
No. Your figures don't agree. Your very selective figures.
The OP presented the question in this form:
The basic problem:
Matching a 25-letter string against a dictionary of about 30 thousand variable-length sequences. I need to find all:
- exact
- fuzzy (1- and 2-letter mismatches)
within the dictionary.
Performance matters because I have a library of several hundred thousand 25-letter strings. And because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.
I've highlighted the significant figures in that quote (IMO, which is as good as anyone's, since the OP hasn't returned to clarify matters).
Your table is very interesting. Particularly in what it doesn't show.
The most important point is that your algorithm does not scale!
The first thing I noticed was that the biggest number in the OP's presentation of the problem was the one you chose to vary through the most limited range.
He specified, clearly, 100,000 25-char keys ("words", as you refer to them), yet you only varied this value through the range 1, 2, 5 & 10. I wondered why.
The following is the output from your unmodified program, apart from the addition of one extra line of diagnostics, which I'll post at the bottom.
[ 5:40:10.75] P:\test\demerphq>del *.fuzz rem Ensure a clean plate
[ 5:40:16.21] P:\test\demerphq>dir
Volume in drive P has no label.
Volume Serial Number is BCCA-B4CC
Directory of P:\test\demerphq
18/11/2004 05:40 <DIR> .
18/11/2004 05:40 <DIR> ..
18/11/2004 05:38 9,597 demerphq.pl
1 File(s) 9,597 bytes
2 Dir(s) 48,390,365,184 bytes free
[ 5:40:21.90] P:\test\demerphq>perl demerphq.pl
--word_size=25 --string_size=1000 --words=1
2>&1 | find "****"
**** WordSize: 25
StringSize: 1000
WordCount: 1
StringCount: 1000
Xor: 4.453125
Trie: 10.390625 (1.859375 + 8.53125)
**** perl.exe 416 0 12,320 K
[ 5:40:43.59] P:\test\demerphq>perl demerphq.pl
--word_size=25 --string_size=1000 --words=10
2>&1 | find "****"
**** WordSize: 25
StringSize: 1000
WordCount: 10
StringCount: 1000
Xor: 21.0625
Trie: 95.796875 (2.328125 + 93.46875)
**** perl.exe 3196 0 78,164 K
[ 5:42:51.90] P:\test\demerphq>perl demerphq.pl
--word_size=25 --string_size=1000 --words=100
2>&1 | find "****"
#### The above run self-terminated (crashed) because it was out of memory!!! ####
[ 5:49:20.62] P:\test\demerphq>perl demerphq.pl
--word_size=25 --string_size=1000 --words=20
2>&1 | find "****"
**** WordSize: 25
StringSize: 1000
WordCount: 20
StringCount: 1000
Xor: 41.453125
Trie: 195.71875 (2.5 + 193.21875)
**** perl.exe 2708 0 149,812 K
[ 5:53:46.56] P:\test\demerphq>perl demerphq.pl
--word_size=25 --string_size=1000 --words=30
2>&1 | find "****"
**** WordSize: 25
StringSize: 1000
WordCount: 30
StringCount: 1000
Xor: 59.59375
Trie: 293.75 (2.59375 + 291.15625)
**** perl.exe 3504 0 222,616 K
[ 6:00:00.59] P:\test\demerphq>perl demerphq.pl
--word_size=25 --string_size=1000 --words=40
2>&1 | find "****"
**** WordSize: 25
StringSize: 1000
WordCount: 40
StringCount: 1000
Xor: 79.265625
Trie: 404.96875 (2.734375 + 402.234375)
**** perl.exe 3412 0 293,504 K
[ 6:13:48.76] P:\test\demerphq>perl demerphq.pl
--word_size=25 --string_size=1000 --words=50
2>&1 | find "****"
**** WordSize: 25
StringSize: 1000
WordCount: 50
StringCount: 1000
Xor: 97.546875
Trie: 494.3125 (2.796875 + 491.515625)
**** perl.exe 3524 0 366,868 K
As you can see, I filtered the output to show just the headline numbers and the extra line of diagnostics I added. I've also hand-wrapped the output for the usual reasons.
You even predicted the problem, but dismissed it:
"The bottom line here is that if you have the resources, long term you are going to be better off with a technique like this than a technique like yours."
If you look at the third run I attempted above, you'll see that I annotated it as having run out of memory. I did a little math based on the memory-consumption figures output by the MS tasklist.exe utility.
- 1 x 25-char key requires 12 MB.
- 10 x 25-char keys (words) require 78 MB.
- 20 x 25-char keys require 149 MB (delta for 10 words: 71 MB).
- 30 x 25-char keys require 222 MB (delta for 10 words: 73 MB).
- 40 x 25-char keys require 293 MB (delta for 10 words: 71 MB).
- 50 x 25-char keys require 366 MB (delta for 10 words: 73 MB).
That's pretty consistent, and linear, at just over 70 MB per 10 keys.
So, calculating the RAM required to build the MegaTrie DFA that your algorithm depends on for its speed comes out to:
100,000 x 25-char keys (words) requires roughly 696 gigabytes of RAM.
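For clarity, here is the back-of-the-envelope arithmetic behind that figure (my own sketch, assuming the ~73,000 K per 10 keys delta above holds linearly all the way up):

#!/usr/bin/perl
use strict;
use warnings;

## Linear extrapolation from the tasklist deltas above:
## roughly 73,000 K of working set per 10 keys.
my $k_per_10_keys = 73_000;                      ## kilobytes per 10 keys
my $keys          = 100_000;                     ## the OP's key count

my $total_k = $k_per_10_keys * $keys / 10;       ## 730,000,000 K
printf "%.0f GB\n", $total_k / ( 1024 * 1024 );  ## prints "696 GB"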
As no $5,000 system I am aware of is capable of addressing that volume of RAM, the question of whether the sequences are 1,000 or 10,000 chars long, and which algorithm would be faster if that were the OP's requirement, simply doesn't arise: your algorithm never gets going, because you cannot build your MegaTrie/DFA!
The Xor algorithm, by contrast, does handle "several hundred thousand" keys easily.
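For anyone wondering why the Xor side is so cheap on memory, the core of the technique fits in a few lines. This is a minimal sketch of the idea, not the benchmarked code: XOR a fixed-length key against every same-length window of the sequence; identical characters XOR to NUL, so counting the non-NUL bytes gives the mismatch count, and no per-key data structure is ever built.

#!/usr/bin/perl
use strict;
use warnings;

## Count character mismatches between a fixed-length key and every
## same-length substring of a sequence: identical characters XOR to
## "\0", so the non-NUL bytes in ($key ^ $window) are the mismatches.
sub fuzzy_positions {
    my( $key, $seq, $max_misses ) = @_;
    my $len = length $key;
    my @hits;
    for my $pos ( 0 .. length( $seq ) - $len ) {
        my $xor    = $key ^ substr( $seq, $pos, $len );
        my $misses = $xor =~ tr/\0//c;
        push @hits, [ $pos, $misses ] if $misses <= $max_misses;
    }
    return @hits;
}

my $key = 'ACGTA' x 5;                  ## a 25-char key
my $seq = 'TTTTT' . $key . 'GGGGG';     ## toy 35-char sequence
substr( $seq, 10, 1 ) = 'N';            ## introduce 1 mismatch

## Reports the key at position 5 with exactly 1 mismatch.
printf "pos %d: %d mismatches\n", @$_
    for fuzzy_positions( $key, $seq, 2 );

Note that the working set here is just the key, the sequence, and one temporary string per window; memory use is independent of the number of keys, which is exactly why it scales where the MegaTrie does not.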
Update: I forgot to post the one line I changed in your program:
## This:
exit(main(@ARGV));
## became this:
main(@ARGV);
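## Report this process's memory usage (looked up by its PID, $$) before exiting.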
print "**** $_" for `tasklist /NH /FI "PID eq $$"`;
exit(0);
Type tasklist /? on your local XP system for an explanation of the parameters.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon