Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really!)

by BrowserUk (Patriarch)
on Nov 18, 2004 at 08:24 UTC


in reply to Re^4: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Oh really? Code and Timings)
in thread Fuzzy Searching: Optimizing Algorithm Selection

Well BrowserUk im afraid the numbers just dont agree with you at all.

No. Your figures don't agree. Your very selective figures.

The OP presented the question in the form

The basic problem:

Matching a 25-letter string against a dictionary of about 30 thousand variable length sequences. I need to find all:

  1. exact
  2. fuzzy (1- and 2-letter mismatches)

within the dictionary.

Performance matters because I have a library of several hundred thousand 25-letter strings. And because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.

I've highlighted the (IMO, which is as good as any other as the OP hasn't returned to clarify matters) significant figures in that quote.

Your table is very interesting. Particularly in what it doesn't show.

The most important point is that your algorithm does not scale!

The first thing I noticed was that the biggest number in the OP's presentation of the problem was the one you chose to vary through the most limited range.

He specified, clearly, 100,000 25-char keys ("words" as you refer to them). You only varied this value through a range of 1, 2, 5 & 10. I wondered why.

The following is the output from your unmodified program, apart from the addition of one extra line of diagnostics which I'll post at the bottom.

[ 5:40:10.75] P:\test\demerphq>del *.fuzz    rem Ensure a clean plate

[ 5:40:16.21] P:\test\demerphq>dir
 Volume in drive P has no label.
 Volume Serial Number is BCCA-B4CC

 Directory of P:\test\demerphq

18/11/2004  05:40    <DIR>          .
18/11/2004  05:40    <DIR>          ..
18/11/2004  05:38             9,597 demerphq.pl
               1 File(s)          9,597 bytes
               2 Dir(s)  48,390,365,184 bytes free

[ 5:40:21.90] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=1 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 1 StringCount: 1000 Xor: 4.453125 Trie: 10.390625 (1.859375 + 8.53125)
**** perl.exe  416  0  12,320 K

[ 5:40:43.59] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=10 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 10 StringCount: 1000 Xor: 21.0625 Trie: 95.796875 (2.328125 + 93.46875)
**** perl.exe  3196  0  78,164 K

[ 5:42:51.90] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=100 2>&1 | find "****"
#### The above run self-terminated (crashed) because it was out of memory!!! ####

[ 5:49:20.62] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=20 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 20 StringCount: 1000 Xor: 41.453125 Trie: 195.71875 (2.5 + 193.21875)
**** perl.exe  2708  0  149,812 K

[ 5:53:46.56] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=30 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 30 StringCount: 1000 Xor: 59.59375 Trie: 293.75 (2.59375 + 291.15625)
**** perl.exe  3504  0  222,616 K

[ 6:00:00.59] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=40 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 40 StringCount: 1000 Xor: 79.265625 Trie: 404.96875 (2.734375 + 402.234375)
**** perl.exe  3412  0  293,504 K

[ 6:13:48.76] P:\test\demerphq>perl demerphq.pl --word_size=25 --string_size=1000 --words=50 2>&1 | find "****"
**** WordSize: 25 StringSize: 1000 WordCount: 50 StringCount: 1000 Xor: 97.546875 Trie: 494.3125 (2.796875 + 491.515625)
**** perl.exe  3524  0  366,868 K

As you can see, I filtered the output to show just the headline numbers, and the extra line of diagnostics I added. I've also hand wrapped the output for the usual reasons.

The bottom line here is that if you have the resources, long term you are going to be better off with a technique like this than a technique like yours.

You even predicted the problem, but dismissed it.

If you look at the third run I attempted above, you'll see that I annotated it as running out of memory. I did a little math based on the memory consumption figures output using the MS tasklist.exe utility.

  1. 1 x 25-char key requires 12 MB.
  2. 10 x 25-char keys (words) require 78 MB.
  3. 20 x 25-char keys (words) require 149 MB (delta for 10 words: 71 MB).
  4. 30 x 25-char keys (words) require 222 MB (delta for 10 words: 73 MB).
  5. 40 x 25-char keys (words) require 293 MB (delta for 10 words: 71 MB).
  6. 50 x 25-char keys (words) require 366 MB (delta for 10 words: 73 MB).

That's pretty consistent, and linear, at > 70 MB / 10 keys.

So, calculating the RAM requirement to build the MegaTrie DFA that your algorithm depends on for its speed comes out to:

100,000 x 25-char keys (words) requires 696 Gigabytes of RAM.
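
(That figure is just a linear extrapolation of the tasklist deltas above. A quick back-of-the-envelope sketch, taking the 50-key measurement as the baseline:)

use strict;
use warnings;

# Back-of-the-envelope check of the extrapolation, using the tasklist
# figures above: 50 x 25-char keys needed ~366,868 KB.
my $kb_for_50_keys = 366_868;
my $keys           = 100_000;
my $gb             = $kb_for_50_keys / 50 * $keys / ( 1024 ** 2 );

printf "~%.0f GB of RAM for %d keys\n", $gb, $keys;
# Prints roughly 700 GB; the ~696 GB figure above comes from the
# per-10-key deltas rather than the single 50-key data point.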

As no $5000 system I am aware of is capable of addressing that volume of RAM, the question of whether the sequences are 1,000 chars or 10,000 chars long, and which would be the faster algorithm if that was the OP's requirement, simply doesn't arise--your algorithm doesn't get going, because you cannot build your MegaTrie/DFA!

The Xor algorithm does handle "several hundred thousand" keys easily.
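
For anyone following along, here is a stripped-down sketch of the idea behind the Xor comparison (illustrative only; this is not the benchmarked code, and the names and demo data are mine): XOR the 25-char key against each 25-char window of the sequence, then count the bytes that didn't cancel to NUL.

use strict;
use warnings;

my $KEYLEN = 25;
my $FUZZ   = 2;

sub xor_scan {
    my( $key, $seq ) = @_;
    my @hits;
    for my $pos ( 0 .. length( $seq ) - $KEYLEN ) {
        # Identical characters xor to "\0"; anything else is a mismatch.
        my $xored      = substr( $seq, $pos, $KEYLEN ) ^ $key;
        my $mismatches = $KEYLEN - ( $xored =~ tr/\0// );
        push @hits, [ $pos, $mismatches ] if $mismatches <= $FUZZ;
    }
    return @hits;
}

my $key = 'ACGT' x 6 . 'A';            # a 25-char key
my $seq = 'TTTTT' . $key . 'TTTTT';    # one exact copy embedded
printf "pos %d: %d mismatches\n", @$_ for xor_scan( $key, $seq );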

Update: I forgot to post the one line I changed in your program:

## This:
exit(main(@ARGV));

## became this:
main(@ARGV);
print "**** $_" for `tasklist /NH /FI "PID eq $$"`;
exit(0);

Type tasklist /? on your local XP system for an explanation of the parameters (/NH suppresses the column headers and /FI "PID eq $$" filters the listing down to the current perl process, which is what produces the single **** perl.exe line per run above).


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Re^6: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Nope.)
by demerphq (Chancellor) on Nov 18, 2004 at 09:44 UTC

    The Xor algorithm does handle "several hundred thousand" keys easily.

    No, the Xor algorithm does NOT handle several hundred thousand keys easily. The rate of growth of the run time for searching a string 100k long is too high. Look at the numbers: the algorithm gets slower the more keys there are involved. So slow that you could construct a Trie/DFA for a subset, run them in series, and still outperform your algorithm. I chose the numbers I chose because if your algorithm blows out at those numbers it has already totally lost the game.

    Look at the numbers. For 10 x 25-char words searching 100k strings my solution took 5.25 minutes. Your solution took half an hour. So you could run mine 6 times and yours still wouldn't have finished. Thus mine would do 60 words in 30 minutes while yours was still struggling with 10 words. Adding words appears to make your solution grow faster than mine.

    Now mine could be made even more efficient. As I said, the construction routine I'm using is far from optimal; I'm pretty sure it can be made to run in time linear in the number of keys. Not only that, but all of your memory calculations are like your other calculations: misleading. An optimized representation for this problem should require only 20 bytes per node (5 * length(long)). Consider that a single empty hash and the reference to it (of which there are hundreds of thousands in use in my prototype code) take more memory than that, at least double, as an SV by itself takes 28 bytes. So the scale of problems this solution can handle in production-quality code gets much, much larger than the toy implementation here.
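
    (A rough sketch of what such a packed node might look like; the field layout and the 4-letter alphabet are illustrative assumptions here, not the prototype's actual structures:)

    use strict;
    use warnings;

    # A fixed-width trie node packed as 5 longs: one child slot per
    # letter of a 4-symbol alphabet plus an accept/word-id slot, i.e.
    # 5 * 4 = 20 bytes of payload. A hash-of-hashes node costs far more
    # once the HV, its buckets and the child SVs are counted (a bare SV
    # alone is ~28 bytes).
    my @children = ( 0, 0, 0, 0 );   # offsets of the A/C/G/T children; 0 = none
    my $word_id  = -1;               # -1 = not an accepting state
    my $node     = pack 'l5', @children, $word_id;

    print length( $node ), " bytes of payload per node\n";   # 20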

    Anyway, I made an assertion in my original node in this thread. I've now proved it with _real_ numbers and working code. You claimed it was nonsense. It has proved not to be. And all of your wriggling and trying to rewrite the situation to make your claims true doesn't work. I made an assertion and my assertion has proved true. Within the parameters where it is feasible to use a DFA/Trie, it will smoke your Xor solution for cases where the string being searched is fairly long. Your objections to that claim have proved incorrect, as has your math.

    You confidently calculated that it would take years for a solution like this to run, and you have been proved utterly wrong. Even if you went with the current code and ran it 100 times, it would still only take 525 minutes to search for 1000 words over 1000 x 100k strings. Looking at the run time between 5 words and 10 words over 1000 x 100k strings for your solution, we see a 1000-second increase. Since your operation will scale proportionally to the number of words and the length of the string, we can conclude that processing 1000 words on the same data set will take your solution about 3000 minutes. Which is 2500 minutes LONGER than mine. That's 41 hours longer. And you want to go this route with 100k words? Using the same logic, 100k words would take your solution 20 million minutes; that's ~38 years. Versus the approximately 36 days it would take mine. I know which one I would prefer to wait for.

    To make this picture even worse for your solution, I'd like to point out that in my earlier code I was effectively doing $Fuzzy=5 by accident and it was still outpacing your solution. That added tens of thousands of words to the search list. If it was possible to do that with a pure Perl solution, imagine what a String/Struct/Inline::C/Perl hybrid would do. Using a small fraction of the RAM and doing the search operations in C means that my solution just gets faster.

    All told, I think it's quite clear that I have proved the point I made in my original post: a state-machine-based scan will outperform anything posted in this thread. And it has. I know you will come back with more of the same here, but I'm no longer interested. You insulted me and you posted a bunch of nonsense math saying I'm all wrong. And now that I've proved quite conclusively that you were wrong, have you gone and updated any of that to point out it's wrong? Have you added any caveats anywhere? Isn't this why you were insulting and abusive to me in the CB? You objected to my asserting that my proposal was more efficient than yours, and to my not being clearer in the caveat I added at the bottom, but what about your nodes? They are full of totally nonsensical equations and arguments supposedly proving how my approach just can't work, but hey, it outperforms yours. That's a fact that you can't get around no matter how hard you try.

    ---
    demerphq
