Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re^4: Fuzzy Searching: Optimizing Algorithm Selection

by demerphq (Chancellor)
on Nov 12, 2004 at 23:48 UTC ( [id://407567]=note: print w/replies, xml ) Need Help??


in reply to Re^3: Fuzzy Searching: Optimizing Algorithm Selection
in thread Fuzzy Searching: Optimizing Algorithm Selection

Your math is all wrong.

Your 30,000,000 * 1275 = 38,250,000,000 comparisons.

I used an 11 digit key to generate my fuzzy list of 1275 words (i took a phone book and picked a number and generated all the one and two digit variants then searched against the book, the book was just under 35,000,000 numbers). Since the number of comparisons in a lookup into a Trie is the same as number of characters in the matched key the number of comparisons at absolute top used by the Trie based algorithm is thus 11 * N where N is the number of words in the dictionary. Thus the absolute maximum number of comparisons to search a 35 million word dictionary is 385 million character comparisons, in practice it will be signifgiantly less because of early bailouts.

Oh another cool thing about the trie approach is that it could do fuzzy matches on a whole host of strings simultaneously for exactly the same work. So the op said he needs to run a large set of these matches. He could generate his fuzzy list for each word walk the dictionary and have matched the full set in one go.

All of this assumes the thing i got wrong in the first place: that the match is start point anchored. I wasnt paying attention, and when i realized when I did my resubmit preview I added the bit starting "gah". Since i thought the ideas raised were interesting, and the qualifier present I didn't think it worth editing further. Maybe next time ill put in huge blinking letters "OOPS -- I ANSWERED THE WRONG QUESTION" so its more clear.

---
demerphq

Replies are listed 'Best First'.
Re^5: Fuzzy Searching: Optimizing Algorithm Selection
by BrowserUk (Patriarch) on Nov 13, 2004 at 00:36 UTC

    So what your saying is that my estimate of the performance of your Trie implementation is wildly optimistic.

    Therefore the number of comparisons/second you achieved is less than the 21,250,000 figure I used as a divider in the rest of my calculations.

    Ergo: Your proposal would take even longer to run.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://407567]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (6)
As of 2024-04-25 13:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found