Re: Algorithm Showdown: (A litany of failures)

by BrowserUk (Patriarch)
on Dec 11, 2004 at 07:40 UTC [id://414076]


in reply to Algorithm Showdown: Fuzzy Matching

Okay. You, and a few others, seem to think that my attempt to walk away from this implies concession and sour grapes. It doesn't. I should have known myself better than to think I could walk away anyway. Here are a few (more) reasons why I reject this second round of "pissing contest":

Failure 1 - a really simple test.

My code works.

Ysth's code works.

Yours fails, but it does so really fast!

>type keys.fail1
aat

>type seqs.fail1
The fat cat tempted the rat under the hat with a vat of oat then pat the hat with a bat and sat down to on the mat to eat.

>..\406836-2 keys.fail1 seqs.fail1
Loaded 1 keys at P:\test\406836-2.pl line 12.
seq:00001 (00122)
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 9 ( 0 ++ 9) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'd t' in line: 1 @ 18 ( 0 ++ 18) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'rat' in line: 1 @ 24 ( 0 ++ 24) with fuzziness: 1
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 39 ( 0 ++ 39) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'wit' in line: 1 @ 42 ( 0 ++ 42) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 57 ( 0 ++ 57) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 66 ( 0 ++ 66) with fuzziness: 2
Fuzzy matched key:'aat' -v- ' a ' in line: 1 @ 81 ( 0 ++ 81) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'bat' in line: 1 @ 84 ( 0 ++ 84) with fuzziness: 1
Fuzzy matched key:'aat' -v- ' an' in line: 1 @ 87 ( 0 ++ 87) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 93 ( 0 ++ 93) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'n t' in line: 1 @ 99 ( 0 ++ 99) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'n t' in line: 1 @ 105 ( 0 ++ 105) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'mat' in line: 1 @ 111 ( 0 ++ 111) with fuzziness: 1
Fuzzy matched key:'aat' -v- 'fat' in line: 1 @ 4 ( 1 ++ 3) with fuzziness: 1
Fuzzy matched key:'aat' -v- 't t' in line: 1 @ 10 ( 1 ++ 9) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 25 ( 1 ++ 24) with fuzziness: 2
Fuzzy matched key:'aat' -v- ' a ' in line: 1 @ 46 ( 1 ++ 45) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'vat' in line: 1 @ 49 ( 1 ++ 48) with fuzziness: 1
Fuzzy matched key:'aat' -v- 't t' in line: 1 @ 58 ( 1 ++ 57) with fuzziness: 2
Fuzzy matched key:'aat' -v- 't t' in line: 1 @ 67 ( 1 ++ 66) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'hat' in line: 1 @ 73 ( 1 ++ 72) with fuzziness: 1
Fuzzy matched key:'aat' -v- 'a b' in line: 1 @ 82 ( 1 ++ 81) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 85 ( 1 ++ 84) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'and' in line: 1 @ 88 ( 1 ++ 87) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 112 ( 1 ++ 111) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'eat' in line: 1 @ 118 ( 1 ++ 117) with fuzziness: 1
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 5 ( 2 ++ 3) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'cat' in line: 1 @ 8 ( 2 ++ 6) with fuzziness: 1
Fuzzy matched key:'aat' -v- 'mpt' in line: 1 @ 14 ( 2 ++ 12) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'r t' in line: 1 @ 32 ( 2 ++ 30) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'hat' in line: 1 @ 38 ( 2 ++ 36) with fuzziness: 1
Fuzzy matched key:'aat' -v- 'a v' in line: 1 @ 47 ( 2 ++ 45) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 50 ( 2 ++ 48) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'oat' in line: 1 @ 56 ( 2 ++ 54) with fuzziness: 1
Fuzzy matched key:'aat' -v- 'pat' in line: 1 @ 65 ( 2 ++ 63) with fuzziness: 1
Fuzzy matched key:'aat' -v- 'at ' in line: 1 @ 74 ( 2 ++ 72) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'wit' in line: 1 @ 77 ( 2 ++ 75) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'sat' in line: 1 @ 92 ( 2 ++ 90) with fuzziness: 1
Fuzzy matched key:'aat' -v- 't t' in line: 1 @ 113 ( 2 ++ 111) with fuzziness: 2
Fuzzy matched key:'aat' -v- 'at.' in line: 1 @ 119 ( 2 ++ 117) with fuzziness: 2
Processed 1 sequences at P:\test\406836-2.pl line 52, <SEQ> line 1.
Average length: 122 at P:\test\406836-2.pl line 53, <SEQ> line 1.
Found 41 matches at P:\test\406836-2.pl line 54, <SEQ> line 1.

>ysth -KEYLEN=3 keys.fail1 seqs.fail1
line:1 offset:4 fuzz: 1 'aat'
line:1 offset:5 fuzz: 2 'aat'
line:1 offset:8 fuzz: 1 'aat'
line:1 offset:9 fuzz: 2 'aat'
line:1 offset:10 fuzz: 2 'aat'
line:1 offset:14 fuzz: 2 'aat'
line:1 offset:18 fuzz: 2 'aat'
line:1 offset:24 fuzz: 1 'aat'
line:1 offset:25 fuzz: 2 'aat'
line:1 offset:32 fuzz: 2 'aat'
line:1 offset:38 fuzz: 1 'aat'
line:1 offset:39 fuzz: 2 'aat'
line:1 offset:42 fuzz: 2 'aat'
line:1 offset:46 fuzz: 2 'aat'
line:1 offset:47 fuzz: 2 'aat'
line:1 offset:49 fuzz: 1 'aat'
line:1 offset:50 fuzz: 2 'aat'
line:1 offset:56 fuzz: 1 'aat'
line:1 offset:57 fuzz: 2 'aat'
line:1 offset:58 fuzz: 2 'aat'
line:1 offset:65 fuzz: 1 'aat'
line:1 offset:66 fuzz: 2 'aat'
line:1 offset:67 fuzz: 2 'aat'
line:1 offset:73 fuzz: 1 'aat'
line:1 offset:74 fuzz: 2 'aat'
line:1 offset:77 fuzz: 2 'aat'
line:1 offset:81 fuzz: 2 'aat'
line:1 offset:82 fuzz: 2 'aat'
line:1 offset:84 fuzz: 1 'aat'
line:1 offset:85 fuzz: 2 'aat'
line:1 offset:87 fuzz: 2 'aat'
line:1 offset:88 fuzz: 2 'aat'
line:1 offset:92 fuzz: 1 'aat'
line:1 offset:93 fuzz: 2 'aat'
line:1 offset:99 fuzz: 2 'aat'
line:1 offset:105 fuzz: 2 'aat'
line:1 offset:111 fuzz: 1 'aat'
line:1 offset:112 fuzz: 2 'aat'
line:1 offset:113 fuzz: 2 'aat'
line:1 offset:118 fuzz: 1 'aat'
line:1 offset:119 fuzz: 2 'aat'
Found 41 matches in (secs): 0.0565400123596191

>demerphq -KEYLEN=3 keys.fail1 seqs.fail1
Found 0 matches in (secs): 0.00815510749816895

Failure 2 -- use a slightly different alphabet.

My code works and finds no matches.

Ysth's code works and finds no matches.

Yours produces fifty-two thousand errors, really fast.

>..\406836-2 keys.fail2 seqs.fail2
Loaded 100000 keys at P:\test\406836-2.pl line 12.
seq:00001 (01000)
Processed 1 sequences at P:\test\406836-2.pl line 52, <SEQ> line 1.
Average length: 1000 at P:\test\406836-2.pl line 53, <SEQ> line 1.
Found 0 matches at P:\test\406836-2.pl line 54, <SEQ> line 1.

>ysth keys.fail2 seqs.fail2
Found 0 matches in (secs): 3.18631100654602

>demerphq keys.fail2 seqs.fail2 2> log

>wc -l log
52489 log

>u:head log
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.

>u:tail log
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Use of uninitialized value in addition (+) at Fuzzy/Matcher/DFA.pm line 97.
Found 0 matches in (secs): 5.7396399974823

Failure 3 -- verify the number of matches found is correct.

For this I use the set of all 2701 possible, 25-char, 2-mismatch fuzzy matches for the subsequence 'A' x 25.

By making the test sequence(s) all 'A's, every key will match at every offset, in every sequence.

This allows me to verify that all possible positions are checked and all possible fuzzy matches are found, and it allows a simple calculation to verify the total match count from an implementation.

For an 'A'x1000-char sequence and the 25-char, 2-mismatch keys, that is (1000-25+1) * 2701 = 2,636,176 matches.
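The arithmetic above is easy to sanity-check. Here is a minimal Perl sketch; the values 1000, 25 and 2701 are taken straight from the test description:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Values from the Failure 3 test above.
my $seqLen = 1_000;    # 'A' x 1000 test sequence
my $keyLen = 25;       # 25-char keys
my $nKeys  = 2_701;    # the 2-mismatch key set for 'A' x 25

# Every key matches at every possible start offset, so:
my $positions = $seqLen - $keyLen + 1;    # 976 start offsets
my $expected  = $positions * $nKeys;

print "$expected\n";    # 2636176, as reported by all three scripts below
```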

>demerphq -MEM=1 -FUZZ=2 -KEYLEN=25 keys.test seqs.test >nul
Found 2636176 matches in (secs): 33.2997260093689
Mem: 313,772 K

>..\406836-5 -MEM=1 -FUZZ=2 -KEYLEN=25 keys.test seqs.test >nul
Found 2636176 matches in (secs): 44.1074421405792
Mem: 3,048 K

>ysth -MEM=1 -FUZZ=2 -KEYLEN=25 keys.test seqs.test >nul
Found 2636176 matches in (secs): 66.0299181938171
Mem: 4,424 K

Yep! Your code sure is quick, but look at the memory consumption!

What happens if

  1. I push that sequence to 'A'x10,000 or 'A'x100,000?
  2. Or double the number of keys?
  3. Or increase the fuzz factor?

Mine handles it--slowly.

ysth's handles it--slowly.

Yours churns and churns, and fails--really, really slowly--eventually being terminated by the system for lack of virtual memory. I have 1GB available.

>ysth -MEM=1 -FUZZ=4 keys.fail3 seqs.fail3 >nul
Found 4498920 matches in (secs): 4612.35819220543
Mem: 50,284 K

>..\406836-4 -MEM=1 -FUZZ=4 keys.fail3 seqs.fail3 >nul
Found 4498920 matches in (secs): 8196.37320613861
Mem: 9,664 K

>demerphq -MEM=1 -FUZZ=4 keys.fail3 seqs.fail3 >nul
>rem Terminated by system when it had consumed all 512 MB of ram and all my swap space.

And I do hope that all those who disapproved of my attempt to walk away from this intensely frustrating waste of effort will notice that the last test alone required nearly 5 hours of my time--at 99% CPU usage, making it impossible to do anything else productive. I wonder how many of those who have demonstrated their disapproval have even bothered to read the crufty code I'm having to pick apart and debunk? Unlike demerphq, I don't abandon a test after less than two minutes just because it's inconvenient!

Failure 4 -- mixed length fragments

This is harder to demonstrate convincingly, and easily dismissed by reference back to the OP with its description of 25-char sequences. However, every description I have found of the processes (Digestion and Shredding) that biochemists use in the Cloning, Mapping and Sequencing of DNA produces large sets of fragments of variable length.

Every version of my algorithm published--including my latest, most efficient version, which is one or two orders of magnitude more efficient than the best so far published--happily deals with every key having a different length. Indeed, the -KEYLEN=25 parameter shown in the examples above is accepted by my versions only for consistency and is internally overridden.

As I stated elsewhere, I could make my code run much more quickly by making the same assumption that both your code and ysth's make (which he openly noted). I have avoided doing this because I want a generally applicable algorithm, whether for genome work, fuzzy name matching, looking for postcodes, or any other type of fuzzy match where the search keys can be any mixed bag of words.

You mentioned somewhere that your algorithm could be adapted to do this, but as it relies heavily, for at least part of its performance, upon ysth's multi-indexing mechanism, and he has stated that his algorithm would be affected, I can only assume that your implementation would be too.

The ethics of comparative studies

Before any comparisons are made, and results are published, it is pretty standard practice to ensure that the algorithms under test:

  1. Accept the same inputs.
  2. Produce the same results.
  3. Make the same assumptions.
  4. Have the same limitations.

I.e., if you're going to fairly and accurately compare different algorithms, then you agree on these criteria, or on any variances from them, before you publish huge, dramatic sets of statistics. And when you do publish, you note those assumptions and variances.

I brought this to your attention well in advance of your act of "publish and be damned", and you chose to ignore it.

So, you got damned.

And given that we have been here before, I have no intention of withdrawing one syllable of that condemnation, even if it does go against my stated "don't be rude" philosophy. It is clear from the record of those posts that yours was a knowing, deliberate act, and I called it as such.

Summation

So, let me sum up. You create an Inline::C implementation of a Trie and use it as a faster replacement for the standard hash in ysth's implementation of his brilliant, innovative algorithm, which itself relies on my xor technique to determine the fuzz factor.

What part did your original algorithm play in this? Zero! Nada! Not - a - jot!

You then wrap it up in a crufty, over-engineered test harness that serves to guarantee that only idiots like me are ever going to download, install and run it, to produce a big table of impressive-looking statistics comparing this 2-week, 3-man extravaganza with a piece of code knocked up on the spot, in under an hour, and use it to "claim victory"?

However, your Trie implementation--as used for those tests--is so domain-specific in its construction that it is severely limited in its application. Any single change--an extra character in the alphabet (ACGTX); or lowercase letters; or a fuzz factor higher than 4 in combination with any number of pathological test cases--causes it to variously: produce no output; or produce reams of errors and no matches; or exhaust all available memory before ever producing a single match.

And the kicker. Your code does not yet produce the required information. As tested and lauded, your code outputs:

  1. The position in the sequence where the match was found.
  2. The fuzz factor (as determined via the xor method).
  3. And the key-length sequence of characters from that position.

But what is missing? The value or index position of the key that was matched.

So now you have a file full of sub-sequences that match a key--which if you'd bothered to record the sequence number, as well as the offset, would be redundant anyway--all of which matched one of the (100,000 or 500,000 or 1,000,000) keys in the keys file. But which one?

Sure, you can post process the two together to determine the information, but that's damn nearly as big a job as the original problem.

And what will you use to fuzzy-match those fuzzy-matched sub-sequences back to the keys they matched? The xor method of course!
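For anyone who hasn't seen it, the xor method amounts to a couple of lines of Perl: bitwise-xor the key against an equal-length substring, and count the non-NUL bytes. Identical characters xor to "\0", so each non-NUL byte marks a mismatched position. A minimal sketch:

```perl
use strict;
use warnings;

# Count the character positions at which two equal-length strings
# differ: xor them, then count the bytes that are not "\0".
sub fuzz {
    my( $key, $sub ) = @_;
    return ( $key ^ $sub ) =~ tr/\0//c;
}

print fuzz( 'aat', 'rat' ), "\n";    # 1 (one mismatched position)
print fuzz( 'aat', 'd t' ), "\n";    # 2
print fuzz( 'aat', 'aat' ), "\n";    # 0 (exact match)
```

Those are exactly the fuzziness values reported in the Failure 1 output above.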

Yes. If used within the very specific domain of inputs, your combination of ysth's multi-indexing, a fast Inline-C Trie implementation, and the xor method does result in a very fast solution to the original problem. But maybe your module should be named:

Fuzzy(not greater than 4)::Matcher(if you don't need mixed case)::DFA(+multi-index+xor)::charset(ACGT).pm

There is nothing wrong with cross-fertilisation. Indeed, I congratulated ysth on his innovation the moment I understood how clever it is.

I'm in the process of nicking a few ideas myself. The result probably still won't (quite) match your performance on sparse datasets, but it will

  1. continue to handle the full range of characters, including unicode;
  2. continue to handle sets of variable length keys;
  3. continue to handle any degree of fuzz at no extra cost;
  4. continue to have constant (and low) memory consumption;
  5. and continue to produce all the required information at the first pass.

But for you to post that table of meaningless drivel under the heading "Algorithm shootout", when 90% of it isn't even your algorithm and bears almost no relationship to your original "proof-of-concept" code, and to compare it against my off-the-cuff implementation--which still works!; Eggs with bananas!

Gah! Words finally fail me. Nothing I want to add at this point is publishable.

Addendum: And my detailed analysis of both your original algorithm and this implementation of an entirely different one has shown me nothing that would cause me to rescind any of the fundamental points of my original analysis on the applicability of Tries to this class of problem. Some of the projected numbers change, sometimes by an order of magnitude or two, but the underlying basic deficiencies remain.

Without using ysth's innovation to allow you to avoid a brute-force linear search of every sub-sequence, using a Trie would be a linear O(nm) process.

And without the use of the xor algorithm to determine the fuzz factor, you would need to store this information in the Trie. But for the same reasons that you cannot relate the fuzzy-matched subsequence back to the key that it matched, you also cannot determine the fuzz factor without the xor method.

This is, as I identified, the simple reality that Tries work by uniquifying their keys. The process of producing all the fuzzy-match possibilities for each of the 100,000 keys will produce many duplicates, and these will all be folded down to a single path in the Trie. Yes, you can associate a value with that path, but which value? An array of input-key/fuzz-factor-that-produced-this-path pairs? But then how do you determine which of those pairings was responsible for this match?
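A toy illustration of that folding. The keys 'AAT' and 'ACT' and the exact-1-mismatch expansion are hypothetical, chosen only to keep the output short; the point is that different input keys can generate identical fuzzy variants, which a Trie collapses onto one path:

```perl
use strict;
use warnings;

my @alpha = qw( A C G T );

# All variants of $key with exactly one substitution.
sub variants {
    my $key = shift;
    my @out;
    for my $i ( 0 .. length( $key ) - 1 ) {
        for my $c ( @alpha ) {
            next if $c eq substr $key, $i, 1;
            my $v = $key;
            substr( $v, $i, 1 ) = $c;    # lvalue substr: replace one char
            push @out, $v;
        }
    }
    return @out;
}

# Map each generated variant back to the key(s) that produced it.
my %path;
for my $key ( 'AAT', 'ACT' ) {
    push @{ $path{$_} }, $key for variants( $key );
}

# Variants produced by both keys would fold to a single Trie path,
# losing the record of which input key was responsible.
for my $v ( sort grep { @{ $path{$_} } > 1 } keys %path ) {
    print "$v folds paths from: @{ $path{$v} }\n";
}
# Prints:
#   AGT folds paths from: AAT ACT
#   ATT folds paths from: AAT ACT
```

With 100,000 keys and fuzz 2 the number of such collisions is vastly larger, which is exactly the value-association problem described above.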

And what will attaching all that extra information in the Trie do for your memory consumption? Boom after 50 keys like your "proof-of-concept"!


Examine what is said, not who speaks.        The end of an era!
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^2: Algorithm Showdown: (A litany of failures)
by BrowserUk (Patriarch) on Dec 11, 2004 at 08:54 UTC

    Here are the test scripts used in the parent post exercises.

    408636-4.pl

    demerphq.pl

    ysth.pl

    A simple script for automating mass testing of the above scripts, plus any other candidates over any number of tests.

    Not used for the exercises above, because it was written while the above tests were running, as an antithesis of this and this, to show how simple it is to follow the K.I.S.S. principle. It's trivially incomplete, but the essence of it is there and works.

    showdown2.pl


