Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really, really!)

by BrowserUk (Patriarch)
on Nov 27, 2004 at 13:24 UTC ( [id://410708]=note: print w/replies, xml ) Need Help??


in reply to Re^4: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Oh really? Code and Timings)
in thread Fuzzy Searching: Optimizing Algorithm Selection

demerphq The numbers in your table are meaningless.

  1. Your solution is broken.

    To show this, I ran your test harness (slightly modified to accept filenames from the command line and disable the "only output the first 50 matches" code) on two small testfiles. The first was generated using this small script which generates all the variations of a specified length key; I used 10-char keys (406 variations) because your prototype code cannot handle 25-chars (2701 variations).

    #! perl -slw use strict; our $LEN ||= 10; my $s = 'A' x $LEN; print $s; for my $p1 ( 0 .. $LEN -1 ) { for my $c1 ( qw[ C G T ] ) { for my $p2 ( $p1 .. $LEN -1 ) { next if $p1 == $p2; for my $c2 ( qw[ C G T ] ) { my $t = $s; substr $t, $p1, 1, $c1; substr $t, $p2, 1, $c2; print $t; } } } } __END__ P:\test\demerphq>keyGen -LEN=10 >test.words P:\test\demerphq>wc -l test.words 406 test.words

    Here i've printed the first and last 10 to show the sequence:

    P:\test\demerphq>u:head test.words AAAAAAAAAA CCAAAAAAAA CGAAAAAAAA CTAAAAAAAA CACAAAAAAA CAGAAAAAAA CATAAAAAAA CAACAAAAAA CAAGAAAAAA CAATAAAAAA P:\test\demerphq>u:tail test.words AAAAAAATAT AAAAAAAACC AAAAAAAACG AAAAAAAACT AAAAAAAAGC AAAAAAAAGG AAAAAAAAGT AAAAAAAATC AAAAAAAATG AAAAAAAATT

    I used the following, 11-char single sequence for the test:

    P:\test\demerphq>type test.seq AAAAAAAAAAA

    The thing to note is that every one of the 406 keys in test.words will match that single 11-char sequence. Twice, once at offset 0, and again at offset 1.

    Now the run.

    [11:37:48.89] P:\test\demerphq>TrieXor.orig.pl test.words test.seq %#% 11:37:51 ### Start %#% 11:37:51 ( 0.0048 : 0.0048) Finish Total Elapsed Time: 0.00477910041809082 secs %#% 11:37:51 ### Starting TRIE %#% 11:37:51 ( 0.0029 : 0.0029) Got 406 to fuzzyify %#% 11:37:52 ( 1.0381 : 1.0411) Fuzzy Words:16906 [39678 nodes i +n tree] %#% 11:37:56 ( 4.7969 : 5.8380) Finised Traverse %#% 11:37:56 ( 0.0000 : 5.8380) Finish Total Elapsed Time: 5.83795619010925 secs %#% 11:37:56 ### Starting Trie Scan %#% 11:37:56 ( 0.0000 : 0.0000) TRIE #0001: 0002 > : 000000:AAAAAAAAAA:0, 000001:AAAAAAAAAA:0 %#% 11:37:56 ( 0.0000 : 0.0000) Trie Done %#% 11:37:56 ( 0.0313 : 0.0313) Finish Total Elapsed Time: 0.03125 secs %#% 11:37:56 ### Starting xor_fuzz search %#% 11:37:56 ( 0.0000 : 0.0000) xor_fuzz loading 'test.words' %#% 11:37:56 ( 0.0000 : 0.0000) Loaded 406 10-ers %#% 11:37:56 ( 0.0000 : 0.0000) XOR #0001: 0000 > : 000000:ACAAAAAAGA:2, 000000:GAAAATAAAA:2, 000000:CAACAAAAAA:2, 000000:AAAACAAAAC:2, 000000:AAACATAAAA:2, 000000:AACAATAAAA:2, 000000:AGAAAACAAA:2, 000000:ATGAAAAAAA:2, 000000:CAAAAATAAA:2, 000000:AAAAAGAATA:2, 000000:AATAAAAGAA:2, 000000:AAACAAAAAC:2, 000000:AAAAAAAGAT:2, 000000:AGAAAAAAAT:2, 000000:AAACAAACAA:2, 000000:AAAAAAAACG:2, 000000:AAAGAAACAA:2, 000000:ATACAAAAAA:2, 000000:AAAAACAACA:2, 000000:AAACAAAGAA:2, 000000:ACAAAGAAAA:2, 000000:AAACAAAAAT:2, 000000:TAAAAGAAAA:2, 000000:AAAAAATCAA:2, 000000:AAAAAATAAT:2, 000000:AAAACCAAAA:2, 000000:AAATAAAGAA:2, 000000:ATAAAAAACA:2, 000000:AAGAAATAAA:2, 000000:AAACCAAAAA:2, 000000:AAAAGGAAAA:2, 000000:AAGAAAGAAA:2, 000000:AAGTAAAAAA:2, 000000:AAAAAAGAAC:2, 000000:ACAAAAAATA:2, 000000:AACAAAATAA:2, 000000:AATGAAAAAA:2, 000000:ACTAAAAAAA:2, 000000:CAAAAAAAAC:2, 000000:AAAAAACGAA:2, 000000:AAATAAAAGA:2, 000000:GAAACAAAAA:2, 000000:AAAAGAAAGA:2, 000000:AAAATAAGAA:2, 000000:ACATAAAAAA:2, 000000:AAAGAAATAA:2, 000000:ATAAAAATAA:2, 000000:TAAAAAAAAT:2, 000000:AAAACATAAA:2, 000000:AAAAAAAAGG:2, 000000:AAAAAAAGAG:2, 000000:AAGAACAAAA:2, 000000:AAAAAAAATT:2, 000000:TAAAAAACAA:2, 000000:GAAATAAAAA:2, 000000:AAATCAAAAA:2, 000000:ACAAACAAAA:2, 000000:ACAATAAAAA:2, 000000:CAATAAAAAA:2, 000000:AAAAAAACCA:2, 000000:TAAAAAGAAA:2, 000000:ATAAAAAAAT:2, 000000:AAAAATTAAA:2, 000000:AGAAAAAATA:2, 000000:ACAACAAAAA:2, 000000:CAAAAAAACA:2, 000000:AAAAAAATGA:2, 000000:AAATAGAAAA:2, 000000:ACAAAAAAAC:2, 000000:AAAAACATAA:2, 000000:AAAAATAACA:2, 000000:GAAGAAAAAA:2, 000000:CAAAAAATAA:2, 000000:AAAAATAAAT:2, 000000:GACAAAAAAA:2, 000000:ATATAAAAAA:2, 000000:AAAAAACAAC:2, 000000:AAAAACACAA:2, 000000:AAAAACAGAA:2, 000000:TAAAAAAAGA:2, 000000:AAACAACAAA:2, 000000:TAAAAAAAAC:2, 000000:AACAAATAAA:2, 000000:AACCAAAAAA:2, 000000:AAAAAAGACA:2, 000000:AAAGAAAATA:2, 000000:CTAAAAAAAA:2, 000000:CAAAGAAAAA:2, 000000:AGTAAAAAAA:2, 000000:AAAAAAGAAG:2, 000000:CAAATAAAAA:2, 000000:ACAAAAACAA:2, 000000:AAAAACGAAA:2, 000000:AAAAAGCAAA:2, 000000:CAAAAGAAAA:2, 000000:AAGGAAAAAA:2, 000000:AAAAAAAATG:2, 000000:AAAATAACAA:2, 000000:AAAAAGAAAT:2, 000000:AATAAAGAAA:2, 000000:AACAAGAAAA:2, 000000:ATAAAGAAAA:2, 000000:AGAAAAAAGA:2, 000000:AAAAATAATA:2, 000000:AAAACACAAA:2, 000000:AGAGAAAAAA:2, 000000:AGACAAAAAA:2, 000000:AAAATAAAAC:2, 000000:AAAGAAAAGA:2, 000000:AAAAAAGATA:2, 000000:AAAAAAACGA:2, 000000:AAAGAATAAA:2, 000000:AAAAAAATAG:2, 000000:AAAATAATAA:2, 000000:TAGAAAAAAA:2, 000000:AACAAAAACA:2, 000000:GAAAAAAATA:2, 000000:AAAAAAAGAC:2, 000000:AAAATACAAA:2, 000000:GAAAAACAAA:2, 000000:AAAAAAATAT:2, 000000:AAAAGAGAAA:2, 000000:AAAGAAAAAC:2, 000000:AATAACAAAA:2, 000000:AAAAGAAAAT:2, 000000:AAATACAAAA:2, 000000:AAAAATAAAG:2, 000000:AAATGAAAAA:2, 000000:AAATAAAATA:2, 000000:AAAAAATGAA:2, 000000:AAAAAGAAAC:2, 000000:AAAAGAAAAC:2, 000000:AATAAAACAA:2, 000000:AAAAAAGAAT:2, 000000:AAAAAAAAGT:2, 000000:AAAAAGAGAA:2, 000000:AAAAACCAAA:2, 000000:CAAAAAAATA:2, 000000:TAAGAAAAAA:2, 000000:AAAAATAAGA:2, 000000:AAAAAACAAG:2, 000000:AAACGAAAAA:2, 000000:GAAAAAAAAT:2, 000000:AACATAAAAA:2, 000000:AAAGTAAAAA:2, 000000:CAGAAAAAAA:2, 000000:ACAAAAAGAA:2, 000000:AAAAAAAGCA:2, 000000:AAGACAAAAA:2, 000000:AAAGCAAAAA:2, 000000:ACAAAAATAA:2, 000000:CAAAACAAAA:2, 000000:AGAAAATAAA:2, 000000:AACAAAACAA:2, 000000:AAAAAAACAT:2, 000000:AGAAAAAGAA:2, 000000:ATAAAACAAA:2, 000000:AAAAATACAA:2, 000000:AAAAAAAGTA:2, 000000:TAAAACAAAA:2, 000000:ACAAAAAAAG:2, 000000:TAACAAAAAA:2, 000000:GAAAAAAAAC:2, 000000:GGAAAAAAAA:2, 000000:TAAAGAAAAA:2, 000000:AGAAGAAAAA:2, 000000:AAATTAAAAA:2, 000000:AAACAAAAGA:2, 000000:AAATAAAAAG:2, 000000:AATAAAAAGA:2, 000000:AAATAAAAAT:2, 000000:AAAGAACAAA:2, 000000:TATAAAAAAA:2, 000000:AAGAAAAAGA:2, 000000:AAAAAAAGGA:2, 000000:ATAAAAAATA:2, 000000:TAAAAACAAA:2, 000000:AACAGAAAAA:2, 000000:AAAAACAAAC:2, 000000:AAAAAAAATC:2, 000000:CAAAAAAAGA:2, 000000:AAAAAATAGA:2, 000000:AAAAAAATAC:2, 000000:GAGAAAAAAA:2, 000000:AAAAAGTAAA:2, 000000:AACAAAGAAA:2, 000000:AAACAAGAAA:2, 000000:ATAAAAGAAA:2, 000000:ACAAAACAAA:2, 000000:AAAAAAACAG:2, 000000:GAAAAATAAA:2, 000000:TAAAATAAAA:2, 000000:GAAAAAAGAA:2, 000000:ACAAAAAACA:2, 000000:AAAAGTAAAA:2, 000000:AATAAAAAAC:2, 000000:AAAAAACAGA:2, 000000:AAAGAAAACA:2, 000000:GAAAACAAAA:2, 000000:GAATAAAAAA:2, 000000:AGAAACAAAA:2, 000000:AAAAGAAAAG:2, 000000:AATTAAAAAA:2, 000000:GTAAAAAAAA:2, 000000:TAAAAATAAA:2, 000000:AAAATAAACA:2, 000000:TAAAAAAAAG:2, 000000:AAGAAAACAA:2, 000000:TAAAAAATAA:2, 000000:AGGAAAAAAA:2, 000000:CAAAAACAAA:2, 000000:AAAAGAAATA:2, 000000:ACAGAAAAAA:2, 000000:AAAGAAGAAA:2, 000000:AAAAACAATA:2, 000000:AAAGAAAGAA:2, 000000:AAAAAAACTA:2, 000000:AAGAATAAAA:2, 000000:AATAAGAAAA:2, 000000:AAAAATCAAA:2, 000000:TAAAAAAATA:2, 000000:AAAAAATAAG:2, 000000:AGAAAAGAAA:2, 000000:AATAAAAATA:2, 000000:AAATAAAACA:2, 000000:ATCAAAAAAA:2, 000000:ATAAAAAAAC:2, 000000:GAAAAAAAGA:2, 000000:ATAAGAAAAA:2, 000000:AACAAAAATA:2, 000000:AAAGGAAAAA:2, 000000:ACAAGAAAAA:2, 000000:CATAAAAAAA:2, 000000:AGAACAAAAA:2, 000000:AAGAAAAAAT:2, 000000:GAAAGAAAAA:2, 000000:AATAATAAAA:2, 000000:GAAAAAACAA:2, 000000:AAACACAAAA:2, 000000:GAAAAAATAA:2, 000000:AAAAACAAAT:2, 000000:AAACAAAACA:2, 000000:AATAAAAACA:2, 000000:AAAGACAAAA:2, 000000:AAAATAGAAA:2, 000000:AAAACAAGAA:2, 000000:AAAATTAAAA:2, 000000:AAAACAGAAA:2, 000000:AAAAACAAAG:2, 000000:ATAAACAAAA:2, 000000:AGCAAAAAAA:2, 000000:AACAACAAAA:2, 000000:AAAGAAAAAG:2, 000000:AAAAAACAAT:2, 000000:AAAAGAAACA:2, 000000:AAACTAAAAA:2, 000000:AACGAAAAAA:2, 000000:CGAAAAAAAA:2, 000000:AAATAATAAA:2, 000000:AATAAATAAA:2, 000000:AGAAAAACAA:2, 000000:AAGAAGAAAA:2, 000000:AATATAAAAA:2, 000000:AAACAGAAAA:2, 000000:GCAAAAAAAA:2, 000000:AAAAACAAGA:2, 000000:AGAATAAAAA:2, 000000:ATAAAAAAGA:2, 000000:AAATAAACAA:2, 000000:AAAAAAAACT:2, 000000:AAAAATAGAA:2, 000000:AAGAAACAAA:2, 000000:AAAAAACACA:2, 000000:AAAATAAAAT:2, 000000:CAAAAAAGAA:2, 000000:AAAATGAAAA:2, 000000:AGAAAAATAA:2, 000000:AAAGATAAAA:2, 000000:AAAAAATAAC:2, 000000:AAAACAAAGA:2, 000000:TCAAAAAAAA:2, 000000:AATAGAAAAA:2, 000000:AAACAAAAAG:2, 000000:AAAACAAAAT:2, 000000:AGAAAAAACA:2, 000000:ATAAAATAAA:2, 000000:ATAAAAAGAA:2, 000000:AACAAAAAAG:2, 000000:AAAAAGAACA:2, 000000:AAAAGCAAAA:2, 000000:AAAAAAGAGA:2, 000000:AAAAAAATCA:2, 000000:ATAGAAAAAA:2, 000000:AAAAAACTAA:2, 000000:AAAACAAATA:2, 000000:ACGAAAAAAA:2, 000000:AAGATAAAAA:2, 000000:AAAAAGACAA:2, 000000:AAAACTAAAA:2, 000000:AAGAAAAAAG:2, 000000:AATAAAAAAG:2, 000000:AAAAAAGGAA:2, 000000:ACACAAAAAA:2, 000000:AAAATAAATA:2, 000000:AAAATAAAGA:2, 000000:AGATAAAAAA:2, 000000:GAACAAAAAA:2, 000000:CAAAAAAAAG:2, 000000:ATAATAAAAA:2, 000000:AACAAACAAA:2, 000000:AAAAAACATA:2, 000000:TGAAAAAAAA:2, 000000:GAAAAGAAAA:2, 000000:AAAATCAAAA:2, 000000:AGAAAAAAAC:2, 000000:TAAACAAAAA:2, 000000:AACAAAAGAA:2, 000000:AAAAAGGAAA:2, 000000:AACACAAAAA:2, 000000:AAAAAATACA:2, 000000:AAAAATAAAC:2, 000000:AAATAAAAAC:2, 000000:AAAAAAGTAA:2, 000000:AATAAAAAAT:2, 000000:AAAACAATAA:2, 000000:AAAAAAGCAA:2, 000000:AGAAATAAAA:2, 000000:GAAAAAAAAG:2, 000000:AATACAAAAA:2, 000000:AATCAAAAAA:2, 000000:ACCAAAAAAA:2, 000000:CAAAAAGAAA:2, 000000:AAGAAAATAA:2, 000000:AATAAAATAA:2, 000000:ATAACAAAAA:2, 000000:AAAACGAAAA:2, 000000:GAAAAAGAAA:2, 000000:AAATAACAAA:2, 000000:ACAAAAAAAT:2, 000000:AAAAATATAA:2, 000000:AAATAAGAAA:2, 000000:CCAAAAAAAA:2, 000000:AACAAAAAAT:2, 000000:ATAAATAAAA:2, 000000:CAAGAAAAAA:2, 000000:AAACAAAATA:2, 000000:AAAAAAAACC:2, 000000:AAAAAGAAGA:2, 000000:AAAACAAAAG:2, 000000:ATAAAAAAAG:2, 000000:AAATAAATAA:2, 000000:TAATAAAAAA:2, 000000:CAAAATAAAA:2, 000000:AAAAAACCAA:2, 000000:TAAATAAAAA:2, 000000:AGAAAGAAAA:2, 000000:AAAGAAAAAT:2, 000000:AAAAAAACAC:2, 000000:AAAAAGATAA:2, 000000:AAGAAAAAAC:2, 000000:CAAACAAAAA:2, 000000:GAAAAAAACA:2, 000000:AAACAAATAA:2, 000000:AAGAAAAATA:2, 000000:AGAAAAAAAG:2, 000000:AAAAGACAAA:2, 000000:AAAAAGAAAG:2, 000000:AAAAAATTAA:2, 000000:AAAAAAAAGC:2, 000000:AAATATAAAA:2, 000000:AACAAAAAAC:2, 000000:AAAAAAATTA:2, 000000:AATAAACAAA:2, 000000:AAAGAGAAAA:2, 000000:AAAAAATATA:2, 000000:ACAAAAGAAA:2, 000000:AAGAAAAGAA:2, 000000:TAAAAAAGAA:2, 000000:GATAAAAAAA:2, 000000:AAAAGATAAA:2, 000000:ATTAAAAAAA:2, 000000:AAAAATGAAA:2, 000000:CAAAAAAAAT:2, 000000:AAGAGAAAAA:2, 000000:AAAATAAAAG:2, 000000:AAAAGAATAA:2, 000000:AAAAGAACAA:2, 000000:AACAAAAAGA:2, 000000:AACTAAAAAA:2, 000000:ACAAATAAAA:2, 000000:TTAAAAAAAA:2, 000000:AAAAACTAAA:2, 000000:AAGAAAAACA:2, 000000:AAGCAAAAAA:2, 000000:TACAAAAAAA:2, 000000:AAAAAAAAAA:0, 000000:TAAAAAAACA:2, 000000:AAAAGAAGAA:2, 000000:AAAATATAAA:2, 000000:ATAAAAACAA:2, 000000:ACAAAATAAA:2, 000000:AAAACAACAA:2, 000000:CAAAAAACAA:2, 000000:CACAAAAAAA:2, 000000:AAACAATAAA:2, 000000:AAAACAAACA:2, 000001:ACAAAAAAGA:2, 000001:GAAAATAAAA:2, 000001:CAACAAAAAA:2, 000001:AAAACAAAAC:2, 000001:AAACATAAAA:2, 000001:AACAATAAAA:2, 000001:AGAAAACAAA:2, 000001:ATGAAAAAAA:2, 000001:CAAAAATAAA:2, 000001:AAAAAGAATA:2, 000001:AATAAAAGAA:2, 000001:AAACAAAAAC:2, 000001:AAAAAAAGAT:2, 000001:AGAAAAAAAT:2, 000001:AAACAAACAA:2, 000001:AAAAAAAACG:2, 000001:AAAGAAACAA:2, 000001:ATACAAAAAA:2, 000001:AAAAACAACA:2, 000001:AAACAAAGAA:2, 000001:ACAAAGAAAA:2, 000001:AAACAAAAAT:2, 000001:TAAAAGAAAA:2, 000001:AAAAAATCAA:2, 000001:AAAAAATAAT:2, 000001:AAAACCAAAA:2, 000001:AAATAAAGAA:2, 000001:ATAAAAAACA:2, 000001:AAGAAATAAA:2, 000001:AAACCAAAAA:2, 000001:AAAAGGAAAA:2, 000001:AAGAAAGAAA:2, 000001:AAGTAAAAAA:2, 000001:AAAAAAGAAC:2, 000001:ACAAAAAATA:2, 000001:AACAAAATAA:2, 000001:AATGAAAAAA:2, 000001:ACTAAAAAAA:2, 000001:CAAAAAAAAC:2, 000001:AAAAAACGAA:2, 000001:AAATAAAAGA:2, 000001:GAAACAAAAA:2, 000001:AAAAGAAAGA:2, 000001:AAAATAAGAA:2, 000001:ACATAAAAAA:2, 000001:AAAGAAATAA:2, 000001:ATAAAAATAA:2, 000001:TAAAAAAAAT:2, 000001:AAAACATAAA:2, 000001:AAAAAAAAGG:2, 000001:AAAAAAAGAG:2, 000001:AAGAACAAAA:2, 000001:AAAAAAAATT:2, 000001:TAAAAAACAA:2, 000001:GAAATAAAAA:2, 000001:AAATCAAAAA:2, 000001:ACAAACAAAA:2, 000001:ACAATAAAAA:2, 000001:CAATAAAAAA:2, 000001:AAAAAAACCA:2, 000001:TAAAAAGAAA:2, 000001:ATAAAAAAAT:2, 000001:AAAAATTAAA:2, 000001:AGAAAAAATA:2, 000001:ACAACAAAAA:2, 000001:CAAAAAAACA:2, 000001:AAAAAAATGA:2, 000001:AAATAGAAAA:2, 000001:ACAAAAAAAC:2, 000001:AAAAACATAA:2, 000001:AAAAATAACA:2, 000001:GAAGAAAAAA:2, 000001:CAAAAAATAA:2, 000001:AAAAATAAAT:2, 000001:GACAAAAAAA:2, 000001:ATATAAAAAA:2, 000001:AAAAAACAAC:2, 000001:AAAAACACAA:2, 000001:AAAAACAGAA:2, 000001:TAAAAAAAGA:2, 000001:AAACAACAAA:2, 000001:TAAAAAAAAC:2, 000001:AACAAATAAA:2, 000001:AACCAAAAAA:2, 000001:AAAAAAGACA:2, 000001:AAAGAAAATA:2, 000001:CTAAAAAAAA:2, 000001:CAAAGAAAAA:2, 000001:AGTAAAAAAA:2, 000001:AAAAAAGAAG:2, 000001:CAAATAAAAA:2, 000001:ACAAAAACAA:2, 000001:AAAAACGAAA:2, 000001:AAAAAGCAAA:2, 000001:CAAAAGAAAA:2, 000001:AAGGAAAAAA:2, 000001:AAAAAAAATG:2, 000001:AAAATAACAA:2, 000001:AAAAAGAAAT:2, 000001:AATAAAGAAA:2, 000001:AACAAGAAAA:2, 000001:ATAAAGAAAA:2, 000001:AGAAAAAAGA:2, 000001:AAAAATAATA:2, 000001:AAAACACAAA:2, 000001:AGAGAAAAAA:2, 000001:AGACAAAAAA:2, 000001:AAAATAAAAC:2, 000001:AAAGAAAAGA:2, 000001:AAAAAAGATA:2, 000001:AAAAAAACGA:2, 000001:AAAGAATAAA:2, 000001:AAAAAAATAG:2, 000001:AAAATAATAA:2, 000001:TAGAAAAAAA:2, 000001:AACAAAAACA:2, 000001:GAAAAAAATA:2, 000001:AAAAAAAGAC:2, 000001:AAAATACAAA:2, 000001:GAAAAACAAA:2, 000001:AAAAAAATAT:2, 000001:AAAAGAGAAA:2, 000001:AAAGAAAAAC:2, 000001:AATAACAAAA:2, 000001:AAAAGAAAAT:2, 000001:AAATACAAAA:2, 000001:AAAAATAAAG:2, 000001:AAATGAAAAA:2, 000001:AAATAAAATA:2, 000001:AAAAAATGAA:2, 000001:AAAAAGAAAC:2, 000001:AAAAGAAAAC:2, 000001:AATAAAACAA:2, 000001:AAAAAAGAAT:2, 000001:AAAAAAAAGT:2, 000001:AAAAAGAGAA:2, 000001:AAAAACCAAA:2, 000001:CAAAAAAATA:2, 000001:TAAGAAAAAA:2, 000001:AAAAATAAGA:2, 000001:AAAAAACAAG:2, 000001:AAACGAAAAA:2, 000001:GAAAAAAAAT:2, 000001:AACATAAAAA:2, 000001:AAAGTAAAAA:2, 000001:CAGAAAAAAA:2, 000001:ACAAAAAGAA:2, 000001:AAAAAAAGCA:2, 000001:AAGACAAAAA:2, 000001:AAAGCAAAAA:2, 000001:ACAAAAATAA:2, 000001:CAAAACAAAA:2, 000001:AGAAAATAAA:2, 000001:AACAAAACAA:2, 000001:AAAAAAACAT:2, 000001:AGAAAAAGAA:2, 000001:ATAAAACAAA:2, 000001:AAAAATACAA:2, 000001:AAAAAAAGTA:2, 000001:TAAAACAAAA:2, 000001:ACAAAAAAAG:2, 000001:TAACAAAAAA:2, 000001:GAAAAAAAAC:2, 000001:GGAAAAAAAA:2, 000001:TAAAGAAAAA:2, 000001:AGAAGAAAAA:2, 000001:AAATTAAAAA:2, 000001:AAACAAAAGA:2, 000001:AAATAAAAAG:2, 000001:AATAAAAAGA:2, 000001:AAATAAAAAT:2, 000001:AAAGAACAAA:2, 000001:TATAAAAAAA:2, 000001:AAGAAAAAGA:2, 000001:AAAAAAAGGA:2, 000001:ATAAAAAATA:2, 000001:TAAAAACAAA:2, 000001:AACAGAAAAA:2, 000001:AAAAACAAAC:2, 000001:AAAAAAAATC:2, 000001:CAAAAAAAGA:2, 000001:AAAAAATAGA:2, 000001:AAAAAAATAC:2, 000001:GAGAAAAAAA:2, 000001:AAAAAGTAAA:2, 000001:AACAAAGAAA:2, 000001:AAACAAGAAA:2, 000001:ATAAAAGAAA:2, 000001:ACAAAACAAA:2, 000001:AAAAAAACAG:2, 000001:GAAAAATAAA:2, 000001:TAAAATAAAA:2, 000001:GAAAAAAGAA:2, 000001:ACAAAAAACA:2, 000001:AAAAGTAAAA:2, 000001:AATAAAAAAC:2, 000001:AAAAAACAGA:2, 000001:AAAGAAAACA:2, 000001:GAAAACAAAA:2, 000001:GAATAAAAAA:2, 000001:AGAAACAAAA:2, 000001:AAAAGAAAAG:2, 000001:AATTAAAAAA:2, 000001:GTAAAAAAAA:2, 000001:TAAAAATAAA:2, 000001:AAAATAAACA:2, 000001:TAAAAAAAAG:2, 000001:AAGAAAACAA:2, 000001:TAAAAAATAA:2, 000001:AGGAAAAAAA:2, 000001:CAAAAACAAA:2, 000001:AAAAGAAATA:2, 000001:ACAGAAAAAA:2, 000001:AAAGAAGAAA:2, 000001:AAAAACAATA:2, 000001:AAAGAAAGAA:2, 000001:AAAAAAACTA:2, 000001:AAGAATAAAA:2, 000001:AATAAGAAAA:2, 000001:AAAAATCAAA:2, 000001:TAAAAAAATA:2, 000001:AAAAAATAAG:2, 000001:AGAAAAGAAA:2, 000001:AATAAAAATA:2, 000001:AAATAAAACA:2, 000001:ATCAAAAAAA:2, 000001:ATAAAAAAAC:2, 000001:GAAAAAAAGA:2, 000001:ATAAGAAAAA:2, 000001:AACAAAAATA:2, 000001:AAAGGAAAAA:2, 000001:ACAAGAAAAA:2, 000001:CATAAAAAAA:2, 000001:AGAACAAAAA:2, 000001:AAGAAAAAAT:2, 000001:GAAAGAAAAA:2, 000001:AATAATAAAA:2, 000001:GAAAAAACAA:2, 000001:AAACACAAAA:2, 000001:GAAAAAATAA:2, 000001:AAAAACAAAT:2, 000001:AAACAAAACA:2, 000001:AATAAAAACA:2, 000001:AAAGACAAAA:2, 000001:AAAATAGAAA:2, 000001:AAAACAAGAA:2, 000001:AAAATTAAAA:2, 000001:AAAACAGAAA:2, 000001:AAAAACAAAG:2, 000001:ATAAACAAAA:2, 000001:AGCAAAAAAA:2, 000001:AACAACAAAA:2, 000001:AAAGAAAAAG:2, 000001:AAAAAACAAT:2, 000001:AAAAGAAACA:2, 000001:AAACTAAAAA:2, 000001:AACGAAAAAA:2, 000001:CGAAAAAAAA:2, 000001:AAATAATAAA:2, 000001:AATAAATAAA:2, 000001:AGAAAAACAA:2, 000001:AAGAAGAAAA:2, 000001:AATATAAAAA:2, 000001:AAACAGAAAA:2, 000001:GCAAAAAAAA:2, 000001:AAAAACAAGA:2, 000001:AGAATAAAAA:2, 000001:ATAAAAAAGA:2, 000001:AAATAAACAA:2, 000001:AAAAAAAACT:2, 000001:AAAAATAGAA:2, 000001:AAGAAACAAA:2, 000001:AAAAAACACA:2, 000001:AAAATAAAAT:2, 000001:CAAAAAAGAA:2, 000001:AAAATGAAAA:2, 000001:AGAAAAATAA:2, 000001:AAAGATAAAA:2, 000001:AAAAAATAAC:2, 000001:AAAACAAAGA:2, 000001:TCAAAAAAAA:2, 000001:AATAGAAAAA:2, 000001:AAACAAAAAG:2, 000001:AAAACAAAAT:2, 000001:AGAAAAAACA:2, 000001:ATAAAATAAA:2, 000001:ATAAAAAGAA:2, 000001:AACAAAAAAG:2, 000001:AAAAAGAACA:2, 000001:AAAAGCAAAA:2, 000001:AAAAAAGAGA:2, 000001:AAAAAAATCA:2, 000001:ATAGAAAAAA:2, 000001:AAAAAACTAA:2, 000001:AAAACAAATA:2, 000001:ACGAAAAAAA:2, 000001:AAGATAAAAA:2, 000001:AAAAAGACAA:2, 000001:AAAACTAAAA:2, 000001:AAGAAAAAAG:2, 000001:AATAAAAAAG:2, 000001:AAAAAAGGAA:2, 000001:ACACAAAAAA:2, 000001:AAAATAAATA:2, 000001:AAAATAAAGA:2, 000001:AGATAAAAAA:2, 000001:GAACAAAAAA:2, 000001:CAAAAAAAAG:2, 000001:ATAATAAAAA:2, 000001:AACAAACAAA:2, 000001:AAAAAACATA:2, 000001:TGAAAAAAAA:2, 000001:GAAAAGAAAA:2, 000001:AAAATCAAAA:2, 000001:AGAAAAAAAC:2, 000001:TAAACAAAAA:2, 000001:AACAAAAGAA:2, 000001:AAAAAGGAAA:2, 000001:AACACAAAAA:2, 000001:AAAAAATACA:2, 000001:AAAAATAAAC:2, 000001:AAATAAAAAC:2, 000001:AAAAAAGTAA:2, 000001:AATAAAAAAT:2, 000001:AAAACAATAA:2, 000001:AAAAAAGCAA:2, 000001:AGAAATAAAA:2, 000001:GAAAAAAAAG:2, 000001:AATACAAAAA:2, 000001:AATCAAAAAA:2, 000001:ACCAAAAAAA:2, 000001:CAAAAAGAAA:2, 000001:AAGAAAATAA:2, 000001:AATAAAATAA:2, 000001:ATAACAAAAA:2, 000001:AAAACGAAAA:2, 000001:GAAAAAGAAA:2, 000001:AAATAACAAA:2, 000001:ACAAAAAAAT:2, 000001:AAAAATATAA:2, 000001:AAATAAGAAA:2, 000001:CCAAAAAAAA:2, 000001:AACAAAAAAT:2, 000001:ATAAATAAAA:2, 000001:CAAGAAAAAA:2, 000001:AAACAAAATA:2, 000001:AAAAAAAACC:2, 000001:AAAAAGAAGA:2, 000001:AAAACAAAAG:2, 000001:ATAAAAAAAG:2, 000001:AAATAAATAA:2, 000001:TAATAAAAAA:2, 000001:CAAAATAAAA:2, 000001:AAAAAACCAA:2, 000001:TAAATAAAAA:2, 000001:AGAAAGAAAA:2, 000001:AAAGAAAAAT:2, 000001:AAAAAAACAC:2, 000001:AAAAAGATAA:2, 000001:AAGAAAAAAC:2, 000001:CAAACAAAAA:2, 000001:GAAAAAAACA:2, 000001:AAACAAATAA:2, 000001:AAGAAAAATA:2, 000001:AGAAAAAAAG:2, 000001:AAAAGACAAA:2, 000001:AAAAAGAAAG:2, 000001:AAAAAATTAA:2, 000001:AAAAAAAAGC:2, 000001:AAATATAAAA:2, 000001:AACAAAAAAC:2, 000001:AAAAAAATTA:2, 000001:AATAAACAAA:2, 000001:AAAGAGAAAA:2, 000001:AAAAAATATA:2, 000001:ACAAAAGAAA:2, 000001:AAGAAAAGAA:2, 000001:TAAAAAAGAA:2, 000001:GATAAAAAAA:2, 000001:AAAAGATAAA:2, 000001:ATTAAAAAAA:2, 000001:AAAAATGAAA:2, 000001:CAAAAAAAAT:2, 000001:AAGAGAAAAA:2, 000001:AAAATAAAAG:2, 000001:AAAAGAATAA:2, 000001:AAAAGAACAA:2, 000001:AACAAAAAGA:2, 000001:AACTAAAAAA:2, 000001:ACAAATAAAA:2, 000001:TTAAAAAAAA:2, 000001:AAAAACTAAA:2, 000001:AAGAAAAACA:2, 000001:AAGCAAAAAA:2, 000001:TACAAAAAAA:2, 000001:AAAAAAAAAA:0, 000001:TAAAAAAACA:2, 000001:AAAAGAAGAA:2, 000001:AAAATATAAA:2, 000001:ATAAAAACAA:2, 000001:ACAAAATAAA:2, 000001:AAAACAACAA:2, 000001:CAAAAAACAA:2, 000001:CACAAAAAAA:2, 000001:AAACAATAAA:2, 000001:AAAACAAACA:2 %#% 11:37:56 ( 0.0000 : 0.0000) Processed 1 sequences Average +length: 11 Total fuzzy comparisons: 812 %#% 11:37:56 ( 0.0000 : 0.0000) Finish Total Elapsed Time: 0 secs **** WordSize: 10 StringSize: 1000 WordCount: 1 StringCount: 1000 Xor +: 0 Trie: 5.86920619010925 (0.03125 + 5.83795619010925) **** WordSize: 10 StringSize: 1000 WordCount: 1 StringCount: 1000 Xor +: 0 Trie: 5.86920619010925 (0.03125 + 5.83795619010925) %#% 11:37:57 ### xor_fuzz search done. %#% 11:37:57 ( 0.0156 : 0.0156) Finish Total Elapsed Time: 0.015625 secs [11:37:57.14] P:\test\demerphq>

    The explaination of why your code only finds two matches, and attributes them to exact matches and mine finds those two + 810 more?

    Simple. Your Trie implementation will only report the first match it finds at any given position. If you delete the first key in test.words 'AAAAAAAAAA', then it will still report finding 'AAAAAAAAAA'! because every other key in that file can be considered a fuzzy match for 'AAAAAAAAAA'.

    What this means is that your code is only doing a fraction of the work that mine is doing (and that is required). In the case of the 10-char keys above, you code is only doing 1/406th (2.5%) of the required work. By the time you get to 25-char keys, this drops to 1/2701th (0.03%).

    For the explaination of those numbers, you need only go back and read my analysis , and make some attempt to understand those "...totally nonsensical equations and arguments supposedly proving how my approach just cant work, ..."

    So, for your "...but hey it outperforms yours..." claims. If you only do a tiny fraction of what is required--sure you'll do that quicker. Even that claim is bogus if you cannot build the datastructure required to achieve your "performance".

  2. Update: Section removed. I screwed up. 'B' is not a part of the alphabet 'ACGT' under test.
  3. There are myriad other things wrong with both your algorithm and your implementation of it.
    • Your home-brew timing code incorporates it's own runtime into the things under test.
    • Your logging code--in particular, the code that limits the number of matches output, for no good reason other than beautification--imposes a substantial overhead on both our algorithms.
      • By accumulating the matches in an array, instead of outputting them as the are found.
      • By spending time formatting (push @time, join, @_, sub sprintf ... join, map{ sprintf } @matches ) them into a single line, rather than just printing them!

      If you get around to performing your timings without that code in place you'll find that not only does it markedly alter the results, but that the impact of it, due to the differences in our algorithms (and nesting levels at which it is done), is disproportionate. I'll let you work out for yourself which algorithm is most affected.

    • Your protoype -v- production, Perl --v- C claims "...imagine what a String/Struct/Inline::C/Perl hybrid would do. Using probably less than 1/100th of the ram..." (since silently modified.) are also totally without basis.

      Even if you could achieve a 100:1 reduction in the size of the datastructure required to hold your MegaTrie/DFA (a more realistic figure would be 20:1), then you would still be (at 7GB), nearly an order of magnitude 3 1/2 times over budget for the 2GB/process RAM addressable by your average 32-bit processor.

All in all--eminently worthy of my "Utterly bollocked" private /msg.


I could gone on, but then this would seem like a hatchet job. It isn't. I've expended the best part of this last week on this analysis, on top of the time I put into that I had already done. That you so readily dismissed as "...totally nonsensical equations and arguments...".

This is the 5th revision of this document. Each time I've attempted to remove the vitriol from it. Reading it back now I see I still probably haven't achieved that as well as people will expect. Sorry, but every time I read this and this, it gets me so hot under the collar that it just comes out. So be it, whatever is left will stand. If this amount of effort is rewarded with downvotes that's fine by me.

Two final comments:

  1. It is a truism that "War always causes an arms race; technology always advances faster under war than under peace".

    And so it is. As a result of your pushing, and the penny clicking that an optimisation alluded to in the OP (Option #2), I have now improved the performance of my algorithm, especially on long sequences (strings) by almost an order of magnitude compared to the code I previously published. That still means that the OPs stated requirements will take a substantial amount of time, but then they are substantial requirements.

  2. I probably wouldn't have pursued that optimisation had you not so incensed me.

    Imagine what we could do if we worked together, instead of against each other? Of course, it seems to be human nature that we (man in general; and you and I specifically), cannot make this transition.

    I think that I see a way to combine the XOR method with a (series of small) Tries such that it would achieve an even faster solution whilst staying within the realms of reality as far as memory consumption is concerned. I've implemented a generic, pure Perl Trie that uses substantially less memory (though it is slightly slower than yours).

    If things pan out, you might hear more on that. In any case, the Trie implementation has some rather interesting other uses, so you might here more about that anyway.


Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
  • Comment on Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really, really!)
  • Select or Download Code

Replies are listed 'Best First'.
Re^6: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really, really!)
by ysth (Canon) on Nov 28, 2004 at 19:56 UTC
    Even if you could achieve a 100:1 reduction in the size of the datastructure required to hold your MegaTrie/DFA (a more realistic figure would be 20:1), then you would still be (at 7GB), nearly an order of magnitude 3 1/2 times over budget for the 2GB/process RAM addressable by your average 32-bit processor.
    So do it in multiple passes, with as many keywords as fit comfortably in memory. As you say, the problem has substantial requirements.

    And surely the bulk of the problem is to find which offsets match; identifying the particular strings that match at a given offset, if this is indeed necessary, is much easier and can be a separate step.

      Okay. Let's assume for the moment that we've successfully par'ed down the memory consumption of the Trie to 1/20th of the prototypes usage. That means that we can now successfully build a Trie from 1000 keys instead of 50. For the 100,000 keys you need 100 passes of the 30,000 sequences.

      The performance of the Trie comes from trading the time consumed building the Trie, against the speed of scanning the sequences; and amortising the build time across the sequences scanned.

      But now we have to build 100 trees. Thus we've raised that overhead by 2 orders of magnitude. On top of that, we now have to scan all the sequences 100 times; multiplying the scan time by 2 orders of magnitude.

      If you take a look at the numbers (in the program output), you'll see that building a Trie with a single 25-char words takes around 8.5 seconds; and for 50 x 25-char words 491 seconds (9.84 sec/word). Based on those numbers*, you can expect building a Trie of 1000x25-char words to take at least 2 hrs 21 minutes (8.5secs * 1000). Building 100 of those will take seven months! And that's before we start scanning the 30,000 sequences 100 times each.

      *It is possible that this build-time might be reduced by moving to a C-implementations. It's also possible that it would not. Most of the time taken building the Trie comes not from pointer chasing, or anything else that Perl is significantly slower than C for. Most of the time stems from memory allocation and management. Perl is actually very good at this. Indeed, the fastest Perl builds are those that use Perls built-in malloc/free routines in preference to those in your local C-library. Perl does this well--and reliably. Any savings that might comes from implementing the memory management in C would come at the expense of reliability and at considerable develoment cost. Doing memory management in C is hard. Perl's routines have had many years of evolution. It would be the height of arrogance to think that you are going to do this better at your 1st or 2nd or even 5th attempt.

      identifying the particular strings that match at a given offset, if this is indeed necessary, is much easier and can be a separate step.

      Now you're faced with the task of comparing each of the matching substrings against each of the 100,000 keys and then determining if it is a 0, 1, or 2-mismatch for each of them.

      And how will you make the fuzzy-level determination? Fall back to the Xor method?

      The question of whether it is important to know what level of mis-match, each of the matched substrings matched with, is an open question; that only the OP can answer. I think it likely is, but I'm guessing.

      However, knowing which of the 100,000 keys the substring matched against is important. As I said before, I'm not a Biochemist or a Genetic Engineer, but I find it extrememly unlikely that only knowing that the 25 base-pairs at offset:3129 of sequence:29,001, matched(possibly exactly, possibly with one mis-match;; possibly with 2 mis-matches) against one of these 100,000 x 25-base pair 'bits', is likely to be of any great use in solving the original problem?

      This is the same conclusion I reached when I originally looked at using Tries for this problem in Re^3: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.). To quote myself:

      MegaTrie

      Now it is possible to combine the keys generated from more than one exact key, into a single Trie. At least in theory, this would allow the 100,000 exact keys + all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, by solving all 100,000 * 2700 (2-miss) "Does it match?" questions, in one go, for each 25-char substring. Reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back). When compared with the 954,528,000,000,000 otherwise required by the 2-miss scenario, this sounds very inviting.

      The problem is, that all you then know, is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match, each of the 29,280,000 substrings. So, you get to know something, (that the 25-char substring at position P of sequence S matched something)--very quickly-- but not what (which of the 100,000), nor why (exact, 1-miss or 2--miss).


      Examine what is said, not who speaks.
      "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
      "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re^6: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Sigh)
by demerphq (Chancellor) on Nov 29, 2004 at 07:42 UTC

    You havent proved anything here. (Except that you dont understand how my solution works, the list you built meant my code was trying to match all _4_ digit fuzzy strings, and not the _2_ digits we were originally discussing, this presumably is where you get the misconception that it wont handle 25 digit keys.) The behaviour you have posted here is exactly as I predicted. And its correct. As I stated in my earlier post and as ysth pointed out as well its trivial to determine what keys also match if a given key matches. So you havent proved anything here. I could have written the code so that instead of outputing only the literal string matched it would output a precalculated list of all the possible variants. Which actually brings me to yet another mathematical problem with your code. Your list omits the 30 words that have only 1 digit different.

    Point in fact you say my idea doesnt work. So I posted working code. Then you say its broken, when in fact it does exactly what it was advertised to do. Face it BrowserUk you are and were wrong. Your XOR approach is slow, so slow as to be useless for nontrivial searches. The state machine is fast, and where it has an up-front overhead can be reused over and over. No matter what you do you arent going to get around that. Sorry.

    I suggest you drop this, all you are doing is embarrasing yourself now.

    ---
    demerphq

      ...the list you built meant my code was trying to match all _4_ digit fuzzy strings, and not the _2_ digits we were originally discussing...
      01234567890 AAAAAAAAAA AAAAAAAAAAA ========== Offset 0 -- 0 mismatches. 01234567890 AAAAAAAAAA AAAAAAAAAAA ========== Offset 1 -- 0 mismatches. 01234567890 CCAAAAAAAA AAAAAAAAAAA xx======== Offset 0 -- 2 mismatches. 12345678901 CCAAAAAAAA AAAAAAAAAAA xx======== Offset 1 -- 2 mismatches. ## 806 other 2-mismatch matches your code fails to find 12345678901 AAAAAAAATT AAAAAAAAAAA ========xx Offset 0 -- 2 mismatches. 12345678901 AAAAAAAATT AAAAAAAAAAA ========xx Offset 1 -- 2 mismatches.

      QED. 2 not 4.

      Perhaps you should try this

      Update: And don't you realise that the list of 4 character mismatches would include all of the 2-char mismatches? And the 1-char mismatches? And the 3-char mismatches? And you code found what? Just TWO exact matches...pull the other one.


      Examine what is said, not who speaks.
      "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
      "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://410708]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (8)
As of 2024-04-23 14:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found