Perl: the Markov chain saw PerlMonks

### Fuzzy Searching: Optimizing Algorithm Selection

by Itatsumaki (Friar)
 on Nov 10, 2004 at 23:23 UTC Need Help??

Itatsumaki has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm well-aware that fuzzy-matching (e.g. Levenstein and String::Approx) have been discussed here several times. I need to do a very large quantity of fuzzy matching, and I'm looking for some theoretical suggestions on how to start developing. This is to be design-time optimization, not premature optimization, I hope.

The basic problem:
Matching a 25-letter string against a dictionary of about 30 thousand variable length sequences. I need to find all:

• exact
• fuzzy (1- and 2-letter mismatches)
within the dictionary.

Performance matters because I have a library of several hundred thousand 25-letter strings. And because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.

Three basic approaches come to mind for the fuzzy-matching:

1. Dynamically construct 25 separate regex's, one for each possible mismatch position. For instance:
```my @letters = split(//, \$string);
my \$i = 0;
my @regexes;
for (my \$j = 0; \$j < scalar(@letters); \$j++) (
for (my \$i = 0; \$i < scalar(@letters); \$k++) {
if (\$i != \$j) {
\$regexes[\$i] .= \$letters[\$j];
}
else {
\$regexes[\$i] .= '.{1}';
}
}
}
I'm sure that can be tuned substantially, just trying to give the general idea of what I had in mind.
2. On the other hand, because I know that my search sequence will always be 25 letters long I know that every fuzzy-by-one match will contain a 12-letter exact match, and any fuzzy-by-two match will contain an 7-letter exact match so I could index the large library for all possible 12-mers. I would then break each sequence into the 13 possible 12-mers and extend on either side looking for fuzzy-matches as above. The advantage of this approach would be that I am only searching regions of the library that have a theoretical chance of matching my string.
3. Finally, since I have large libraries of both search strings (25-letters) and target strings (variable length), it might be advantageous to just parse out all the possible 25-length strings from the target library once. Once I have this index, identifying exact or fuzzy matches would be trivial.

So you can see the reason I'm looking for suggestions now -- only one of these three approaches works on single-strings. The performance of the other two won't seem to scale with the search and target library sizes in any trivial way. I'm not too sure how to figure out which one will be optimal for any given situation.

Any suggestions on how to determine which might be optimal for a given situation? And are there other (potentially faster) approaches I might have missed?

One note: I *am* aware that BLAST can do option #2 above -- I'm more trying to determine which option will be faster to run than worrying about ease-of-implementation right now.

-Tats

Replies are listed 'Best First'.
Re: Fuzzy Searching: Optimizing Algorithm Selection
by tachyon (Chancellor) on Nov 11, 2004 at 00:29 UTC

The best approach may simply be a straight string scan fail fast approach. You can get this far more optimized than the RE engine could manage. It should certainly be in C for speed.

Breaking the dictionary into all 25 char long patterns serves no useful purpose - it simply increases the disk I/O by a factor of 25x. Ideally you would want the entire dictionary in memory anyway so you totally avoid any disk I/O bottlenecks.

The type of indexing you may or may not be able to do depends on the data. You can form a useful index based on the fact, that as you note, even an off by 2 match must have at least 7 consecutive matching chars - in fact I beleive it is impossible to have less than 8 as 25-2 = 23 Thus the worst case min lengths possible are 7+8+8=23 so you must have at least 2 8+ char stings that match.

If you only have 2 possible values in your strings the probability of randonly finding 8 consecutive chars in a given order is 2^8 or 1/256, If you are talking 4 chars ie CGAT then it is 1/65536. This should dramatically shrink the search space as you only need to scan 17 chars upstream and downstream of this sequence. Each search string can be broken into 18 8 char possibles. The problem with this may be (assuming CGAT alphabet) and 100 search strings you will have nearly 1800 unique index strings giving an average spacing in the search string of 65536/1800 ~ 36 chars. Now we have to search 17+8+17 = 42 chars so in a nutshell you still probably have to scan the whole search string every time. The advantage is limited to having to perform the scan on a limited subset of the find strings. There is of course a price for the hashing and lookup. If the alphabet (A) is larger than 4 chars then you should see very significant benefits due the to relative scarcity of A^8 substrings. I would quite possibly be worth precomputing the hash values for all the 8 char strings in your find an search spaces to speed your many to many search.

Re: Fuzzy Searching: Optimizing Algorithm Selection
by tall_man (Parson) on Nov 11, 2004 at 00:06 UTC
Your first idea might be improved by using Regexp::Optimizer to combine all the patterns into one:
```use strict;
use Regexp::Optimizer;

my \$string = "abcdefghijklmnopqrstuvwxy";
my @letters = split(//, \$string);
my \$i = 0;
my @regexes;
for (my \$j = 0; \$j < scalar(@letters); \$j++) {
for (my \$i = 0; \$i < scalar(@letters); \$i++) {
if (\$i != \$j) {
\$regexes[\$i] .= \$letters[\$j];
}
else {
\$regexes[\$i] .= '.';
}
}
}

my \$rl  = Regexp::Optimizer->new;
my \$re = \$rl->list2re(@regexes);
print "re: \$re\n";
my \$r = qr/\$re/;

my \$tomatch = "wwwabcdefghjjklmnopqrstuvwxyqq";
if (\$tomatch =~ \$r) {
print "it matched\n";
}
Re: Fuzzy Searching: Optimizing Algorithm Selection
by BrowserUk (Pope) on Nov 11, 2004 at 02:29 UTC

If you XOR two strings together,

```print join'',unpack 'C*',
'AGGCGGAAGCACCCAACAGCAACAG'
^ 'ACACGCAAAAACCGAAGAGAAAGCG';

0460040062000400400200420

The resultant string will be the null character (chr( 0 )) in each position where the two strings match and some other char where they did not.

If you count the non-null bytes, it gives you a score for the fuzziness of the match.

```print \$_ = grep \$_,unpack 'C*',
'AGGCGGAAGCACCCAACAGCAACAG'
^ 'ACACGCAAAAACCGAAGAGAAAGCG';

10

To avoid having to process the large file of sequences multiple times, it is better to process all of the 25-ers against each sequence in turn. By using substr to step down the sequence, you can perform the XOR against each of the 25-ers for each position of the sequence.

Putting that all together, I got:

```#! perl -slw
use strict;
use bytes;
\$| = 1;

our \$FUZZY ||= 10;

open FUZ, '< :raw', \$ARGV[ 0 ] or die "\$ARGV[ 0 ] : \$!";
my %fuz;
\$fuz{ \$_ } = '' while chomp( \$_ = <FUZ> );
close FUZ;

print "Loaded \${ \scalar keys %fuz } 25-ers";

open SEQ, '< :raw', \$ARGV[ 1 ] or die "\$ARGV[ 1 ] : \$!";

while( my \$seq = <SEQ> ) {
chomp \$seq;

for my \$offset ( 0 .. length( \$seq ) - 25 ) {
printf "\rProcessing sequence %5d offset %05d", \$., \$offset;

for my \$fuz ( keys %fuz ) {
my \$m =  grep \$_, unpack 'C*', \$fuz ^ substr( \$seq, \$offse
+t, 25 );

if( \$m <= \$FUZZY ) {

## This stores the lineno/offset/fuzziness where each
+25-er
##  matched in a compact form for further process; sor
+ting etc.
\$fuz{ \$fuz } .= pack 'nnn', \$., \$offset, \$m;

## Or just print out the data to a file.
#                print "\nMatched '\$fuz' -v- '",
#                    substr( \$seq, \$offset, 25 ),
#                "' in line: \$. @ \$offset with fuzziness: ", \$m;
}
}
}
}

close SEQ;

This process is currently running 500,000 25-ers against a file 30,000 x ave. 1000 characters (500 - 1500) sequences (all randomly generated), at the approximate rate of 3.5 seconds per offset, whilst recording the lineno/offset/fuzziness for all matches where the fuzziness is 10 or less. The advantage may be that this process will give you an accurate fuzziness factor for every 25-er matched against every position in a single pass.

I don't have any feel for how this compares to using the regex engine to perform the matching, but I think it may run quicker.

If you could indicate how many 100s of 1000s of 25's you are playing with, plus the average length of the 30,000 sequences, it would be possible to calculate an approximate time for the process that you could then compare with other approaches?

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

At 3.5 seconds per offset it will complete in roughly 3 years ;-) Nonetheless XOR works quickly in Perl. A small optimisation is to take the substr outside the inner keys loop as you are doing it 499,999 more times than required per offset.

Real sample data would be nice. The benefits from indexing depend on the alphabet size and the number of find strings.

cheers

tachyon

Update: Indeed. The two optimisations reduce the 500,000 comparisons time from ~3.5 seconds to + .5 of a second. That reduces the projected overall runtime of my test scenario from 3+ years to under half a year. Worth doing :)

Yes. I performed the same calculation based upon my decision to use 500,000 25-ers. Hence my decision to ask for clarification.

There are several ways to speed up the processing. Using

```( substr( \$seq, \$offset, 25 ) ^ \$25er ) =~ tr[\0][\0];

to do the counting, rather than

```grep \$_, unpack 'C*, ...

is (probably) another.

I'm just waiting for a limited benchmark I have running to complete before trying several things.

I had thought that by avoiding actually copying each 25 char string out of the sequence I might save some time/memory, but now you've pointed it out, I realise that I can create an LVALUE ref to the substring and avoid 500,000 calls to substr for each inner loop. Thanks.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
There are proprietary libraries (in C) that do this sort of thing in a couple milliseconds per query string. The basic idea is that you use your dictionary of sequences to construct a finite automaton, which either accepts or rejects your query string. This does require some cleverness to do fuzzy matching, and I'm afraid I don't know enough of the details to help you out there.

A DFA implmentation is the next step from trie. In fact a trie can be viewed as an alternative representation of a DFA with anchoring at the string start.

---
demerphq

In case Itatsumaki should ever come back. Here's a somewhat improved implementation of my Xor algorithm. The original projection of 3 1.2 years runtime to process 100,000 x 1,000 x 30,000 is now reduced to 7.6 days:

Update (2004/12/08): Updated code to correct an error that manifested itself when the sequences were not an exact multiple of the key length. (As noted below by demerphq)

```#! perl -slw
use strict;
use bytes;

our \$FUZZY ||= 2;

open KEYS, '<', \$ARGV[ 0 ] or die "\$ARGV[ 0 ] : \$!";
my @keys = <KEYS>;
close KEYS;
chomp @keys;
warn "Loaded \${ \scalar @keys } keys";

open SEQ, '<', \$ARGV[ 1 ] or die "\$ARGV[ 1 ] : \$!";

my \$totalLen = 0;
my \$count = 0;

while( my \$seq = <SEQ> ) {
chomp \$seq;
my \$seqLen = length \$seq;
\$totalLen += \$seqLen;

for my \$key ( @keys ) {
my \$keyLen     = length \$key;
my \$mask       = \$key x ( int( \$seqLen / \$keyLen ) + 1 );
my \$minZeros   = chr( 0 ) x int( \$keyLen / ( \$FUZZY + 1 ) );
my \$minZlen       = length \$minZeros;

for my \$offset1 ( 0 .. \$keyLen-1 ) {

\$pos = 0;
while(
\$pos = 1+index  \$masked, \$minZeros, \$pos
) {
\$pos--;
my \$offset2 = \$pos - (\$pos % \$keyLen );
last unless \$offset1 + \$offset2 + \$keyLen <= \$seqLen;

my \$fuz = \$keyLen
- ( substr( \$masked, \$offset2, \$keyLen ) =~ tr[\0]
+[\0] );
if( \$fuz <= \$FUZZY ) {
printf "\tFuzzy matched key:'\$key' -v- '%s' in lin
+e:"
.  "%2d @ %6d (%6d+%6d) with fuzziness: %d\n",
+
substr( \$seq, \$offset1 + \$offset2, \$keyLen ),
\$., \$offset1 + \$offset2, \$offset1, \$offset2, \$
+fuz;
}
\$pos = \$offset2 + \$keyLen;
}
}
}
}
warn "\n\nProcessed \$. sequences";
warn "Average length: ", \$totalLen / \$.;
close SEQ;

A coupe of runs (on single sequences for timing purposes) on data comparable (produced by the same code) as other timings published elsewhere:

```[ 6:57:56.46] P:\test\demerphq>
..\406836-3
Fuzz-Words-W0025-S100000-WC100000-SC0001.fuzz
Fuzz-Strings-W0025-S100000-WC100000-SC0001.fuzz

Loaded 100000 keys at P:\test\406836-3.pl line 12.
seq:00001 (100000)
1 @  69364 (    14+ 69350) with fuzziness: 0
1 @  24886 (    11+ 24875) with fuzziness: 0
1 @  40056 (     6+ 40050) with fuzziness: 0
1 @  68870 (    20+ 68850) with fuzziness: 0
1 @   3264 (    14+  3250) with fuzziness: 0
1 @   8744 (    19+  8725) with fuzziness: 0
1 @   7493 (    18+  7475) with fuzziness: 0
1 @  28209 (     9+ 28200) with fuzziness: 0
1 @  91337 (    12+ 91325) with fuzziness: 0
1 @  63018 (    18+ 63000) with fuzziness: 0
1 @  61025 (     0+ 61025) with fuzziness: 0
1 @  32114 (    14+ 32100) with fuzziness: 0
1 @  30461 (    11+ 30450) with fuzziness: 0
1 @  59174 (    24+ 59150) with fuzziness: 0
1 @  74084 (     9+ 74075) with fuzziness: 0
1 @  58322 (    22+ 58300) with fuzziness: 0
1 @  78465 (    15+ 78450) with fuzziness: 0
1 @  56190 (    15+ 56175) with fuzziness: 0
1 @  14968 (    18+ 14950) with fuzziness: 0
1 @  31986 (    11+ 31975) with fuzziness: 0
1 @  60748 (    23+ 60725) with fuzziness: 0
1 @  93369 (    19+ 93350) with fuzziness: 0
1 @   6242 (    17+  6225) with fuzziness: 0
1 @  15282 (     7+ 15275) with fuzziness: 0
1 @  13293 (    18+ 13275) with fuzziness: 0

Processed 1 sequences at P:\test\406836-3.pl line 57, <SEQ> line 1.
Average length: 100000 at P:\test\406836-3.pl line 58, <SEQ> line 1.

[ 7:28:22.37] P:\test\demerphq>

[ 8:36:32.71] P:\test\demerphq>
..\406836-3
Fuzz-Words-W0025-S1000-WC100000-SC0010.fuzz
Fuzz-Strings-W0025-S1000-WC100000-SC0010.fuzz

Loaded 100000 keys at P:\test\406836-3.pl line 12.
seq:00001 (01000)
1 @     94 (    19+    75) with fuzziness: 0
1 @    692 (    17+   675) with fuzziness: 0
1 @    326 (     1+   325) with fuzziness: 0
1 @     35 (    10+    25) with fuzziness: 0
1 @    826 (     1+   825) with fuzziness: 0
1 @    598 (    23+   575) with fuzziness: 0
1 @    860 (    10+   850) with fuzziness: 0
1 @    489 (    14+   475) with fuzziness: 0
1 @    370 (    20+   350) with fuzziness: 0
1 @    745 (    20+   725) with fuzziness: 0
1 @    297 (    22+   275) with fuzziness: 0
1 @    415 (    15+   400) with fuzziness: 0
1 @    119 (    19+   100) with fuzziness: 0
1 @    957 (     7+   950) with fuzziness: 0
1 @    646 (    21+   625) with fuzziness: 0
1 @    779 (     4+   775) with fuzziness: 0

Processed 1 sequences at P:\test\406836-3.pl line 57, <SEQ> line 1.
Average length: 1000 at P:\test\406836-3.pl line 58, <SEQ> line 1.

[ 8:36:54.42] P:\test\demerphq>

Examine what is said, not who speaks.
"But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

As far as I can tell this code fails the "A" x 10/"A" x 11 test that you challenged my code with in a different thread. I was able to fix the bug by changing the if (\$fuz <= \$FUZZY) test to the following:

```  if( \$fuz <= \$FUZZY and \$offset1+\$offset2+\$keyLen<=\$seqLen) {

I have put together a test harness and framework for developing Fuzzy::Matcher implementations which I will post in a new thread (when i get sufficient time) as IMO this one has become too polluted with acrimony to be worth continuing. Your code as massaged to fit into this framework (along with the baseclass) is in the following readmore. Note that the test harness monitors memory utilization post-prepare() which is why the default prepare() is the way it is (to reduce memory overhead).

---
demerphq

Re: Fuzzy Searching: Optimizing Algorithm Selection
by ikegami (Pope) on Nov 11, 2004 at 00:14 UTC

My solution builds a regexp that looks like the following, minus the lines breaks:

```^
(?:a|(?(?{ local \$fuzzies = \$fuzzies; !\$fuzzies-- })(?=A)(?=Z)).)
(?:p|(?(?{ local \$fuzzies = \$fuzzies; !\$fuzzies-- })(?=A)(?=Z)).)
(?:p|(?(?{ local \$fuzzies = \$fuzzies; !\$fuzzies-- })(?=A)(?=Z)).)
(?:l|(?(?{ local \$fuzzies = \$fuzzies; !\$fuzzies-- })(?=A)(?=Z)).)
(?:e|(?(?{ local \$fuzzies = \$fuzzies; !\$fuzzies-- })(?=A)(?=Z)).)
(?{ \$fuzzies_left = \$fuzzies })
\$

Here it is:

```our \$USE_A_FUZZY =
'(?(?{ local \$fuzzies = \$fuzzies; !\$fuzzies-- })(?=A)(?=Z))';

our \$COUNT_FUZZIES_LEFT =
'(?{ \$fuzzies_left = \$fuzzies })';

{
my \$string = 'apple';
my \$max_fuzzies = 2;

my \$re = '^';
\$re .= sprintf("(?:%s|\${USE_A_FUZZY}.)", quotemeta(\$1))
while (\$string =~ /(.)/g);
\$re .= "\${COUNT_FUZZIES_LEFT}\\$";

\$re = do { use re 'eval'; qr/\$re/ };

our \$fuzzies;
local \$fuzzies;

our \$fuzzies_left;
local \$fuzzies_left;

while (<DATA>) {
chomp;
\$fuzzies = \$max_fuzzies;
if (/\$re/) {
\$fuzzies = \$max_fuzzies - \$fuzzies_left;
if (\$fuzzies) {
print("Fuzzy match with \$_ (\$fuzzies fuzzies).\$/");
} else {
print("Exact match with \$_.\$/");
}
} else {
print("No match with \$_.\$/");
}
}
}

__DATA__
apple
arple
brple
brqle
orange

------ OUTPUT ------
Exact match with apple.
Fuzzy match with arple (1 fuzzies).
Fuzzy match with brple (2 fuzzies).
No match with brqle.
No match with orange.

Don't know if it's faster.

Known Bugs: 'apples' doesn't count as a fuzzy match for 'apple', but it would count as a fuzzy match for 'apple '. I can fix that if you want.

Update: Since we never have any backtracking, we can simplify the above to:

Since his dictionary strings are variable-length, I don't think you can use begin and end anchors. For example, "I ate an apple." should match, but it doesn't with your code.

I must have missed that part of the problem description. The first version should work fine without the anchors.

There is a problem with removing the anchors however (and not just in my solution, I think). It would be possible to find a match with fuzzies while there's an exact match later on in the string.

Re: Fuzzy Searching: Optimizing Algorithm Selection
by johnnywang (Priest) on Nov 11, 2004 at 01:34 UTC
Update: realized this is the same idea as tachyon's Re: Fuzzy Searching: Optimizing Algorithm Selection

How about something simple minded: for each substring of the target, compute the difference with the search string, if the diff is less than 3, accept it. This will only find out matches that have at most two letter difference (but not missing):

```use strict;

my \$search = "apple";
my \$length = 5;  #length of the target
my \$FUZZY = 2; #
my \$BIG = 10;
while(my \$a = <DATA>){
chomp \$a;
my \$score = \$BIG;
for(my \$i = 0; \$i + \$length < length(\$a); \$i++){
\$score = diff_score(\$search, substr(\$a,,\$i,\$length));
if(\$score <= \$FUZZY){
print "\$score: \$a\n";
last;
}
}
if(\$score > \$FUZZY){
print "-1:\$a\n";
}
}

# return the number of different letters.
sub diff_score{
my (\$first, \$second)= @_;

return \$BIG unless length(\$first) == length(\$second);
my \$score = 0;
for(my \$i = 0; \$i < length(\$first); ++\$i){
++\$score if substr(\$first,\$i,1) ne substr(\$second,\$i,1
+);
}
return \$score;
}

__DATA__
This is an apple
An arple is here
What bpple is that
What  is that
What bple is that
aplle is not a word
Output as follows, diffs are in front, -1 show not matching fuzzily.
```__OUTPUT__
0: This is an apple
1: An arple is here
1: What bpple is that
-1:What  is that
2: What bple is that
1: aplle is not a word
Re: Fuzzy Searching: Optimizing Algorithm Selection
by dragonchild (Archbishop) on Nov 11, 2004 at 00:50 UTC
My first thuoght was to use a database (as it seems to always be). Oracle has the Context engine and MySQL has FULLTEXT indices. But, unfortunately, MySQL's FULLTEXT engine is based around searching paragraphs and Oracle's engine is as well, plus being really really hard to work with. But, we can still harness the power of the RDBMS here, just with a little thinking and some pre-processing.

Translating what you're doing into set theory, you're attempting to do something like

```SELECT string
FROM some_table
WHERE string = ?
OR off_by( 1, string, ? )
OR off_by( 2, string, ? )

Obviously, the issue is the off_by() function is what we're searching for. But, what we're really looking for is the set of strings that are off-by-N. So, we create additional columns. We would need a table something like:

```string char(25)
match  char(25)
off_by int(1)
Then, we can do something like
```SELECT string
FROM some_table
WHERE string = ?
OR ( match = ? and off_by = 1 )
OR ( match = ? and off_by = 2 )
Some intelligent indexing and we're off to the races! Assuming you have enough storage, this solution returns in near-constant time.

Of course, the key is storage. You're talking about 25 character strings. Let's assume we're restricted to A-Z only (which will help a bunch). So, you need one row for off_by=0. off_by=1 means you need 25 * 25, or 625 rows. off_by=2 means you need an additional 625 * 24 * 25 rows, or 375_000 rows. So, to store off_by=1 and off_by=2, you need a total of 375_626 rows for each string. At 51 bytes per row, or 18.26MB. Per string you want to store. That's not good.

However, there are a number of optimizations one can do. The easiest one is that many of the strings in your dictionary are going to be off_by's of other strings. So, you are already storing the off_by, if you could only link the two. So, some sort of XREF table may be in order. You have one table linking the string to some integer ID. Then, you have a cross-reference containing the IDs of the two strings and their off_by value.

Now, this doesn't get you _all_ the different off_by values, but you can do something else - you can store all your words, then pre-compute their off_by=1 and off_by=2. You can then store a third column in the main table as to which dictionary(s) this word belongs to. (This accidentally solves your multiple dictionary issue.)

Now, we're still trading space for time. In this case, a LOT of space for, potentially, quite a bit of time. But, it's still a design decision I would take a serious look at. Doing some sort of pre-calculation like I'm describing would bring you to, at worst, some O(logN), and probably very close to O(1).

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Fuzzy Searching: Optimizing Algorithm Selection
by tachyon (Chancellor) on Nov 11, 2004 at 03:56 UTC

AGREP (approximate grep) probably does what you want and the algorithms are outlined on the site, plus you can get the source code. A variation built around this code may well be as fast as it gets. Here is a postscript research paper on it

cheers

tachyon

I also recommend use of agrep. The only caveat is the restrictions on free use for commercial applications. I don't believe there is anything more efficient or better suited. The link that tachyon and I point to has links to other libraries and applications.
perlcapt
-ben

From a fairly quick perusal of the options, I don't think agrep will help much, except maybe as a pre-filter.

• It won't report where in a line a a match was found.
• It stops matching against a given line when it finds the first match.
• If you supply a file of things to match, it doesn't tell you which one matched.

Maybe I missed some things in amongst the six help 'screens'?

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
I went to the University of Arizona, and as an undergraduate, I would sit in on masters- and postdoc- level classes where folks were discussing stuff like agrep.

Somewhere in my files I have the follow-up to this paper, which allows for affine weighting for various symbols. For example, you might say that vowels are more interchangeable than consonants, if you're looking for fuzzy matches in the pronunciation problem space. I think the author of that paper went on into bioinformatics in a big way after that.

--
[ e d @ h a l l e y . c c ]

Re: Fuzzy Searching: Optimizing Algorithm Selection
by BrowserUk (Pope) on Nov 12, 2004 at 02:33 UTC

Just in case Itatsumaki should happen back this way :)

Let's put this problem into a little perspective.

#### Using the regex engine

To use regex to test for all of:

• An exact match.
• Every possible 1-char mismatch.
• Every possible 2-char mismatch.

Requires 1 + 25 (N) + 300 (25c2) = 326 different regexes.

Applying this mechanism to 1,000 25-ers, across 1,000 sequences of 1000-chars requires 326,000,000 calls to the regex engine[1].

If Perl could invoke the regex engine 10 times / millsecond[2], that would take just over 9 hours.

If you extend this to 100,000 25-ers, applied to 30,000 sequences of 1000-chars each, then the runtime will be just over 3 years!

If you double the number of 25-ers, you double the time.

If you double the average length of the strings, you slightly less than double the time.

However, if you allow for up to 3 mismatch characters, the number of regex required goes from 326 to 2626, and you have multiplied the time required by a factor of 8 giving the total time required to just under 25 years.

Allowing for 4 mismatches and you're in the region of 145 years.

Note: In this simplified scenario, no attempt has been made to record where the matches are found, nor which level of fuzziness matched.

#### Building an index of substrings

Starting with the premise that every 1-miss match must contain at least a 12-character exact match. Using the same 100,000 x 30,000 x 1000 scenario from above. To index all the 12 character substrings would require 1000 - 11 (989) x 30,000 index entries containing the 12-byte substring + 2-byte (min) sequence number + 2-byte (min) offset. That's 450 MB of index which should easily fit into memory on any half way decent machine.

But, in it's raw form, it doesn't buy you much. Even sorted, and using a binary chop to look up your data is going to take upto 19 chops to locate each item. At say 10 microsecond per chop (way optimistic) this doesn't sound too bad. 100,000 25-ers, break up into 14 x 12-char substrings each. Giving 14 * 100,000 * 19 = 266 seconds spent on lookup.

But what have we achieved? We may have succeeded in rejecting some of the sequences as not containing any of the 25-ers--but it is unlikely. The chances that a 1000 character string made up from A,C,G or T will contain any given 25-char substring is around 67 million times less likely than it will contain a given 12-char substring.

That's difficult to comprehend, but by reducing the length of the substring from 25 to 12, you are 67 million time less likely to be able to reject a given 1000-char sequence on the basis of it not containing the 12-char substring! Reduce the string to 7 or 8 chars (for the 2-misses match) and your effective rate of elimination drops through the floor.

And once we have located the index entry that corresponds to the 12-byte sequence, we will still need to use the sequence number/offset information to go out and read a 38 byte substring from the sequence in order to verify the match!

``` 1 2 3 4 5 6 7 8 9   1 2 3 4 5 6 7 8 9   1 2 3 4 5 6 7 8 9   1 2 3 4 5
+ 6 7 8
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
++-+-+-+
| | | | | | | | | | | | |?|X|X|X|X|X|X|X|X|X|X|X|X|?| | | | | | | | |
+| | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
++-+-+-+

If the Xs represent our known match 12-chars, the other 13-chars of our match could be at either end. Hence the need to read 38 bytes from the sequence. But each of our 100,000 25-ers could fit around those 12-bytes in 14 different ways. That's fourteen different regexes we need to apply to that 38-byte substring (that needed to be read from a variable length file!) for every one of our 29,000,000+ 12 bytes substrings that matched.

Sorry, but the math involves probabilities here and the whole lot got beyond my ability to keep straight.

The upshot is that for even a 1-miss match, the index has done little or nothing to reduce the work required, and in many ways has increased it substaintially. By the time you get to 2-missed char matches, the numbers become astronomical.

This option goes nowhere fast.

#### "Use a database!"

Indexing via a database (of any form) simply exascerbates the above problems. The size of the database required to index a 30,000 x 1000-char sequences is huge. Building the indexes would take on the order of weeks, and the recall would be in the order of seconds. And for all that done, you still have to do most of the regex work yourself locally.

If anyone feels like contesting this conclusion, please show me the code--on realistically sized datasets. I may not be the worlds greatest DBA, but my limited attempts to explore the possibilities of using a DB and SQL to achieve this indicate that it would be 2 or 3 orders of magnitude slower than the flat files and regex solution above.

That is to say, I estimate that the 100,000 x 30,000 x 1000 x 1 mismatch scenario above would require 30 years of processing time. And that does not include the time required to build the DB.

#### AGREP and similar utilities

I spent an inordinate amount of time readng the help and playing with 2 different versions of agrep. My conclusion is that (as I indicated here), there is no way to utilise this tool for this purpose. It simply does not have any options to retrieve the sequence number/offset/fuzziness information required.

If anyone thinks they know better, I'd be all too pleased to see the working solution.

#### Stringy XOR

Applying the XOR method I describe elsewhere, the 1000 x 1000 x 1000 scenario I described takes 28.5 minutes.

```[ 6:53:31.34] P:\test>406836.pl 406836.25s.1000 406836.seq.1000 >40683
+6.results
Loaded 1000 25-ers at P:\test\406836.pl line 17.
Processing sequence  1000 offset 01238

Processed 1000 sequences at P:\test\406836.pl line 48, <SEQ> line 1000
+.
Average length: 1016.119 at P:\test\406836.pl line 49, <SEQ> line 1000
+.
Total fuzzy comparisons: 992119000 at P:\test\406836.pl line 50, <SEQ>
+ line ...

[ 7:22:03.34] P:\test>
```

Code at end

That gives a time of 1.723 microseconds per 25x25 fuzzy comparison.

Applied to the 100,000 x 30,000 x 1,000 scenario, that would require 59 days, 9 hours.

That's just over 5% of the time (or 1,845% quicker) than the most optimistic estimate for the best of the above.

And the real trick is that if you increase the number of mis-matched characters to 3 or 10 or 20, the time required remains the same.

For the 3-mismatch scenario, that means 0.65 % of the runtime or 15,000 % faster.

For the 4-mismatch scenario, that means 0.11 % of the runtime or 89,000 % faster.

After that, the numbers begin to get silly :)

Further, I think that the XOR method could be improved. If the sequences and 25-ers are limited to the ACGT patterns, rather than containing other characters representing unknowns (Xs) and other letters representing proteins(?) and stuff, then you could pack 4 bases into each character and reduce the length of the strings by a factor of 4.

The downside is that you have more work to determine the fuzziness factor. So, maybe not.

#### BLAST and similar tools.

I'm not really qualified to comment as I do not understand the documentation for the BioPerl bundle. What I will say is that given the depth and complexity of the hierarchy of modules; the known overhead of method lookups combined with the less than sparkling performance of Perl's sub calling, I fail to see how anything in the bundle would come close to outperforming the regex engine and flat files.

Maybe some BioPerl afficionado will happen by and post a 20 line solution that outstrips everything here--but I installed the bundle a while ago, and I couldn't even work out where to start. Then again, I'm not a biochemist.

[1]Yes. This number of individual calls to the regex engine could be reduced by combining the regexes into one big regex using PreSuf or similar tools, but given the nature of the regexes involved, the complexity of the output from combining 326 regexes into one big regex is likely to slow you down rather than speed things up.

To see what I mean, try generating the 326 regexes required for the 2-mismatches in a 25 character ACGT string. The OPs post has code to do this. The combine them into a single alternate-| regex. Then run your favorite regex optimiser. Then time a few iterations of the result against a 1000-char ACGT sequence.

[2]This is very optimistic even for the simple cases:

```timethis 10000, q[@m = ('acgt'x250 ) =~ m[(acgtacgtacgtacgtacgtacgta)]
+g ];
timethis 10000:  1 wallclock secs ( 1.16 usr +  0.00 sys =  1.16 CPU)
+@ 8643.04/s (n=10000)
```

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

If you double the number of 25-ers, you double the time.

Yes.

If you double the average length of the strings, you slightly less than double the time.

No: 1 + 50C1 + 50C2 = 1 + 50 + 1225 = 1276, nearly 4 times the 326 different regexps required for a length-25 string.

Of course there are two ways of allowing for fuzziness: replace /x/ with /[^x]/, or replace it with /./; the latter approach means you don't also need the 0- and 1-mismatch patterns, leaving 300 patterns (for length-25, up to 2 errors) or 1225 (for length-50).

Hugo

No: 1 + 50C1 + 50C2 = 1 + 50 + 1225 = 1276, nearly 4 times the 326 different regexps required for a length-25 string.

Sorry, I was lapse in my language. Instead of "...length of the strings...", I should have said sequences. That is, if you double the length of the sequences being searched, from 1000 to 2000 chars (average), but search for the same 25-char needle, then the time taken is slightly less that double.

I agree if the length of the needles double, the number of regexes required close to quadruples.

As I understand the problem, using /./ for the fuzziness is the right thing. However, whilst using a regex with 2 wildcards will match anythng that the same regex with only one wildcard would match, the problem with discarding the latter is that you will no longer be able to record how fuzzy the match was.

Other than perhaps capturing the wildcards and then using substr to determine which of the wildcards would have matched had it not been wild. Overall, the slowdown through capturing + the additional work determining the fuzz factor, it is probably quicker to retain the 1-mis and 2-mis regexes?

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Fantastic analysis!

I think it's interesting that the XOR method, which appears to be the most efficient, and widens it's lead as the problem space expands, corresponds more closely to how nature itself "searches" and "parses" amino acid strings than any of the other methods. Our sensibilities ought to help us realize that if millions of years of evolution have selected a solution to a problem, there must be something useful there.

Well BrowserUk, i just finished some testing of my approach. I was able to search a dictionary of 30 million words for one of 1245 possible fuzzy matches in 30 minutes on 555 mhz pc. Admittedly this was anchored at the first char not a match "in string" as it were.

Considering your extremely harsh comments in the CB id be very happy to run whatever code you choose against my code to see which is faster. The code takes as an argument a string of digits and then finds all 0 1 or 2 char mismatches of that string in the dictionary with the correct number of digits. If you provide code so that I can do the equivelent of

```  my \$match=\$trie->fetch(\$num);

Then ill run your code against mine with the same test set, etc and we will see whos solution is faster.

Since my solution represents a special case of the general problem yours should perform even better than you have suggested here so I expect that youll win. But until you have dont you think you should keep your comments at least polite?

---
demerphq

#### Why tries won't work for this application.

For 25-chars (C) keys from a 4-char alphabet (A) with 0, 1 or 2 miss matches (M) that's

1. 0-miss (exact.)

= 1

2. 1-miss

(CcM) * (A-1)M = 25c1 * (4-1)1 = 25 * 31 = 75.

3. 2-miss

(CcM) * (A-1)M = 25c2 * (4-1)2 = 300 * 32 = 2700.

Total 2776 keys. No problem building or with the performance of a Trie with 2776 keys.

Although the implementations currently on CPAN chew memory at a prodigious rate--a C implementation would be much less memory hungary as well as faster.

#### The problems.

To simplify the description of the problems, I'll start with using a 4-character key from an alphabet of ACGT.

Further, we'll take the OPs description and look for matches with 0, 1 or 2 mismatched characters.

For 4-chars (C) keys from a 4-char alphabet (A) with 0, 1 or 2 miss matches (M) that's

1. 0-miss (exact)

= 1

2. 1-miss

(CcM) * (A-1)M = 4c1 * (4-1)1 = 4 *31 = 12

3. 2-miss

(CcM) * (A-1)M = 4c2 * (4-1)2 = 6 * 32 = 54.

Total 67 keys.

Picking one of the 256 possible keys, I'll use 'ACGT'.

The keys we need to put into our trie in order to fuzzy match 'ACGT' are:

```## 0-miss (exact) match. 1 unique
ACGT

## 1-miss match: 4*4 = 16 - 4 duplicates of above = 12 unique
## With tags (°=exact; š=1-miss match; ˛=2-miss match) showing where
## a key resulted from multiple sources.
v        v        v        v  ## The vs indicate the columns varying.
-----------------------------
ACGT°˛  AAGT˛   ACAT˛   ACGA˛
CCGT˛   ACGT°˛  ACCT˛   ACGC˛
GCGT˛   AGGT˛   ACGT°˛  ACGG˛
TCGT˛   ATGT˛   ACTT˛   ACGT°˛

## 2-miss match: 6*4*4 = 96 -
## ( 6 duplicates of 0-miss + ( 3 each duplicates * 12 1-miss ) )
## With tags (°=exact; š=1-miss match; ˛=2-miss match) showing where
## a key resulted from multiple sources.
## = 54 unique.
vv      v v     v  v     vv      v v      vv
--------------------------------------------
AAGTš   ACATš   ACGAš   AAAT    AAGA    ACAA
CAGT    CCAT    CCGA    ACATš˛  ACGAš˛  ACCA
GAGT    GCAT    GCGA    AGAT    AGGA    ACGAš˛
TAGT    TCAT    TCGA    ATAT    ATGA    ACTA

ACGT°   ACCTš   ACGCš   AACT    AAGC    ACAC
CCGTš   CCCT    CCGC    ACCTš˛  ACGCš˛  ACCC
GCGTš   GCCT    GCGC    AGCT    AGGC    ACGCš˛
TCGTš   TCCT    TCGC    ATCT    ATGC    ACTC

AGGTš   ACGT°˛  ACGGš   AAGTš˛  AAGG    ACAG
CGGT    CCGTš˛  CCGG    ACGT°˛  ACGGš˛  ACCG
GGGT    GCGTš˛  GCGG    AGGTš˛  AGGG    ACGGš˛
TGGT    TCGTš˛  TCGG    ATGTš   ATGG    ACTG

ATGTš˛  ACTTš   ACGT°˛  AATT    AAGTš˛  ACATš˛
CTGT    CCTT    CCGTš˛  ACTTš˛  ACGT°˛  ACCTš˛
GTGT    GCTT    GCGTš˛  AGTT    AGGTš˛  ACGT°˛
TTGT    TCTT    TCGTš˛  ATTT    ATGTš˛  ACTTš˛

## Giving us our 67 unique keys.
```

Now it's no problem to use a more intelligent generation routine to avoid generating most of the duplicates.

Indeed, noticing that the brute-force, 2-miss generation process produces all of the keys required by the 0- and 1-miss cases, it would be a simple process skip those generation steps and simple de-dup the 2-miss set.

```## 2-miss match generation with duplicates removed
## With tags (°=exact, š=1-miss match) showing where
## a key resulted from multiple sources.
vv      v v     v  v     vv      v v      vv
AAGTš   ACATš   ACGAš   AAAT    AAGA    ACAA
CAGT    CCAT    CCGA                    ACCA
GAGT    GCAT    GCGA    AGAT    AGGA
TAGT    TCAT    TCGA    ATAT    ATGA    ACTA

ACGT°   ACCTš   ACGCš   AACT    AAGC    ACAC
CCGTš   CCCT    CCGC                    ACCC
GCGTš   GCCT    GCGC    AGCT    AGGC
TCGTš   TCCT    TCGC    ATCT    ATGC    ACTC

AGGTš           ACGGš           AAGG    ACAG
CGGT            CCGG                    ACCG
GGGT            GCGG            AGGG
TGGT            TCGG    ATGTš   ATGG    ACTG

ACTTš           AATT
CTGT    CCTT
GTGT    GCTT            AGTT
TTGT    TCTT            ATTT
```

In fact, provided your Trie implementation doesn't die or bellyache when you attempt to add a duplicate key, it will automatically perform the de-duping process for you.

#### And there's the rub

Once you start storing multiple sets of overlapping data into your Trie in order to make best use of it's performance, when you get a match, you can no longer distinguish whether this match occurred because it was an exact match, a 1-miss match or a 2-miss match.

So, even if you generate 1 Trie per key and apply them to each of the 976 25-char substrings in a 1000-char sequence 1 at a time (which is necessary using a Trie and has the beneficial side-effect of allowing you to know at what position within the sequence the match occured--which is a requirement), you still don't know why it matched.

Even if you arrange for the value associated with the key to be an array of tags associated with the sources of the key, that will only tell you the possible sources of the match, not the source.

So, whilst Tries are very good at doing fast lookups, and can be used for doing fuzzy lookups, the problem is that, like regex, they don't tell you what they matched.

With a regex, you can use the capture mechanism to discover this. Whilst this will slow the process somewhat, it is more than compensated for by the fact that the regex engine can be allowed to run through the entire string and locate every match and record where it found them (@- & @+), so the number of calls to the regex engine is 1/per sequence/per fuzzy match.

#### MegaTrie

Now it is possible to combine the keys generated from more than one exact key, into a single Trie. At least in theory, this would allow the 100,000 exact keys + all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, by solving all 100,000 * 2700 (2-miss) "Does it match?" questions, in one go, for each 25-char substring. Reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back). When compared with the 954,528,000,000,000 otherwise required by the 2-miss scenario, this sounds very inviting.

The problem is, that all you then know, is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match, each of the 29,280,000 substrings. So, you get to know something, (that the 25-char substring at position P of sequence S matched something)--very quickly--but not what (which of the 100,000), nor why (exact, 1-miss or 2--miss).

#### Fast pre-elimination?

At this point, I thought that maybe the MegaTrie idea could be used as a way of pre-eliminating some of the nearly 30 million substrings, prior to using (say) the XOR method to derive the required information. However, the fact that the (Mega)Trie needs to be applied individually to each substring--unlike a regex for example--and a Trie lookup involves either recursive or iterative sub-calls, it is slower than just using the XOR alone.

That means that there is no saving to be had using a Trie--even for this limited purpose.

#### Determanistic Finite-state Automaton

The ultimate expression of the Trie idea would be to write a DFA. Essentially this consists of a MegaTrie-like datastructure and a loop to apply that successively down the length of a string, substring-by-substring, reporting (the position(s)) of the matche(s) found. Much like the regex engine does, but failing early and moving on rather than backtracking*.

*It's worth noting that you can also use the 'cut' operator (?>...) to eliminate backtracking in the regex engine.)

This would reduce the number of lookups (again defined as a transition from Perl into C and back), to 30,000, though you still wouldn't know what matched, but the speed-up would offset the cost of further elimination.

The problems include:

• All the DFA modules I look at on CPAN used hash-based objects to represent the DFA-states, and as such would require huge amounts of memory to capture the complexity of this problem.
• Writing a C-based DFA is non trivial for a few, known states.

But this problem involves a huge number of states, and those states are going to vary from run to run. So the task is not writing a single DFA, but that of writing a DFA generator. This is an order of magnitude more complex to do.

• The shear volume of states involved means that it would be no good writing a poor implementation of a DFA and relying upon the natural speed of C to see you through.

Not only would it need to be a fairly efficiently code DFA, it would also need to be highly optimised in its use of memory (notoriously difficult) in order that the huge number of states would fit into memory concurrently.

• It would take a very highly skilled (C) programmer to write a DFA for this task that would out-perform the Perl regex engine and some suitably carefully crafted regex (using the cut operator).

The regex engine has had a lot of very skilled programmers working on it for a very long time. If the OP is upto this task, he is in the wrong job as a biochemist (Assumption!).

The gulf between what is theoretically possible and what is practical is very wide.

#### The point of all this?

Firstly, I'd done all the above work in order to verify my position anyway, and I thought someone might benefit from it.

Secondly, the next time someone calls you to question over something you have posted,

1. Don't get so upset by the form of the message that you ignore the content.

This is especially true of a private message between old sparing partners.

Even if it says "Utter bollocks", which is not rude, but a colloquiallism for "Utter rubbish" or "Completely in error".

2. Don't assume that you must be right and the other guy must be wrong.

You might be in error.

If it was worth posting, it is always worth a second (cool) assessment before you defend your post to the hilt.

3. Don't conflate the matter to which the message relates, with other, older issues, long dead.
4. Don't assume that because you have used the methods in question successfully on some other project, that the same methods will be applicable to the problem at hand.

Even if they sound remarkably similar at first reading.

5. Don't assume that your understanding of the CS issues involved are superior to the other guys understanding, no matter what history between you dictates.

He may always have learnt from prior encounters and/or further learning in the interim.

6. Don't expect the other guy to accept your statistics as proof.

Especially when those statistics are produced using a code you cannot show, running on data that is completely unrelated to the problem at hand.

7. Don't expect the other guy to produce code to validate your argument.

Especially when he has already published analysis, code, and verifiable statistics.

Especially when the code you are asking hm to produce would have to run against a system that you cannot show him.

Especially when that system uses data that is completely unrelated to the problem under discussion.

Especially when the information that system produces is completely inadequate for solving the problem at hand.

8. Don't think that you can never make a mistake.

There is nothing wrong with being human.

But, if the result of your mistake could lead others into considerable time and effort engaging in fruitless explorations, based on your mistake, do make sure that you clearly identify them when you realise.

FWIW, most of these have been applicable to me in the past. I'm no saint (in the other sense of the word) as you well know.

Oh! And if you ask for clarifications of the problem, do give the other guy the chance to produce them before:

• Raising the ante (by going public);
• Engaging him in extended bouts of discussion that would be unnecessary or simplified by that clarification.
• Firing back several, extended batches of dismissals, provarications and "fuck off you dickhead"s at him.

Thirdly, maybe a view through the "other guys eyes" of yesterday's conflagration will pursuade you to review your final act in our off-post communications?

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

demerphq You are wrong.

How long were your "words"? Less than 25 characters?

For the two-miss scenario, matching each of 100,000 25-character needles against each of 30,000 x 1000-char strings requires:

326 * 100,000 * 976 * 30,000 comparisons = 954,528,000,000,000 comparisons.

Your 30,000,000 * 1275 = 38,250,000,000 comparisons.

Your rate of comparisons is 21,250,000/second.

Your runtime to run the 100,000 x 30,000 x 1000 is:

1 year, 5 months, 4 days, 21 hours, 29 minutes, 24.7 seconds.

For the 3-miss scenario the numbers are:

2626 * 100,000 * 976 * 30,000 = 7,688,928,000,000,000.

Your runtime to run the 100,000 x 30,000 x 1000 is:

11 years, 5 months, 20 days, 2 hours, 51 minutes, 45.88 seconds.

For the 4-miss scenario the numbers are:

28,252 * 100,000 * 976 * 30,000 = 82,721,856,000,000,000.

Your runtime to run the 100,000 x 30,000 x 1000 is:

123 years, 4 months, 9 days, 17 hours, 27 minutes, 3.46 seconds.

As for your challenge. I have published my code. Where is yours? Try applying your method to real world data.

If you have somewhere I can post the 1000 x 1000-char randomly generated sequences (994 K) and the 1000 x 25-char randomly generated search strings (27 kb) then I will send them to you so that you can time the process of producing the required information of:

Sequence no/ offset/ fuzziness (number of mismatched characters) that my published test code produces in 28.5 minutes.

Then, and only then, when we are comparing eggs with eggs, will there be any point in continuing this discussion.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Fuzzy Searching: Optimizing Algorithm Selection
by erix (Parson) on Nov 11, 2004 at 19:31 UTC

I would be really thankful if those who post bioinformatics questions would say what bioperl has to offer for their problem. Or, if they have not looked for it, say that.

This is a (general) request, not criticism. The problem is interesting enough. :)

Re: Fuzzy Searching: Optimizing Algorithm Selection
by NiJo (Friar) on Nov 12, 2004 at 22:47 UTC
Tats, can you afford some intermediary false positives? If yes, the Bloom filter will scale. It dramatically reduces space requirements and lets you play 'divide and conquer'. You set up e. g. 100 filters and test for hits. Your search space is down to 1%. As the bloom filter will produce some false positives, a proper retest is needed. But it will quickly weed out sure non-hitters. It has disadvantages: 1) Just hit/miss info. Not X matched Y. 2) Fixed lenght and 3) exact only. But fast. Even if it sounds impossible, please do the math for a brute force fixed length approach. Perl.com has a good article on Bloom filters and CPAN has Bloom::Filter by the same author. In an unfinished hobby project I based the Bloom filter operations on Bit::Vector instead of pure perl vec and got a huge speed increase (more than tenfold, increasing with size). If you use Bloom filters in your project, you owe us a C implementation :-). Given the estimated run time for other approaches it might be feasible to get easy results quickly. For a bloom filter that is exact matches. Hope that helps, Johannes Nieß
Re: Fuzzy Searching: Optimizing Algorithm Selection
by demerphq (Chancellor) on Nov 12, 2004 at 16:00 UTC

Update:Please dont forget to note the sentence at the end which points out that I got muddled and was thinking of only when the string was matched at the start. Sorry.

I might approach the problem by first converting my search string into all of the possible other strings that I will consider to be a fuzzy match. Then build a trie to represent this structure. Then use the trie to search the dictionaries. (You may want to code the Trie in Inline::C.) This will outperform regexes in my experience (Trie lookup is pretty well optimal, O(KeyLen)~O(1),). If i had to do these searches a lot and the size of my alphabet was reasonable, i would look at producing fingerprints of my dictionary (ie, all words with the same letters in them (regardless of counts) would have the same fingerprint"). Assuming you have a fast way to get the list associated with a given fingerprint you should have cut your search space vastly as a single XOR against the fingerprint will tell you if a match is possible. Even if the alphabet was small (like ACTG) you could use tuples for your fingerprinting algorithm, so for instance assuming you use an integer for the fingerprint you could assign one bit to each possible combination.

I could probably put together an implementation that worked on a base 10 alphabet, (ie Numeric Digits) and searched a 30 million record dictionary for matches if you provided a routine to generate the match list. Unfortunately i couldnt show you the code, but the results may tell you if the techniques were appropriate.

Incidentally afaict the upper time of this approach would be 25 * N where N is the number of words in your dictionary. Assuming you fingerprint then the time will be much lower than that. Thus the algorithm would be O(N). Which im pretty sure is faster than anything else posted here.

An alternative approach would be to store all of your dictionaries in one great big trie and then reverse the process, looking up each fuzzy word in the trie and recording the words matched. This would only be suitable if the density of the data stored in the dictionaries was high and/or memory was abundant, but would have the advantage of being almost instantaneous to search. Assuming that 1 word had 300 fuzzy equivelents the structure should be searchable in 300*25 timesteps regardless of dictionary size.

Gah, I jsut realized that the above only applies if you need to match from the front. If you need to match from any position in the dictionary words then you need to go beyond a trie.

---
demerphq

Re: Fuzzy Searching: Optimizing Algorithm Selection
by Anonymous Monk on Nov 12, 2004 at 18:19 UTC
There was a few years ago an article in DDJ about searching with Burroughs-Wheeler-Transform. The algo goes essentially like this: BWT your search space. All the strings with the same last letter will be together. Search for the first and last of them. They will give you index to second last letter. Now you can part your range according to second last letter in search string. Repeat 25. If you dont part you do a fuzzi. See the article http://www.ddj.com/articles/2003/0312/
Re: Fuzzy Searching: Optimizing Algorithm Selection
by ambrus (Abbot) on Jun 01, 2008 at 18:17 UTC

Get a perl 5.010. Download the source of re::engine::TRE. Download the full source of the TRE library. Replace the reduced source that comes with the module with the full source. Change the configuration so TRE will be built with fuzzy matching support enabled. Build, test, debug, fix, repeat. You get fast fuzzy matching that enables you to do exactly what you want.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://406836]
Approved by cLive ;-)
Front-paged by cLive ;-)
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (8)
As of 2020-05-28 08:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
If programming languages were movie genres, Perl would be:

Results (165 votes). Check out past polls.

Notices?