http://www.perlmonks.org?node_id=1172842

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
This isn't so much a perl question as an algorithm question but the solution will be coded in perl (at least partially).

The program will be processing millions of messages that contain a set of 9 digits where each digit may have a value of 1-9. For instance, it may contain (1, 7, 3, 3, 9, 5, 6, 1, 2). It is critical that all messages that contain the same set need to be stored together.

Unfortunately, the set may contain an error (think typo). If a set doesn't match any previously seen sets exactly, the program needs to determine if this should be a new set or if it belongs to a previous set. Assume that I have a perfect way of doing this if I have two messages side by side, what I am looking for is a fast/cheap way to identify candidate sets.

In other words, I can't compare each set to every previous set. What I want to be able to do is very quickly/cheaply identify some candidates that are worth the time to do an expensive message to message compare. Here are the types of things I want to allow for:

If I were just going with the first one, it would be simple. Sort/concatenate the list and perform a hash lookup.

I am sick with a pretty bad head cold so I am going to assume I have done a poor job of explaining. I apologize in advance. Here are some of the ideas I have came up with that I don't think will work:

Fast and cheap but without too many false positives - ideas?

Cheers - L~R