Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
This isn't so much a perl question as an algorithm question but the solution will be coded in perl (at least partially).
The program will be processing millions of messages that contain a set of 9 digits where each digit may have a value of 1-9. For instance, it may contain (1, 7, 3, 3, 9, 5, 6, 1, 2). It is critical that all messages that contain the same set need to be stored together.
Unfortunately, the set may contain an error (think typo). If a set doesn't match any previously seen sets exactly, the program needs to determine if this should be a new set or if it belongs to a previous set. Assume that I have a perfect way of doing this if I have two messages side by side, what I am looking for is a fast/cheap way to identify candidate sets.
In other words, I can't compare each set to every previous set. What I want to be able to do is very quickly/cheaply identify some candidates that are worth the time to do an expensive message to message compare. Here are the types of things I want to allow for:
- Any change to the order (1,2,3,4,5,6,7,8,9 instead of 9,8,7,6,5,4,3,2,1) AND/OR 1 of the following:
- Exactly 1 insertion (10 digits instead of 9) OR
- Exactly 1 deletion (8 digits instead of 9) OR
- Exactly 1 transformation (7 instead of a 5)
If I were just going with the first one, it would be simple. Sort/concatenate the list and perform a hash lookup.
I am sick with a pretty bad head cold so I am going to assume I have done a poor job of explaining. I apologize in advance. Here are some of the ideas I have came up with that I don't think will work:
- Using a range of the product of the set
- Using a range for the average and deviation of the set
- Using a range for the "distance" to another 3rd hardcoded set
Fast and cheap but without too many false positives - ideas?
Cheers - L~R