comment on

All,
This isn't so much a perl question as an algorithm question but the solution will be coded in perl (at least partially).

The program will be processing millions of messages that contain a set of 9 digits where each digit may have a value of 1-9. For instance, it may contain (1, 7, 3, 3, 9, 5, 6, 1, 2). It is critical that all messages that contain the same set need to be stored together.

Unfortunately, the set may contain an error (think typo). If a set doesn't match any previously seen sets exactly, the program needs to determine if this should be a new set or if it belongs to a previous set. Assume that I have a perfect way of doing this if I have two messages side by side, what I am looking for is a fast/cheap way to identify candidate sets.

In other words, I can't compare each set to every previous set. What I want to be able to do is very quickly/cheaply identify some candidates that are worth the time to do an expensive message to message compare. Here are the types of things I want to allow for:

Any change to the order (1,2,3,4,5,6,7,8,9 instead of 9,8,7,6,5,4,3,2,1) AND/OR 1 of the following:
Exactly 1 insertion (10 digits instead of 9) OR
Exactly 1 deletion (8 digits instead of 9) OR
Exactly 1 transformation (7 instead of a 5)

If I were just going with the first one, it would be simple. Sort/concatenate the list and perform a hash lookup.

I am sick with a pretty bad head cold so I am going to assume I have done a poor job of explaining. I apologize in advance. Here are some of the ideas I have came up with that I don't think will work:

Using a range of the product of the set
Using a range for the average and deviation of the set
Using a range for the "distance" to another 3rd hardcoded set

Fast and cheap but without too many false positives - ideas?

Cheers - L~R

In reply to Finding Nearly Identical Sets by Limbic~Region

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


We don't bite newbies here... much
	PerlMonks