Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Finding Nearly Identical Sets

by Limbic~Region (Chancellor)
on Sep 28, 2016 at 14:51 UTC ( #1172842=perlquestion: print w/replies, xml ) Need Help??

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
This isn't so much a perl question as an algorithm question but the solution will be coded in perl (at least partially).

The program will be processing millions of messages that contain a set of 9 digits where each digit may have a value of 1-9. For instance, it may contain (1, 7, 3, 3, 9, 5, 6, 1, 2). It is critical that all messages that contain the same set need to be stored together.

Unfortunately, the set may contain an error (think typo). If a set doesn't match any previously seen sets exactly, the program needs to determine if this should be a new set or if it belongs to a previous set. Assume that I have a perfect way of doing this if I have two messages side by side, what I am looking for is a fast/cheap way to identify candidate sets.

In other words, I can't compare each set to every previous set. What I want to be able to do is very quickly/cheaply identify some candidates that are worth the time to do an expensive message to message compare. Here are the types of things I want to allow for:

  • Any change to the order (1,2,3,4,5,6,7,8,9 instead of 9,8,7,6,5,4,3,2,1) AND/OR 1 of the following:
  • Exactly 1 insertion (10 digits instead of 9) OR
  • Exactly 1 deletion (8 digits instead of 9) OR
  • Exactly 1 transformation (7 instead of a 5)

If I were just going with the first one, it would be simple. Sort/concatenate the list and perform a hash lookup.

I am sick with a pretty bad head cold so I am going to assume I have done a poor job of explaining. I apologize in advance. Here are some of the ideas I have came up with that I don't think will work:

  • Using a range of the product of the set
  • Using a range for the average and deviation of the set
  • Using a range for the "distance" to another 3rd hardcoded set

Fast and cheap but without too many false positives - ideas?

Cheers - L~R

Replies are listed 'Best First'.
Re: Finding Nearly Identical Sets
by BrowserUk (Pope) on Sep 28, 2016 at 16:36 UTC

    A few questions:

    • You already have the set of known 9-digits numbers to compare against?

      If so, how are they currently stored?

      If not, will you just take the numbers from the first batch as 'good'?

      Or are you hoping to cross-check the first batch against the second and detect errors that might be in the first set?

      If you have a known good set, then it is simply a case of checking if the number found already exists in that known set, and that can be done very quickly using a 106MB bitstring.

    • You already have a method of finding & extracting the number from the message; even if it is 8 or 10 digits.

      Ie. Its location is fixed, or clearly prefixed, or it will be the only large (say more than 7-digits) number in the message.

      What I'm getting at here is the insertions or deletions are easily detected, because the numbers are the wrong length. That only really leaves transpositions to detect?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      I am going to answer some of your questions out of order (easiest to answer to hardest).

      You already have a method of finding & extracting the number from the message; even if it is 8 or 10 digits.

      Correct

      What I'm getting at here is the insertions or deletions are easily detected, because the numbers are the wrong length. That only really leaves transpositions to detect?

      Incorrect. While it is trivial to determine that a mistake has been made, it isn't necessarily trivial to determine what the mistake was (which set the number should have been).

      You already have the set of known 9-digits numbers to compare against?

      No. Think of it this way. The set represents what should be a unique identifier for the message sender. The first time I receive a message with a valid set (meets my aforementioned criterion) that doesn't appear to be similar to any other previously seen set - I will use that set to identify that sender. Not because it is correct but because it was first. A subsequent message from that sender may be objectively correct but for my purposes I don't know - I only know that it is nearly identical to what I have already seen.

      If so, how are they currently stored?

      I'm still figuring that out but because there will be parallel processing going on, I am assuming it will be in a database and finding good candidates will need to be 1 or more index based queries.

      Hopefully I have answered all of your questions satisfactorily enough to advance the problem but my cold medicine is really doing a number on my cognitive abilities.

      Cheers - L~R

        a valid set

        When you say "set", you're referring to one group of nine, single digits? (I'm wondering why "set" rather than just 'number'?)

        With nine digits, 1 .. 9, there are 888,888,888 possibilities; how many of those do you expect to be "receiving"?

        Over what time period? Do you reset at any point?


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      If it helps, you don't have to think of them as sets but as strings or numbers (driver's license, passport, employee ID, etc.) with the understanding that they are expected to not contain zero anywhere and should be a length of 9.

      For instance, let's say the program has '198472385'

      1. Is it less than 8 characters long? If yes, abort
      2. Is it more than 10 characters long? If yes, abort
      3. Has this "thing" already been seen? If yes, we are done
      4. Has a variation of this thing been seen (same digits but in a different order)? If yes, we are done
      5. Are there any other "things" nearly identical to this one regardless of the order of the digits (can I transform this "thing" into a previously seen "thing" with exactly one operation (insert, delete, transform))? If yes, which ones?

      Items 1 - 4 are obviously trivial. I don't even need to be perfect with number 5 though that would make me happy. I am trying to find a way to quickly/cheaply find candidates to number 5.

      One way to do it would be just to have a bunch of indices. For instance, for "a single digit has been transformed".

      1. When storing - sort the set
      2. For each of the 9 possible positions that could be changed in the future, assume each one was in fact changed and remove that character and concatenate the remaining 8 characters into an index/hash
      3. When checking candidates, duplicate the process above checking all 9 indices for an exact match
      The above scenario would work perfectly for transforms but requires you to maintain 9 different indices and the searches are ugly.

      Cheers - L~R

Re: Finding Nearly Identical Sets
by LanX (Cardinal) on Sep 28, 2016 at 19:56 UTC
    IIRC these structures are called multisets because some elements are repeated in one of your examples.

    If I understand your requirements correctly, you can use your approach in a pragmatic way, because any "neighboring" multi sets must have at least 8 digits in common.

    So

    • Sort the 9 elements of the original multi set
    • Calculate all distinct 8 element subsets (at most 9)
    • Create hash keys by joining them
    • push the original set into a HoA ( or HoH) for each subset.

    At the end you'll only need 9 hash look ups to drastically narrow down potential candidates.

    NB: That's a pragmatic approach, a detailed survey might show more efficient algorithms.

    HTH :)

    PS: this problem reminds me of hamming distance of error correcting codes, but I doubt you can easily apply this here.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

    update

    I just realized that you already sketched that approach in  Re^2: Finding Nearly Identical Sets . Not sure why you say it's ugly, cause a HoH should be quite fast, and you'd need to check anyway, if your input is equidistant to multiple neighbors.

      LanX,

      Thanks!

      I was actually looking at Hilberts Curve. As far as the 9 subset/hash, I outlined that solution yesterday in a response to BrowserUk. I was hoping for something more elegant which is why I asked here.

      Cheers - L~R

        Again, for me it is elegant.

        It's fast, easy to implement and its matching the requirements.

        Well unless you have more in mind than you told us.

        Better solutions (like a more straight away normalization) would require your codepoints to have a guaranteed minimal distance.

        This could certainly be achieved, but well regarding the generality of your question, it'll be far fetched to suppose such a structure.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Finding Nearly Identical Sets (Updated:4200/sec)
by BrowserUk (Pope) on Sep 29, 2016 at 06:54 UTC

    This randomly generates numbers (sets) -- 82% valid, 9% with a digit missing, 9% with an extra digit -- and builds 4 indexes as it runs:

    1. $knowns: is a 125MB bitmap that can hold all your 9-digit numbers, and has a bit set representing each number known (seen) good value.
    2. $deletions: is a 12.5MB bitmap that can hold all 8-digit numbers, and gets a bit set to represent each of the 8-digit numbers, that are a match-with-1-deletion, for each of the known 9-digit numbers.
    3. $insertions: Is a 1.25GB bitmap that can hold all 10-digit numbers, and gets a bit set to represent each of the 10-digit numbers, that are a match-with-1-insertion for each of the known 9 digit numbers.
    4. $transformations: Is a 125MB bitmap that holds all the match-with-1-substitutions, for each of the known 9-digit numbers.

    It requires a 64-bit perl > 5.18; consumes a fixed 3GB ram when running, and processes the numbers at a rate of 4000 per second; with a run of 10 million taking 40 minutes.

    The bitmaps can be saved to disk after a run using the option -SAVE and restored at the beginning of another run using the option -LOAD. The mechanism is currently quite crude using a fixed name of "$0.bin".

    It currently prints out whether a number matches a known number: exactly or with 1 deletion, insertion or substitution.

    It does not currently tell you which known number(s) it matches -- storing the information required to do that would require terabytes of data -- but given the process of generating the possibilities for any given 1-digit edit is so fast, repeating the process just for those detected -- approximately 0.01% for my randoms -- prior to doing your full message text match is no hardship.

    The comments are sparse, but should gives some cluebats. Yell if you want more.

    #! perl -slw use strict; use Time::HiRes qw[ time ]; use Data::Dump qw[ pp ]; sub receive{ my $n = join '', map 1+int( rand 9 ), 1 .. 9; chop $n if rand() < 0.1; $n .= 1+int( rand 9 ) if rand() > 0.9; return $n; } $|++; our $SAVE //= 0; our $LOAD //= 0; our $SRAND //= 0; srand( $SRAND ) if $SRAND; our $N //= 1e5; my( $knowns, $deletions, $insertions, $transformations ) = ( chr(0)x12 +5_000_000, chr(0)x12_500_000, chr(0)x1_250_000_000, chr(0)x125_000_00 +0 ); my( $known, $new, $deletion, $insertion, $transformed ) = (0) x 5; if( $LOAD ) { open BIN, '<:raw', "$0.bin" or die $!; { local $/ = \1_250_000_000; $insertions = <BIN>; } { local $/ = \ 125_000_000; $knowns = <BIN>; } { local $/ = \ 125_000_000; $transformations = <BIN>; } { local $/ = \ 12_500_000; $deletions = <BIN>; } close BIN; $known = unpack '%32b*', $knowns; } my $start = time; for ( 1 .. $N ) { printf STDERR "\r%10d\t", $_ unless $_ % 1000; my $received = receive(); if( length( $received ) == 9 ) { if( vec( $knowns, $received, 1 ) ) { print( "$received matched a known number" ); ++$known; next; } elsif( vec( $transformations, $received, 1 ) ) { print "$received matches a known number with a substitutio +n."; ++$transformed; next; } } elsif( length( $received ) == 8 ) { if( vec( $deletions, $received, 1 ) ) { print "$received matches a known number with a deletion."; ++$deletion; } next; } elsif( length( $received ) == 10 ) { if( vec( $insertions, $received, 1 ) ) { print "$received matches a known number with an insertion. +"; ++$insertion; } next; } ++$new, vec( $knowns, $received, 1 ) = 1; ## new number for my $pos ( 0 .. 8 ) { my $copy; vec( $deletions, substr( $copy = $received, $pos, 1, '' ), 1 + ) = 1; ## add all possible 1-digit deletions to th +eir index vec( $insertions, substr( $copy = $received, $pos, 1, $_ ), 1 + ) = 1 for 1 .. 9; ## add all possible 1-digit insertions to t +heir index my $digit = substr( $received, $pos, 1 ); vec( $transformations, substr( $copy = $received, $pos, 1, $_ +), 1 ) = 1 for 1 .. $digit-1, $digit+1 .. 9; ## all poss 1-digit sub +stitutions } vec( $insertions, $received . $_, 1 ) = 1 for 1 .. 9; + ## all possible insertions after last digi +t. } printf STDERR "From %d received there were: %d new; %d known; %d delet +ions; %d insertions; %d transformations.\n", $N, $new, $known, $deletion, $insertion, $transformed; printf STDERR "Numbers processed at a rate of %.f/second\n", $N / ( ti +me() - $start ); if( $SAVE ) { open BIN, '>:raw', "$0.bin" or die $!; printf BIN "%s%s%s%s", $knowns, $deletions, $insertions, $transfor +mations; close BIN; } __END__ C:\test>\Perl22\bin\perl.exe 1172842.pl -N=1e7 -SAVE > null 10000000 From 10000000 received there were: 8116086 new; 86155 +known; 0 deletions; 9459 insertions; 0 transformations. Numbers processed at a rate of 4082/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=1e7 -SAVE -LOAD > null 10000000 From 10000000 received there were: 7963211 new; 836492 +9 known; 0 deletions; 24187 insertions; 0 transformations. Numbers processed at a rate of 4218/second

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.i; } { local $/ = \12_500_000; $deletions = /li

      FWIW: my fears that the transforms were not being detected were unfounded. There are 10 times as many possibilities for 1-insertion matches than there are for the deletion & substitution varieties, so it starts detecting insertions almost straight away, but the other two not so. However, once the number of known instances gets high enough, they start to show in roughly the expected proportions:

      C:\test>\Perl22\bin\perl.exe 1172842.pl -N=1e6 -SAVE -LOAD > null 1000000 From 1000000 received there were: 767637 new; 16713734 + known; 0 deletions; 2899 insertions; 19788 transformations. Numbers processed at a rate of 4366/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=5e7 -SAVE -LOAD > null 50000000 From 50000000 received there were: 36232459 new; 21068 +158 known; 0 deletions; 353102 insertions; 1173728 transformations. Numbers processed at a rate of 4592/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=5e7 -SAVE -LOAD > null 50000000 From 50000000 received there were: 33583555 new; 52246 +313 known; 0 deletions; 650237 insertions; 1281578 transformations. Numbers processed at a rate of 4767/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=5e7 -SAVE -LOAD > null 50000000 From 50000000 received there were: 30372628 new; 86174 +891 known; 0 deletions; 899052 insertions; 1715170 transformations. Numbers processed at a rate of 5344/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=5e7 -SAVE -LOAD > null 50000000 From 50000000 received there were: 27148533 new; 11871 +9247 known; 93783 deletions; 1103187 insertions; 2202499 transformati +ons. Numbers processed at a rate of 5900/second

      Note also that this code really loves having the system to itself -- the last couple of runs were overnight when the machine was otherwise idle -- and the performance jumps up from 4200/s to 5900/s; +40%.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,

      Thanks - I only skim read the code but I think I understand how it works. Unfortunately, a Boolean of yes/no regarding if it has been seen before without being able to retrieve the near matches isn't going to be practical.

      I will check back in later.

      Cheers - L~R

        Unfortunately, a Boolean of yes/no regarding if it has been seen before without being able to retrieve the near matches isn't going to be practical.

        This is just the first, very fast, filter. The same method that generates *all the near matches* for *all* the known sets, in this filter, can be used again in a second pass, on individual near matches, to find the number(s) they match to.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding Nearly Identical Sets
by shmem (Chancellor) on Sep 28, 2016 at 21:21 UTC

    First task to do is to prepare yourself a big bowl of lime blossom tea (tilia). Let it brew for 10 minutes, take care that it doesn't become cold - a thermos flask would keep it hot. Add plenty of honey, a pinch of salt, and lemon juice. Prepare your bed with some towels on the bed sheet, have more towels to cover yourself; get into bed, cover yourself with towels then blanket(s), and drink the tea really really hot, just below the temperature which would hurt you. Keep yourself covered snug and warm, close your eyes and sweat the crap out. --

    Next, the identifier thing. I have the gut feeling that a solution involving binary trees and partial match could provide a nice solution. Whilst you sweat along, I'm going to evolve that thought as I am typing.

    Since you say that you accept Any change to the order (1,2,3,4,5,6,7,8,9 instead of 9,8,7,6,5,4,3,2,1) you may as well store keys with an exact length as a split/sort/concatenate key. Then again, you could also store the reversed sorted key in the B-Tree database.

    If you do a lookup of the split/sort/concatenate candidate in the database, you could get a full match - done. If you get a partial match, look up the reversed sorted candidate. The difference between the sum of both partial key lengths and the required key length should be 1, no matter whether insertion, deletion or transformation of 1 digit is the cause of difference. If it is any longer, more than 1 transformation has been done to the expected key.

    See DB_File's BTREE section, and maybe IP prefixes: match IP against "best" network.

    update

    Ahem. The difference between the sum of both partial match keys to the expected length should be 1 for insertion/deletion, and 2 for transformation, I guess...not. - I'm too tired right now to code the algorithm...

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Finding Nearly Identical Sets
by herveus (Parson) on Sep 28, 2016 at 19:01 UTC
    Howdy!

    Using the word "set" is a bit confusing. What I think you have is a string of digits. Going from that, I think some sort of Levenshtein distance is involved here. I note the existance of Text::Levenshtein on CPAN.

    yours,
    Michael
      herveus,

      It isn't a string of digits though you could think of them that way.

      Have you ever tried executing the Levenshtein edit distance a trillion times? Even the XS version isn't that fast. Let's say I get 2 million messages a day and I have 500K different sets/strings to compare against - this isn't the way to go.

      Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1172842]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (3)
As of 2020-07-10 03:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?