Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re: Finding Nearly Identical Sets

by BrowserUk (Patriarch)
on Sep 28, 2016 at 16:36 UTC ( [id://1172848]=note: print w/replies, xml ) Need Help??


in reply to Finding Nearly Identical Sets

A few questions:

  • You already have the set of known 9-digits numbers to compare against?

    If so, how are they currently stored?

    If not, will you just take the numbers from the first batch as 'good'?

    Or are you hoping to cross-check the first batch against the second and detect errors that might be in the first set?

    If you have a known good set, then it is simply a case of checking if the number found already exists in that known set, and that can be done very quickly using a 106MB bitstring.

  • You already have a method of finding & extracting the number from the message; even if it is 8 or 10 digits.

    Ie. Its location is fixed, or clearly prefixed, or it will be the only large (say more than 7-digits) number in the message.

    What I'm getting at here is the insertions or deletions are easily detected, because the numbers are the wrong length. That only really leaves transpositions to detect?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Finding Nearly Identical Sets
by Limbic~Region (Chancellor) on Sep 28, 2016 at 17:33 UTC
    BrowserUk,
    I am going to answer some of your questions out of order (easiest to answer to hardest).

    You already have a method of finding & extracting the number from the message; even if it is 8 or 10 digits.

    Correct

    What I'm getting at here is the insertions or deletions are easily detected, because the numbers are the wrong length. That only really leaves transpositions to detect?

    Incorrect. While it is trivial to determine that a mistake has been made, it isn't necessarily trivial to determine what the mistake was (which set the number should have been).

    You already have the set of known 9-digits numbers to compare against?

    No. Think of it this way. The set represents what should be a unique identifier for the message sender. The first time I receive a message with a valid set (meets my aforementioned criterion) that doesn't appear to be similar to any other previously seen set - I will use that set to identify that sender. Not because it is correct but because it was first. A subsequent message from that sender may be objectively correct but for my purposes I don't know - I only know that it is nearly identical to what I have already seen.

    If so, how are they currently stored?

    I'm still figuring that out but because there will be parallel processing going on, I am assuming it will be in a database and finding good candidates will need to be 1 or more index based queries.

    Hopefully I have answered all of your questions satisfactorily enough to advance the problem but my cold medicine is really doing a number on my cognitive abilities.

    Cheers - L~R

      a valid set

      When you say "set", you're referring to one group of nine, single digits? (I'm wondering why "set" rather than just 'number'?)

      With nine digits, 1 .. 9, there are 888,888,888 possibilities; how many of those do you expect to be "receiving"?

      Over what time period? Do you reset at any point?


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk,

        I was trying to reduce the problem down to only the parts that matter. Through out the message, there are 10 markers - I am extracting and only interested in 9 of them. It is weird and highly domain specific (which I can't discuss).

        This "set" of numbers for a given message is valid over a few days (let's say 10 days tops). The reset happens when the sender's right to send expires and they are removed so it is possible for the same numbers to be re-used at some point in the future but they would be considered "new" then. At any one time, I think I would have around 500K of the nearly 900 million possible.

        Cheers - L~R

Re^2: Finding Nearly Identical Sets
by Limbic~Region (Chancellor) on Sep 28, 2016 at 18:36 UTC
    BrowserUk,
    If it helps, you don't have to think of them as sets but as strings or numbers (driver's license, passport, employee ID, etc.) with the understanding that they are expected to not contain zero anywhere and should be a length of 9.

    For instance, let's say the program has '198472385'

    1. Is it less than 8 characters long? If yes, abort
    2. Is it more than 10 characters long? If yes, abort
    3. Has this "thing" already been seen? If yes, we are done
    4. Has a variation of this thing been seen (same digits but in a different order)? If yes, we are done
    5. Are there any other "things" nearly identical to this one regardless of the order of the digits (can I transform this "thing" into a previously seen "thing" with exactly one operation (insert, delete, transform))? If yes, which ones?

    Items 1 - 4 are obviously trivial. I don't even need to be perfect with number 5 though that would make me happy. I am trying to find a way to quickly/cheaply find candidates to number 5.

    One way to do it would be just to have a bunch of indices. For instance, for "a single digit has been transformed".

    1. When storing - sort the set
    2. For each of the 9 possible positions that could be changed in the future, assume each one was in fact changed and remove that character and concatenate the remaining 8 characters into an index/hash
    3. When checking candidates, duplicate the process above checking all 9 indices for an exact match
    The above scenario would work perfectly for transforms but requires you to maintain 9 different indices and the searches are ugly.

    Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1172848]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (2)
As of 2024-04-26 06:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found