Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Re: Finding Nearly Identical Sets (Updated:4200/sec)

by BrowserUk (Pope)
on Sep 29, 2016 at 06:54 UTC ( #1172893=note: print w/replies, xml ) Need Help??


in reply to Finding Nearly Identical Sets

This randomly generates numbers (sets) -- 82% valid, 9% with a digit missing, 9% with an extra digit -- and builds 4 indexes as it runs:

  1. $knowns: is a 125MB bitmap that can hold all your 9-digit numbers, and has a bit set representing each number known (seen) good value.
  2. $deletions: is a 12.5MB bitmap that can hold all 8-digit numbers, and gets a bit set to represent each of the 8-digit numbers, that are a match-with-1-deletion, for each of the known 9-digit numbers.
  3. $insertions: Is a 1.25GB bitmap that can hold all 10-digit numbers, and gets a bit set to represent each of the 10-digit numbers, that are a match-with-1-insertion for each of the known 9 digit numbers.
  4. $transformations: Is a 125MB bitmap that holds all the match-with-1-substitutions, for each of the known 9-digit numbers.

It requires a 64-bit perl > 5.18; consumes a fixed 3GB ram when running, and processes the numbers at a rate of 4000 per second; with a run of 10 million taking 40 minutes.

The bitmaps can be saved to disk after a run using the option -SAVE and restored at the beginning of another run using the option -LOAD. The mechanism is currently quite crude using a fixed name of "$0.bin".

It currently prints out whether a number matches a known number: exactly or with 1 deletion, insertion or substitution.

It does not currently tell you which known number(s) it matches -- storing the information required to do that would require terabytes of data -- but given the process of generating the possibilities for any given 1-digit edit is so fast, repeating the process just for those detected -- approximately 0.01% for my randoms -- prior to doing your full message text match is no hardship.

The comments are sparse, but should gives some cluebats. Yell if you want more.

#! perl -slw use strict; use Time::HiRes qw[ time ]; use Data::Dump qw[ pp ]; sub receive{ my $n = join '', map 1+int( rand 9 ), 1 .. 9; chop $n if rand() < 0.1; $n .= 1+int( rand 9 ) if rand() > 0.9; return $n; } $|++; our $SAVE //= 0; our $LOAD //= 0; our $SRAND //= 0; srand( $SRAND ) if $SRAND; our $N //= 1e5; my( $knowns, $deletions, $insertions, $transformations ) = ( chr(0)x12 +5_000_000, chr(0)x12_500_000, chr(0)x1_250_000_000, chr(0)x125_000_00 +0 ); my( $known, $new, $deletion, $insertion, $transformed ) = (0) x 5; if( $LOAD ) { open BIN, '<:raw', "$0.bin" or die $!; { local $/ = \1_250_000_000; $insertions = <BIN>; } { local $/ = \ 125_000_000; $knowns = <BIN>; } { local $/ = \ 125_000_000; $transformations = <BIN>; } { local $/ = \ 12_500_000; $deletions = <BIN>; } close BIN; $known = unpack '%32b*', $knowns; } my $start = time; for ( 1 .. $N ) { printf STDERR "\r%10d\t", $_ unless $_ % 1000; my $received = receive(); if( length( $received ) == 9 ) { if( vec( $knowns, $received, 1 ) ) { print( "$received matched a known number" ); ++$known; next; } elsif( vec( $transformations, $received, 1 ) ) { print "$received matches a known number with a substitutio +n."; ++$transformed; next; } } elsif( length( $received ) == 8 ) { if( vec( $deletions, $received, 1 ) ) { print "$received matches a known number with a deletion."; ++$deletion; } next; } elsif( length( $received ) == 10 ) { if( vec( $insertions, $received, 1 ) ) { print "$received matches a known number with an insertion. +"; ++$insertion; } next; } ++$new, vec( $knowns, $received, 1 ) = 1; ## new number for my $pos ( 0 .. 8 ) { my $copy; vec( $deletions, substr( $copy = $received, $pos, 1, '' ), 1 + ) = 1; ## add all possible 1-digit deletions to th +eir index vec( $insertions, substr( $copy = $received, $pos, 1, $_ ), 1 + ) = 1 for 1 .. 9; ## add all possible 1-digit insertions to t +heir index my $digit = substr( $received, $pos, 1 ); vec( $transformations, substr( $copy = $received, $pos, 1, $_ +), 1 ) = 1 for 1 .. $digit-1, $digit+1 .. 9; ## all poss 1-digit sub +stitutions } vec( $insertions, $received . $_, 1 ) = 1 for 1 .. 9; + ## all possible insertions after last digi +t. } printf STDERR "From %d received there were: %d new; %d known; %d delet +ions; %d insertions; %d transformations.\n", $N, $new, $known, $deletion, $insertion, $transformed; printf STDERR "Numbers processed at a rate of %.f/second\n", $N / ( ti +me() - $start ); if( $SAVE ) { open BIN, '>:raw', "$0.bin" or die $!; printf BIN "%s%s%s%s", $knowns, $deletions, $insertions, $transfor +mations; close BIN; } __END__ C:\test>\Perl22\bin\perl.exe 1172842.pl -N=1e7 -SAVE > null 10000000 From 10000000 received there were: 8116086 new; 86155 +known; 0 deletions; 9459 insertions; 0 transformations. Numbers processed at a rate of 4082/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=1e7 -SAVE -LOAD > null 10000000 From 10000000 received there were: 7963211 new; 836492 +9 known; 0 deletions; 24187 insertions; 0 transformations. Numbers processed at a rate of 4218/second

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.i; } { local $/ = \12_500_000; $deletions = /li

Replies are listed 'Best First'.
Re^2: Finding Nearly Identical Sets (Updated:4200/sec)
by BrowserUk (Pope) on Sep 30, 2016 at 04:31 UTC

    FWIW: my fears that the transforms were not being detected were unfounded. There are 10 times as many possibilities for 1-insertion matches than there are for the deletion & substitution varieties, so it starts detecting insertions almost straight away, but the other two not so. However, once the number of known instances gets high enough, they start to show in roughly the expected proportions:

    C:\test>\Perl22\bin\perl.exe 1172842.pl -N=1e6 -SAVE -LOAD > null 1000000 From 1000000 received there were: 767637 new; 16713734 + known; 0 deletions; 2899 insertions; 19788 transformations. Numbers processed at a rate of 4366/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=5e7 -SAVE -LOAD > null 50000000 From 50000000 received there were: 36232459 new; 21068 +158 known; 0 deletions; 353102 insertions; 1173728 transformations. Numbers processed at a rate of 4592/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=5e7 -SAVE -LOAD > null 50000000 From 50000000 received there were: 33583555 new; 52246 +313 known; 0 deletions; 650237 insertions; 1281578 transformations. Numbers processed at a rate of 4767/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=5e7 -SAVE -LOAD > null 50000000 From 50000000 received there were: 30372628 new; 86174 +891 known; 0 deletions; 899052 insertions; 1715170 transformations. Numbers processed at a rate of 5344/second C:\test>\Perl22\bin\perl.exe 1172842.pl -N=5e7 -SAVE -LOAD > null 50000000 From 50000000 received there were: 27148533 new; 11871 +9247 known; 93783 deletions; 1103187 insertions; 2202499 transformati +ons. Numbers processed at a rate of 5900/second

    Note also that this code really loves having the system to itself -- the last couple of runs were overnight when the machine was otherwise idle -- and the performance jumps up from 4200/s to 5900/s; +40%.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Finding Nearly Identical Sets (Updated:4200/sec)
by Limbic~Region (Chancellor) on Sep 29, 2016 at 13:42 UTC
    BrowserUk,

    Thanks - I only skim read the code but I think I understand how it works. Unfortunately, a Boolean of yes/no regarding if it has been seen before without being able to retrieve the near matches isn't going to be practical.

    I will check back in later.

    Cheers - L~R

      Unfortunately, a Boolean of yes/no regarding if it has been seen before without being able to retrieve the near matches isn't going to be practical.

      This is just the first, very fast, filter. The same method that generates *all the near matches* for *all* the known sets, in this filter, can be used again in a second pass, on individual near matches, to find the number(s) they match to.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk,
        True. I am not sure if that will be viable in the overall application or not. As I mentioned to you previously, I'm not sure an in-memory solution will work because of parallel processing. It is definitely food for thought.

        Cheers - L~R

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1172893]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2021-03-02 20:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favorite kind of desktop background is:











    Results (62 votes). Check out past polls.

    Notices?