PerlMonks  

Re^2: Finding Nearly Identical Sets (Updated:4200/sec)

by Limbic~Region (Chancellor)
on Sep 29, 2016 at 13:42 UTC


in reply to Re: Finding Nearly Identical Sets (Updated:4200/sec)
in thread Finding Nearly Identical Sets

BrowserUk,

Thanks - I only skim read the code but I think I understand how it works. Unfortunately, a Boolean of yes/no regarding if it has been seen before without being able to retrieve the near matches isn't going to be practical.

I will check back in later.

Cheers - L~R

Re^3: Finding Nearly Identical Sets (Updated:4200/sec)
by BrowserUk (Pope) on Sep 29, 2016 at 14:08 UTC
    Unfortunately, a Boolean of yes/no regarding if it has been seen before without being able to retrieve the near matches isn't going to be practical.

    This is just the first, very fast filter. The same method that generates *all the near matches* for *all* the known sets in this filter can be used again in a second pass, on the individual near matches, to find the number(s) they match.
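
The code under discussion is not reproduced in this part of the thread, so what follows is only a hypothetical sketch of the two-pass idea, written in Python. It assumes that "near match" means two sets differ by at most one added, removed, or substituted element, and that the variants are hashed with CRC32 into a bitmap. The names `NearMatchIndex`, `variants`, `slot`, and the bitmap size `BITS` are all made up for illustration and are not taken from the original code.

```python
from collections import defaultdict
from zlib import crc32

BITS = 1 << 20  # hypothetical bitmap size, in bits


def variants(s):
    """Yield the sorted set itself, then every copy with one element dropped."""
    items = tuple(sorted(s))
    yield items
    for i in range(len(items)):
        yield items[:i] + items[i + 1:]


def slot(variant):
    """Map a variant to a bit position (assumed hashing scheme)."""
    return crc32(repr(variant).encode()) % BITS


class NearMatchIndex:
    def __init__(self):
        self.bitmap = bytearray(BITS // 8)   # pass 1: one bit per hashed variant
        self.owners = defaultdict(set)       # pass 2: variant -> ids of known sets

    def add(self, set_id, s):
        """Register every variant of a known set in both structures."""
        for v in variants(s):
            n = slot(v)
            self.bitmap[n >> 3] |= 1 << (n & 7)
            self.owners[v].add(set_id)

    def maybe_seen(self, s):
        """Pass 1: fast yes/no. Hash collisions can cause false positives."""
        for v in variants(s):
            n = slot(v)
            if self.bitmap[n >> 3] & (1 << (n & 7)):
                return True
        return False

    def near_matches(self, s):
        """Pass 2: regenerate the variants and collect the sets that own them."""
        found = set()
        for v in variants(s):
            found |= self.owners.get(v, set())
        return found
```

In use, `near_matches` would only be called for candidates where `maybe_seen` returns true, so the slower lookup is paid only on filter hits. The `owners` map is kept in memory here purely to keep the sketch short; nothing in the thread says where the second pass would get its data.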


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      True. I am not sure if that will be viable in the overall application or not. As I mentioned to you previously, I'm not sure an in-memory solution will work because of parallel processing. It is definitely food for thought.

      Cheers - L~R

        I'm not sure an in-memory solution will work because of parallel processing.

        Hm. The primary reason -- there are others -- for using parallel processing is: speed.

        I pretty much guarantee that you will not be able to achieve 500/s using a disk-based file or DB, let alone 5000/s -- disk access is at least 100,000 times slower than memory -- which means you now need 10 processors instead of one just to get back to par.

        And if 5000/s isn't enough? Put the bitmaps in shared memory (NOT threads::shared) and run multiple threads...
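
As a rough illustration of keeping a bitmap in shared memory, here is a minimal Python sketch using the standard `multiprocessing.shared_memory` module (Python 3.8+). It is an analogue of the suggestion above, not the Perl mechanism the post has in mind, and it uses processes attaching by name rather than threads. The helper names `create_bitmap`, `set_bit`, `bit_is_set`, and `worker_lookup` are hypothetical.

```python
from multiprocessing import shared_memory


def create_bitmap(n_bits):
    """Allocate a shared block big enough for n_bits bits (starts zero-filled)."""
    return shared_memory.SharedMemory(create=True, size=(n_bits + 7) // 8)


def set_bit(shm, n):
    """Set bit n in the shared block."""
    shm.buf[n >> 3] |= 1 << (n & 7)


def bit_is_set(shm, n):
    """Return True if bit n is set in the shared block."""
    return bool(shm.buf[n >> 3] & (1 << (n & 7)))


def worker_lookup(name, n):
    """What a worker would do: attach to the block by name, read, detach."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return bit_is_set(shm, n)
    finally:
        shm.close()
```

The owning process would pass `shm.name` to each worker it starts, and call `close()` followed by `unlink()` once all workers are done. Because the workers only read the bitmap after it has been built, no locking is shown here; concurrent writers would need some form of synchronisation.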

        Anyway, good luck with the project whichever way you choose to go :)


