Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re: Finding Nearly Identical Sets

by LanX (Cardinal)
on Sep 28, 2016 at 19:56 UTC ( #1172862=note: print w/replies, xml ) Need Help??


in reply to Finding Nearly Identical Sets

IIRC these structures are called multisets because some elements are repeated in one of your examples.

If I understand your requirements correctly, you can use your approach in a pragmatic way, because any "neighboring" multi sets must have at least 8 digits in common.

So

  • Sort the 9 elements of the original multi set
  • Calculate all distinct 8 element subsets (at most 9)
  • Create hash keys by joining them
  • push the original set into a HoA ( or HoH) for each subset.

At the end you'll only need 9 hash look ups to drastically narrow down potential candidates.

NB: That's a pragmatic approach, a detailed survey might show more efficient algorithms.

HTH :)

PS: this problem reminds me of hamming distance of error correcting codes, but I doubt you can easily apply this here.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

update

I just realized that you already sketched that approach in  Re^2: Finding Nearly Identical Sets . Not sure why you say it's ugly, cause a HoH should be quite fast, and you'd need to check anyway, if your input is equidistant to multiple neighbors.

Replies are listed 'Best First'.
Re^2: Finding Nearly Identical Sets
by Limbic~Region (Chancellor) on Sep 29, 2016 at 11:54 UTC
    LanX,

    Thanks!

    I was actually looking at Hilberts Curve. As far as the 9 subset/hash, I outlined that solution yesterday in a response to BrowserUk. I was hoping for something more elegant which is why I asked here.

    Cheers - L~R

      Again, for me it is elegant.

      It's fast, easy to implement and its matching the requirements.

      Well unless you have more in mind than you told us.

      Better solutions (like a more straight away normalization) would require your codepoints to have a guaranteed minimal distance.

      This could certainly be achieved, but well regarding the generality of your question, it'll be far fetched to suppose such a structure.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1172862]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (3)
As of 2021-02-27 13:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?