Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re: Comparing two arrays

by sundialsvc4 (Abbot)
on Dec 16, 2013 at 13:05 UTC ( [id://1067326]=note: print w/replies, xml ) Need Help??


in reply to Comparing two arrays

Here’s another way to do it:   scan the arrays to convert them to a vector using some form of “run-length encoding.”   For instance, a list containing the position of the next '1' and the number of '1's that occur.   Now for the magic:   use a Digest algorithm of some sort ... could be CRC, SHA1, could be anything ... to convert that vector into a single value that can be used as a hash.

Actually, if you use a truly-decent message digest algorithm, like SHA1, you probably won’t have to do anything else, because one and only one vector will map to the same hash value.   You have to preprocess each array one time to compute its message-digest hash, but from there on out you might not have to examine the array contents at all.   A comparison of the digests ought to be sufficient.   Aside from the overhead of passing through each array one time to calculate its digest, the overhead of laborious comparison ... disappears.

Replies are listed 'Best First'.
Re^2: Comparing two arrays
by roboticus (Chancellor) on Dec 16, 2013 at 13:43 UTC

    sundialsvc4:

    If you're looking for duplicate vectors, digesting can greatly reduce the number of comparisons, but it won't let you escape them altogether: http://en.wikipedia.org/wiki/Pigeonhole_principle.

    Update: s/While/If you're looking for duplicate vectors/ and changed conjunction so the sentence still reads well.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      While digesting can greatly reduce the number of comparisons

      That would only be true if the OP was looking for exact matches. He isn't.

      He's looking for the best matches, where 'best' is defined in terms of the number of set bits in matching positions. No hashing, digesting nor sorting approach to this problem is possible.

      Every X must be fully compared against every Y.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        BrowserUk:

        That's certainly true, for the original problem. I was intending to refute the statement "because one and only one vector will map to the same hash value", but I didn't take the rest of the thread into context. (I corrected the node accordingly.)

        However, you needn't compare each X fully against each Y either, either. Just like your Bloom filter project a while ago, there may be ways to transform the problem so we don't have to explicitly compare vectors against each other. I've been working a bit on one, but I haven't posted it because the performance is currently worse than direct comparisons, and most of the changes I make to it slow it down even further.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re^2: Comparing two arrays
by choroba (Cardinal) on Dec 16, 2013 at 13:21 UTC
    one and only one vector will map to the same hash value
    Do you mean there are no hash collisions for SHA1?
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: Comparing two arrays
by BrowserUk (Patriarch) on Dec 16, 2013 at 15:23 UTC
    the overhead of laborious comparison ... disappears.

    Total garbage.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    A reply falls below the community's threshold of quality. You may see it by logging in.
Re^2: Comparing two arrays
by GotToBTru (Prior) on Dec 31, 2013 at 21:38 UTC

    This method was considered and discarded in the OP:

    I was thinking about computing a hash key (a checksum of come sort) but then i would be able to identify only identical (x,y) pairs.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1067326]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (8)
As of 2024-04-23 16:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found