http://www.perlmonks.org?node_id=1067233


in reply to Re: Comparing two arrays
in thread Comparing two arrays

you can sort the X arrays first, then do a binary search to find the closest X for each Y

Sort by what criteria? "closest" by what criteria?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: Comparing two arrays
by educated_foo (Vicar) on Dec 15, 2013 at 14:25 UTC
    Sort by what criteria? "closest" by what criteria?
    That obviously depends upon what the OP is trying to do. Maybe he wants lexicographic sorting, or maybe some other mapping; it's not specified in the question, so I didn't guess.
      yea , probably i wasn't explicit about that. but if i'm looking for the top 10 matches, this implies that i would first need to do a comparison each x with each y to generate a some scoring scheme which i would then need to use as my key for sorting but that beats the purpose. I could to triangulation between x's and then "sort" them into buckets so that x's form different buckets with no 1's in common, then i could say if my x1<=>y has 30 1's in common i know that x2<=>y will have at most (total 1's) - 30 and then go from that. but in the end it usually turns out that this actually slows the whole thing down rather then speeds it up.

      As far as bit-strings go. I just implemented the version of my program in perl and speed has amazingly increased, however my c++ version shows a slight decrease in runtime performance when compared to version where i used character arrays instead. I guess this is tied to optimization , maybe (or my bad implementation - which is most probably the true reason) but i guess this is more or less the max that i can get out of my machine (with respect to my programming abilities).

      However, I cannot shake this feeling that due to fact that i need only top 10 matches, maybe i could somehow use this to implement a good heuristics. I just cannot wrap my mind around the problem.

      anyway,

      thank you

      baxy

        however my c++ version shows a slight decrease in runtime performance when compared to version where i used character arrays instead. I guess this is tied to optimization , maybe (or my bad implementation -

        You could post the (relevant bit of) the C++ code. Maybe we could see something to help...


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        I just re-read your original post, and was reminded of this:
        there are about 500 times less 1's then 0's
        So for the average item, only 30 features are true, which changes things. You still can't do an all-to-all pairwise comparison in any reasonable amount of time, but you may be able to e.g. recursively partition your data by the feature closest to a 50/50 distribution, then find closest matches once you cut the comparison set down to something reasonable. If something simple like that won't work, you should look into dimensionality reduction, e.g. via random projection or PCA.