http://www.perlmonks.org?node_id=1067353


in reply to Re^3: Comparing two arrays
in thread Comparing two arrays

BrowserUk:

That's certainly true, for the original problem. I was intending to refute the statement "because one and only one vector will map to the same hash value", but I didn't take the rest of the thread into context. (I corrected the node accordingly.)

However, you needn't compare each X fully against each Y, either. Just like your Bloom filter project a while ago, there may be ways to transform the problem so we don't have to explicitly compare vectors against each other. I've been working on one, but I haven't posted it because its performance is currently worse than direct comparison, and most of the changes I make to it slow it down even further.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^5: Comparing two arrays
by BrowserUk (Patriarch) on Dec 16, 2013 at 17:21 UTC

    Feel free to add your method to my benchmark. We can ignore the timings.

    If you set the iteration count to 1 (-I=1), then it will display the top 10 Ys matching each X along with the number of 1s they have in common. If you can match the results those three methods all concur on without doing the full X x Y x 15_000 bits comparison, you'll have proved your case:

    #! perl -slw
    use strict;
    use Benchmark qw[ cmpthese ];
    use Data::Dump qw[ pp ];
    $Data::Dump::WIDTH = 500;

    our $I //= -1;
    our $N //= 1000;

    our @xArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;
    our @yArrays = map[ map int( rand 2 ), 1 .. 15_000 ], 1 .. $N;

    our @xStrings = map{ join '', @$_ } @xArrays;
    our @yStrings = map{ join '', @$_ } @yArrays;

    our @xBits = map{ pack 'b*', $_ } @xStrings;
    our @yBits = map{ pack 'b*', $_ } @yStrings;

    cmpthese $I, {
        array => q[
            my %top10s;
            for my $x ( 0 .. $#xArrays ) {
                for my $y ( 0 .. $#yArrays ) {
                    my $count = 0;
                    $xArrays[$x][$_] == 1 && $yArrays[$y][$_] == 1 and ++$count for 0 .. $#{ $xArrays[ 0 ] };
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp ' arrays: ', %top10s;
        ],
        strings => q[
            my %top10s;
            for my $x ( 0 .. $#xStrings ) {
                for my $y ( 0 .. $#yStrings ) {
                    my $count = ( $xStrings[$x] & $yStrings[$y] ) =~ tr[1][];
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp 'strings: ', %top10s;
        ],
        bits => q[
            my %top10s;
            for my $x ( 0 .. $#xBits ) {
                for my $y ( 0 .. $#yBits ) {
                    my $count = unpack '%32b*', ( $xBits[$x] & $yBits[$y] );
                    $top10s{"$x:$y"} = $count;
                    my $discard = ( sort{ $top10s{$a} <=> $top10s{$b} } keys %top10s )[ 0 ];
                    keys( %top10s ) > 10 and delete $top10s{$discard};
                }
            }
            $I == 1 and pp ' bits: ', %top10s;
        ],
    };

    __END__
    C:\test>1067218 -I=1 -N=100
    (" arrays: ", "44:16", 3911, "23:58", 3913, "78:4", 3907, "54:24", 3909, "10:16", 3929, "78:24", 3928, "23:16", 3920, "23:24", 3922, "58:56", 3917, "54:58", 3914)
                (warning: too few iterations for a reliable count)
    (" bits: ", "44:16", 3911, "23:58", 3913, "78:4", 3907, "54:24", 3909, "10:16", 3929, "78:24", 3928, "23:16", 3920, "23:24", 3922, "58:56", 3917, "54:58", 3914)
                (warning: too few iterations for a reliable count)
    ("strings: ", "44:16", 3911, "23:58", 3913, "78:4", 3907, "54:24", 3909, "10:16", 3929, "78:24", 3928, "23:16", 3920, "23:24", 3922, "58:56", 3917, "54:58", 3914)
                (warning: too few iterations for a reliable count)
                   Rate    array  strings     bits
    array   1.98e-002/s       --     -98%    -100%
    strings      1.12/s    5574%       --     -82%
    bits         6.41/s   32272%     471%       --
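To see why the three routines concur, here is a minimal sketch that counts the shared 1s of two tiny vectors in the same three ways the benchmark does. The 8-element vectors and the variable names are illustrative assumptions, not part of the benchmark.

```perl
use strict;
use warnings;

# Two small illustrative vectors (assumed data, not from the benchmark).
my @x = ( 1, 0, 1, 1, 0, 0, 1, 0 );
my @y = ( 1, 1, 0, 1, 0, 0, 1, 1 );

# 1. Arrays: test every position explicitly.
#    grep in scalar context returns the number of matching positions.
my $by_array = grep { $x[$_] == 1 && $y[$_] == 1 } 0 .. $#x;

# 2. Strings: a bitwise string AND turns '1' & '1' into '1' and every
#    other pairing of '0'/'1' into '0'; tr then counts the '1's.
my $xs = join '', @x;
my $ys = join '', @y;
my $by_string = ( $xs & $ys ) =~ tr/1//;

# 3. Bits: pack to one bit per element, AND the bit strings, then let
#    unpack's %32 checksum sum the set bits.
my $by_bits = unpack '%32b*', pack( 'b*', $xs ) & pack( 'b*', $ys );

print "array=$by_array strings=$by_string bits=$by_bits\n";
```

Positions 0, 3 and 6 hold a 1 in both vectors, so all three counts come out as 3. The sketch deliberately leaves the `bitwise` feature off, so `&` on two strings acts as a string operation, as it does in the benchmark.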

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      BrowserUk:

      OK, I've added my routine to your benchmark program, wrapped in readmore tags below.

      I've had to add and edit a bit of code outside the actual compared routines. The OP mentioned that the ratio of 1 to 0 entries is 1:500, a fact I used to come up with my approach. So the first change is the ability to set the probability of 1 bits.

      Since the 1 bits are sparse, rather than making an explicit vector, I use a list of the positions of the 1 bits. In order to come up with the same results, I converted the Y vectors to the same list format.
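As a rough illustration of those two changes, the sketch below generates a 0/1 vector with a chosen probability of 1s and converts a vector into a list of its 1 positions. The helper names `random_vector` and `ones_positions` are hypothetical, and the width of 4000 is simply borrowed from the `-W=4000` runs; this is not the posted code.

```perl
use strict;
use warnings;

# Hypothetical helper: a random 0/1 vector of width $w in which each
# element is 1 with probability $p.
sub random_vector {
    my ( $w, $p ) = @_;
    return [ map { rand() < $p ? 1 : 0 } 1 .. $w ];
}

# Hypothetical helper: the indices at which a 0/1 vector holds a 1.
sub ones_positions {
    my ($vec) = @_;
    return [ grep { $vec->[$_] } 0 .. $#$vec ];
}

my $dense  = [ 0, 0, 1, 0, 0, 0, 0, 1, 0, 0 ];
my $sparse = ones_positions($dense);
print "@$sparse\n";    # prints "2 7"

# At the OP's 1:500 density, a 4000-wide vector holds about 8 ones.
my $random = random_vector( 4000, 1 / 500 );
printf "%d ones out of %d\n", scalar @{ ones_positions($random) }, scalar @$random;
```

At that density the position list is far shorter than the vector itself, which is what makes the list form attractive here.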

      The biggest change outside of the comparison routine is the setup routine that transforms the xArray. The concept is this: build an artificial set of vectors, each with a single 1 bit, one vector per bit position. Comparing each of these artificial vectors against the xArray set yields, for each bit position, a list (bin) of the X vectors that have a 1 there. Then, in the comparison, we aggregate the selected bins: if a Y has five 1 bits in it, we add in the five partial products from the eigenset vectors. The cost of building the lists is thus amortized over the run of comparisons.
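The idea can be sketched as a per-position index over the X vectors. This is a minimal illustration under assumed data, not the routine that was posted: the tiny position lists and the names `%x_by_pos` and `%common` are made up for the example.

```perl
use strict;
use warnings;

# Each vector is a list of the positions of its 1 bits (assumed data).
my @x_sparse = ( [ 0, 3, 6 ], [ 1, 3 ], [ 2, 6, 7 ] );
my @y_sparse = ( [ 3, 6 ], [ 0, 1, 2 ] );

# Setup, done once: for every bit position, the list of X vectors
# that have a 1 there.
my %x_by_pos;
for my $x ( 0 .. $#x_sparse ) {
    push @{ $x_by_pos{$_} }, $x for @{ $x_sparse[$x] };
}

# Comparison: for each Y, walk the bins of its 1 positions. Every
# appearance of an X index adds one shared 1 to that X:Y pair.
my %common;
for my $y ( 0 .. $#y_sparse ) {
    for my $pos ( @{ $y_sparse[$y] } ) {
        $common{"$_:$y"}++ for @{ $x_by_pos{$pos} || [] };
    }
}

print "$_ => $common{$_}\n" for sort keys %common;
```

Here X 0 shares positions 3 and 6 with Y 0, so `0:0` ends up as 2 and every other pair as 1. The work per Y is proportional to the number of 1s it has times the size of the bins it touches, which fits the observation that the approach slows down as the density of 1s rises. Pairs with no shared 1s never appear in `%common`.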

      Having said all that, here it is. As mentioned previously, I came up with my approach when I saw that the distribution of 1s was very sparse. As the density of 1s increases, the routine gets progressively slower.

      $ perl 1067357_mcm.pl -I=1 -N=100 -W=4000 -P=.05
      <<< snipped >>>
                    Rate   array strings    bits    robo
      array   2.33e-02/s      --    -98%    -99%   -100%
      strings     1.45/s   6122%      --    -25%    -72%
      bits        1.92/s   8156%     33%      --    -63%
      robo        5.26/s  22495%    263%    174%      --

      $ perl 1067357_mcm.pl -I=1 -N=100 -W=4000 -P=.5
      <<< snipped >>>
              s/iter   array    robo strings    bits
      array     60.4      --    -87%    -99%    -99%
      robo      7.75    680%      --    -91%    -93%
      strings  0.690   8659%   1023%      --    -23%
      bits     0.530  11304%   1362%     30%      --

      ...roboticus


        I came up with my approach when I saw that the distribution of 1s was very sparse. As the density of 1s increases, the routine gets progressively slower.

        Nonetheless, the challenge is met and I stand corrected.


Re^5: Comparing two arrays
by BrowserUk (Patriarch) on Dec 16, 2013 at 16:30 UTC
    However, you needn't compare each X fully against each Y, either. Just like your Bloom filter project a while ago, there may be ways to transform the problem so we don't have to explicitly compare vectors against each other.

    I really don't believe that is possible. Please prove me wrong?

