Re: Fast, Efficient Union and Intersection on arrays

by ptoulis (Scribe)
on Nov 20, 2008 at 22:44 UTC


in reply to Fast, Efficient Union and Intersection on arrays

That was an excellent exercise, barvin. As always, the speed of an algorithm depends on the nature of your problem (the input).

All the algorithms presented above theoretically have a running time of O(m+n), where m and n are the sizes of the lists. Still, the great ideas above differ considerably in how they interact with the Perl internals, and most are huge space consumers.

In terms of speed the best algorithm is, by far, BrowserUK's (the buk() function). On my Windows machine it is insanely fast compared to the others. However, it has some important weaknesses. First, it can't deal with non-numeric data, since vec needs a numeric offset, and in buk() the list elements themselves are used as offsets. Second, it fails on arrays with very large numbers as elements: e.g. if you set @a=(1,2) and @b=(3,4,10000000000), the program will die with the message Negative offset to vec in lvalue context at ....
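For readers following along, here is a minimal sketch (not BrowserUK's exact code; variable names are mine) of the bitstring technique: each element's value is used directly as a bit offset, the two bitstrings are AND-ed or OR-ed, and the surviving bits are counted with the unpack checksum:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the vec-based technique: set one bit per element, then
# combine the bitstrings and count bits with unpack's %32b* checksum.
my @a = ( 1, 2, 3, 5 );
my @b = ( 2, 3, 4 );

my ( $aBits, $bBits ) = ( '', '' );
vec( $aBits, $_, 1 ) = 1 for @a;    # element value doubles as bit offset,
vec( $bBits, $_, 1 ) = 1 for @b;    # hence the numeric-only restriction

my $iCount = unpack '%32b*', ( $aBits & $bBits );   # intersection size
my $uCount = unpack '%32b*', ( $aBits | $bBits );   # union size
print "intersection=$iCount union=$uCount\n";       # prints 2 and 5

# The second weakness: a huge element means a huge (or overflowed) offset.
# vec( my $big, 10_000_000_000, 1 ) = 1;   # dies on a 32-bit perl
```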

On the other hand, none of the other algorithms have the above weaknesses (they can deal with large numbers and handle non-numeric data), but they are much slower and also fail when the arrays have duplicated elements.

In summary, I guess there is no absolute truth when we speak about algorithms. The algorithms here run in O(m+n), which means they are affected by the larger array. To beat this, use some higher-order data structure, like a heap. I didn't test it, but if l=min(m,n) and r=max(m,n), you could reach O(l*log(r)). You pay an extra overhead when inserting (or removing) values, but search is still lightning fast.
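To make the O(l*log(r)) idea concrete, here is a sketch using a sorted array with binary search rather than a heap (easier to show in a few lines; the function name is illustrative and duplicate-free input is assumed):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: intersect by binary-searching each element of the smaller
# list in a sorted copy of the larger one -- l searches of log(r) steps
# each, after an O(r log r) sort.
sub intersect_size {
    my ( $smallRef, $bigRef ) = @_;
    my @sorted = sort { $a <=> $b } @$bigRef;
    my $count  = 0;
    ELEM: for my $x (@$smallRef) {
        my ( $lo, $hi ) = ( 0, $#sorted );
        while ( $lo <= $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            if    ( $sorted[$mid] < $x ) { $lo = $mid + 1 }
            elsif ( $sorted[$mid] > $x ) { $hi = $mid - 1 }
            else                         { $count++; next ELEM }
        }
    }
    return $count;
}

my $i = intersect_size( [ 2, 9, 4 ], [ 1 .. 1000 ] );
print "intersection=$i\n";    # prints 3; union = l + r - i for unique lists
```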

Replies are listed 'Best First'.
Re^2: Fast, Efficient Union and Intersection on arrays
by moritz (Cardinal) on Nov 20, 2008 at 23:29 UTC
    All the algorithms presented above have theoretically a running time of O(m+n) where m,n are the sizes of the lists.

    Not true. Since BrowserUK's algorithm needs to allocate and initialize a string with max(@a, @b) bits, it has O(n + m + max(@a) + max(@b)) run time (you mentioned another problem that arises from large numbers, but not in terms of complexity).

Whilst I agree that the algorithm is limited to numerics (the triumph of the specific--the OP's "positive integers"--over the generic), I disagree with your big-O assessment.

It is perfectly feasible to pre-allocate the bitstrings to their largest possible size (2^32/8 = 512MB each), easily and relatively quickly achieved on a fully populated 32-bit machine. More importantly, it need only be done once, regardless of the size of the input arrays, so it is a constant factor, and constant factors are disregarded by big-O.

      Of course, allocating huge chunks of ram for small input arrays means that the constant factor would be proportionally quite large, but that's the limitation of big-O.

Conversely, the hashes used by the other algorithms grow by a process of repeated reallocation in geometrically increasing chunks, which means the time cost grows geometrically with the size of the input arrays. In addition, the grep solutions have to build stack-bound lists from each of the arrays.

So, if you are going to try to account for the memory allocation in your assessment of the algorithms, then you would have to come up with something like:

O( M+N + SIGMA( 2^3+2^4+...+2^n>N ) + SIGMA( 2^3+...+2^m>M ) + MIN(N,M) )
         memory for 1st hash          memory for 2nd hash     memory for 1 array*

## (*assuming you chose the smallest array to use with the 2 hash/1 list method)

      or

O( M+N + SIGMA( 2^3+2^4+...+2^n>N ) + N + M )
         memory for smallest hash     memory for the two arrays

## for the 1 hash/2 arrays method.

All of which reinforces my stance that big-O is rarely done properly outside of academic papers, and is of little use unless it is. A benchmark is more instructive. This run uses arrays of 1 million elements apiece, with the numbers in the range 0 .. 2^27:

C:\test>unionIntersect.pl -N=1e6 -MAX=27
        s/iter barvin    roy    buk   buk2
barvin    8.58     --   -43%   -58%   -64%
roy       4.86    77%     --   -26%   -37%
buk       3.59   139%    35%     --   -15%
buk2      3.05   182%    60%    18%     --

      I couldn't go higher than 2^27 because of the need to have enough memory in the process to accommodate the large hashes and lists created by the other algorithms as well. Of course, you do have to go to quite large input arrays before that constant factor for allocating the bitstrings amortises:

C:\test>unionIntersect.pl -N=5e5 -MAX=27
        s/iter barvin    buk    roy   buk2
barvin    3.87     --   -28%   -40%   -50%
buk       2.80    39%     --   -17%   -30%
roy       2.33    66%    20%     --   -16%
buk2      1.95    98%    43%    19%     --

BTW: buk2 simply uses RoyJohnstone's trick of counting the bits for only one of the results and deriving the other number with arithmetic:

sub buk2 {
    my( $aRef, $bRef ) = @_;
    my( $aBits, $bBits );
    for ( $aBits, $bBits ) {
        local $\;
        open my $ram, '>', \$_;
        seek $ram, $MAX-1, 0;
        print $ram chr(0);
        close $ram;
    }
    vec( $aBits, $_, 1 ) = 1 for @$aRef;
    vec( $bBits, $_, 1 ) = 1 for @$bRef;

    my $iCount = unpack '%32b*', ( $aBits & $bBits );
    my $uCount = @$aRef + @$bRef - $iCount;

    $aBits = $bBits = '';
    print 'buk2: ', $iCount, ' : ', $uCount if $SHOW;
    return( $iCount, $uCount );
}

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
I will be quick because it is 4am here. I was referring to running time for the big-O complexity. Space complexity is a different measure, but even if we do count it, the complexity of the algorithms is still O(m+n).

This is because the key idea in the algorithms is that we traverse the 1st array, create a hash, and then test the hash against the 2nd array, which for arrays with m and n elements is p*m+q*n, with p,q constants, or better, O(m+n). If we were not to use hashes, the running time would be O(m*n), because we would have 2 nested loops. Also, the space allocations are not ~2^n but ~log N, which is ignored in the O(m+n) notation.
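The two-pass idea described above, as a minimal sketch (counts only; duplicate-free input assumed, names are mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Traverse the first array once to build a hash, then probe it once
# per element of the second array: p*m + q*n work, i.e. O(m+n).
my @a = qw( apple pear plum );      # works for non-numeric data too
my @b = qw( pear plum quince );

my %seen   = map { $_ => 1 } @a;     # one pass over @a
my $iCount = grep { $seen{$_} } @b;  # one pass over @b
my $uCount = @a + @b - $iCount;
print "intersection=$iCount union=$uCount\n";   # prints 2 and 4
```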

I will strongly disagree with your statement about big-O notation being done properly only in academic papers. The notation does not tell you anything about the actual running time of the algorithm, but it is a hint at how the running time scales with the input. Real benchmarks are most important, but they do not replace mathematical analysis; they supplement it. In simple terms, O(m*n) indicates 2 nested loops, while O(m+n) indicates 2 independent loops. That's it. There is no reference to actual time.

Complexity theory is a real treasure when we can apply it properly. Actually, your method is a variation of a Bloom filter, which is known for being very fast. To overcome the O(m+n), a special structure like a rooted tree (heap) is needed, because there is no better solution otherwise. This paper (PDF) is one of the many proofs that the union problem is linear in the input.
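To connect the two ideas: hashing each element to a bit offset in a fixed-size table gives a Bloom-filter-style variant of the bitstring method that also handles strings and huge numbers, at the price of possible false positives. A sketch (my own illustration, not code from this thread; a single hash function, so the false-positive rate is comparatively high):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Bloom-filter-style variant: map any scalar (string or huge number)
# to a bit offset in a fixed 2**20-bit table via a hash function.
# Counts become approximate: collisions can overstate the intersection.
my $BITS = 2**20;

sub set_bits {
    my ($listRef) = @_;
    my $bits = "\0" x ( $BITS / 8 );             # pre-allocate once
    for (@$listRef) {
        my $off = unpack( 'N', md5($_) ) % $BITS;
        vec( $bits, $off, 1 ) = 1;
    }
    return $bits;
}

my @a = ( 'apple', 10_000_000_000, 'plum' );
my @b = ( 'plum', 'quince', 10_000_000_000 );
my $iCount = unpack '%32b*', ( set_bits( \@a ) & set_bits( \@b ) );
print "approx intersection=$iCount\n";           # 2, barring collisions
```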

Finally, the new version buk2() may be faster, but it drops a key advantage of the previous one: the ability to handle duplicate array elements. Still, your algorithm is the killer one and I haven't thought of anything better.

That's all for now. I am going to get some sleep and come back tomorrow, hopefully with some code on the problem. Till then I will use big-O notation to count the sheep!
It is perfectly feasible to pre-allocate the bitstrings to their largest possible size

        That's cheating, in some sense.

If you consider all the limitations of real machines, you will find that perl can only ever handle a fixed, finite amount of RAM (if the pointers are 64 bit, it's 2**64 bytes). Every list is limited by that, so every one of the mentioned algorithms runs in O(1). Just with a very big constant.

That's why you usually rely on the model of an ideal machine (which works with bigints and arbitrarily much RAM) for run time analysis.
