Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

Partition in to 63 parts

by gam3 (Curate)
on Jan 17, 2010 at 21:43 UTC ( #817901=note: print w/replies, xml ) Need Help??

in reply to Need a faster way to find matches

The secret is that there are only 63 bits and that any number with a certain bit set can not 'match' any number with that same bit set.

So all you need to do is get a list of all the elements with a bit set and compare them with all of the numbers with out that bit set.

By using a hash in place of the list we can do this pretty efficiently.

  1. Create the hash from the list.
  2. get a list of all the numbers with the 2**1 bit set.
  3. remove those numbers from the hash.
  4. compare them against the numbers in the hash
  5. repeat for 2**2 through 2**63
You do need to fix up the generated hash afterwards, as it will on have half of the keys in it.
use strict; use vars qw (@LongListOfIntegers %bighash %MatchedIntegers); my $x = 20; push( @LongListOfIntegers, int( rand( 2**8 ) ) ) while ( $x-- ); $bighash{$_} = 0 for @LongListOfIntegers; for my $bit (1..63) { my $n = 2 ** $bit; my @slist = grep( { ($_ & $n) } keys %bighash); delete $bighash{$_} for @slist; my $x = 0; for my $i (@slist) { for my $j (keys %bighash) { $MatchedIntegers{$i}{$j}++ if ( ( $i & $j ) == 1 ); $x++ if ( ( $i & $j) == 1 ); } } } normalize(\%MatchedIntegers); use Data::Dumper; print Dumper \%MatchedIntegers; sub normalize { my $hash = shift; for my $o (keys %{$hash}) { for my $i (keys %{$hash->{$o}}) { $hash->{$i}{$o}++; } } }
note: It is also easy to split the problem up into 63 problems and use POE or some such to run each on a separate processor -- if you have 63 processors on your computer.
-- gam3
A picture is worth a thousand words, but takes 200K.

Replies are listed 'Best First'.
Re: Partition in to 63 parts
by remzak (Acolyte) on Jan 18, 2010 at 00:59 UTC
    Partitioning the problem is a really cool idea, especially since it shows how to easily split up the processing (if only I could). Also I appreciate your code. When I plugged it into my program, it wasn't faster than my starting point. I am guessing it would be faster if my list was larger than 10,000, or if I had a computer with many processors running the algorithm in parallel. At the current size of my list (slightly less than 4,000) the current optimization is the fastest. Just as an FYI, with all the stimulating ideas that were shared, I managed to improve the speed by more than 47 percent.
      Yes it's only real advantage is that you could stick it in POE and use all your cores.

      If the data is not random you might find some advantage in removing the bit that is set the most first.

      If the input data has some bits set more than 1/63 of the time you can get some advantage. You can see this in that if all of the data had a bit set then this algorithm would remove all the entries from the list on the first pass if it checked that bit first.

      But I assume your data is random.

      -- gam3
      A picture is worth a thousand words, but takes 200K.
        The data isn't random... the numbers come from another algorithm. Finding the pairs is the middle step of a three step algorithm.

        BrowserUK provided a great example of using threads. When I tried your approach, I used too many threads and learned how much overhead threads require. I'll try to merge his partioning approach with my code... I'm finding that after six threads, the performance degrades (on my machine).

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://817901]
[LanX]: what?
[LanX]: $; exists, so does the glob
[GotToBTru]: $; is "useful for emulating multi-dimensional arrays in Perl4"
[LanX]: i.e. &; is parsable
[LanX]: it's sometimes useful in P5 too

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (7)
As of 2017-01-18 21:04 GMT
Find Nodes?
    Voting Booth?
    Do you watch meteor showers?

    Results (164 votes). Check out past polls.