Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Compact and sparse bit vector

by diotalevi (Canon)
on Dec 27, 2008 at 21:37 UTC ( #732843=CUFP: print w/replies, xml ) Need Help??

In a reply to Netflix (or on handling large amounts of data efficiently in perl), I suggested that Judy::1/Judy1(3) was a sparse bit vector which might be interesting as a replacement for vec(). Perl's built-in vec can also be used as a bit vector but it's not sparse - all bits between the zeroth to the highest ever set are instantiated inside a perl string. This is only convenient if your bit vector isn't too big in ram.

Example code

A simple perl bit vector

use constant MEGABYTE => 2 ** 20; my $vector = ''; # 5,000 bits. Set every 100th bit between 15 million and 20 million. for ( 15_000_000 .. 20_000_000 ) { if ( ! ( $_ % 100 ) ) { vec( $vector, $_, 1 ) = 1; } } printf "%0.1fM\n", length( $vector ) / MEGABYTE;

A simple Judy::1 bit vector

use constant KILOBYTE => 2 ** 10; use Judy::1 qw( Set MemUsed ); my $vec; # 5,000 bits. Set every 100th bit between 15 million and 20 million. for ( 15_000_000 .. 20_000_000 ) { if ( ! ( $_ % 100 ) ) { Set( $judy, $_ ); } } printf "%0.1fK\n", MemUsed( $judy ) / KILOBYTE;

Memory "benchmarks"

Tests using the entire range 0..20,000,000
Every bit2.3M618K
Tests using range 15,000,000..20,000,000
Every bit2.3M162K
Every 10th bit2.3M772K
Every 100th bit2.3M158K
Every 1000th bit2.3M65K

⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Replies are listed 'Best First'.
Re: Compact and sparse bit vector
by demerphq (Chancellor) on Jan 03, 2009 at 13:35 UTC

    The only thing is a lot of the time people use vec() specifically because they want the string to be a bit-vector. They dont want to use it as an array of <8 bit units of information so much, as a way to individually control bits in a string with a reasonably syntax.

    Although having said that, the problem with vec() in this form is that it doesnt use native byte order, so you cant normally pass it directly into a C routine. So i guess your proposal has some merit...


      Actually, the main use for bitvectors in the NetFlix challenge. is that you can do very fast relational math.

      So performing the select to find all the people who have rated all the same movies (and the movie we are trying to generate a rating for), that we already have a rating for by the person we are trying to generate a rating for, is just a tight loop over the dataset performing bitstring boolean operations which are very fast (in perl).

      Doing the same operations (AND/OR/XOR/etc. of every bit of one bitstring against every bit of another bitstring for every customer in the DB) using Judy arrays, would not benefit at all from any of their special properties (cache line locality), and be extremely slow.

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Heh, I kind zeroed in on the expression "was a sparse bit vector which might be interesting as a replacement for vec()", and didnt even notice the netflix reference. :-)


        I've never really found myself doing vector operations of bitmaps against each other. I didn't notice that being a feature of the linked Netflix node. I do occasionally find it convenient to have a sparse bitmask. Usually I just use a hash of the stringified integer.

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://732843]
Approved by Corion
Front-paged by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (2)
As of 2022-06-27 12:13 GMT
Find Nodes?
    Voting Booth?
    My most frequent journeys are powered by:

    Results (88 votes). Check out past polls.