http://www.perlmonks.org?node_id=11107846

Amendil has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I'm working on a tsv, one of its columns is a csv list of keywords (28 unique values). I'd like to compute the Jaccard Index (Intersection / Union) of this list of keywords. To do so efficiently I'd like to use a bit array to represent the list of keywords.

I tried to read few articles on Perlmonks and stackoverflow, but so far I feel I'm missing something completely obvious.

Here is what I wrote:

```use common::sense;

my \$a = '';
my \$b = '';

\$a += 1 << 0;
\$a += 1 << 1;

\$b += 1 << 1;
\$b += 1 << 2;

my \$i = \$a & \$b;
my \$u = \$a | \$b;

my \$i_cnt = unpack '%32b*', \$i;
my \$u_cnt = unpack '%32b*', \$u;

printf "a is %#032b  %d\n", \$a, \$a;
printf "b is %#032b  %d\n", \$b, \$b;
printf "intersection is %#032b  %d\n", \$i, \$i;
printf "union is %#032b  %d\n", \$u, \$u;
say "set bit count in intersection: \$i_cnt";
say "set bit count in union: \$u_cnt";

Actual result:

```a is 0b000000000000000000000000000011  3
b is 0b000000000000000000000000000110  6
intersection is 0b000000000000000000000000000010  2
union is 0b000000000000000000000000000111  7
set bit count in intersection: 3
set bit count in union: 5

Expected result:

```a is 0b000000000000000000000000000011  3
b is 0b000000000000000000000000000110  6
intersection is 0b000000000000000000000000000010  2
union is 0b000000000000000000000000000111  7
set bit count in intersection: 1
set bit count in union: 3

Replies are listed 'Best First'.
Re: bit array comparison
by rjt (Curate) on Oct 22, 2019 at 16:32 UTC

So, I had some fun with this one. There are two things you can optimize for here: space, and time. Let's look at both.

## Space

You seem to want to optimize for space, so let's get that out of the way first with Devel::Size:

@set1 is an array of three keywords.

```Sizes:
\@set1:  112 bytes,  364 bytes total
set_to_bits(@set1):   24 bytes,   24 bytes total
set_to_string(@set1):   44 bytes,   44 bytes total

This size doesn't matter unless you are storing the list of keywords for all rows at once, rather than processing one row at a time, or at least just storing the Jaccard similarity (which as you know is a real number 0 < n < 1).

## Time

I've considered three options: bitwise operations, string operations, and CPAN module Set::Similarity::Jaccard. Of those, the results are interesting:

```Performance:
Rate \$jac->sim sim only jaccard_string jaccard_bits
\$jac->sim      63480/s        --     -21%           -62%         -80%
sim only       79868/s       26%       --           -53%         -75%
jaccard_hash  168300/s      165%     111%             --         -47%
jaccard_bits  318317/s      401%     299%            89%           --

"sim only" uses pre-computed arguments to \$jac->similarity, so that number is pretty much the maximum speed one could expect from Set::Similarity::Jaccard. "\$jac->sim" computes the string representation of both sets every time, so it's a little slower.

"jaccard_hash" and "jaccard_bits" are my own implementations which work directly with hashes or bitwise representations of each set, respectively.

Whether any of this is worth it or not is a question I can't answer for you. This code could still be optimized further, but I've already had all the fun I have time for today. :-)

Full code (~100 lines) is below the <readmore>. Error checking is up to you.

use strict; use warnings; omitted for brevity.

Very interesting!

To be honest, I'm not really interested in space.
The tsv I'm playing with has about 500k lines, and I can already run my algorithm with everything in ram.

It's just that my naive implementation is slow, and I want to see how fast it can be. I know that I lost a lot of performance when I started comparing list of strings.
I just expect bit array to be much much faster where applicable. To confirm that, I'll be able to compare with other similar columns that cannot be turned into bit array.

Thank you for making me learn a bit more about Perl today.

edit:

To reproduce what you did for Jaccard module, calling set_to_string first, I also did the same for your implementation.
I makes sense in my actual context as I first read the whole file, cleaning rows, etc. so creating the bit array will be part of the process.

What's interesting is that it doubled the performance hash: 340k/s -> 420k/s, bits: 800k/s -> 1.7m/s
Also, using `my \$ic = unpack '%b*', pack 'N', \$int;` provides slightly better performance (+1% consistently) that your implementation with Hamming weighs.
Now, I need to read a bit about Hamming weighs. Working with bits is always interesting, but it feels kind of a cipher.

Oh yes, there's still more room for optimization if you need it. If you can't get the speed you need out of pure Perl, you can also use Inline::C or XS for maximum speed. If the Pure perl solution is good enough though, I'd stick with that. I love benchmarking; results still find a way to surprise me.

Re: bit array comparison
by ikegami (Pope) on Oct 22, 2019 at 14:19 UTC

The problem is that unpack "b*" takes a string of bytes.

And so does unpack "B*", what you'd use to get the bits in the same order as sprintf "%b".

```\$ perl -e'CORE::say unpack "B*", pack "L>", 3'
00000000000000000000000000000011

Tip: you should use | rather than + to set a bit (in case it's already set).

```\$b |= 1 << 2;
Re: bit array comparison
by tybalt89 (Prior) on Oct 22, 2019 at 16:50 UTC
```#!/usr/bin/perl

use common::sense; # https://perlmonks.org/?node_id=11107846

my \$a = 0;
my \$b = 0;

\$a |= 1 << 0;
\$a |= 1 << 1;

\$b |= 1 << 1;
\$b |= 1 << 2;

my \$i = \$a & \$b;
my \$u = \$a | \$b;

my \$i_cnt = unpack '%32b*', pack 'N', \$i;
my \$u_cnt = unpack '%32b*', pack 'N', \$u;

printf "a is %#032b  %d\n", \$a, \$a;
printf "b is %#032b  %d\n", \$b, \$b;
printf "intersection is %#032b  %d\n", \$i, \$i;
printf "union is %#032b  %d\n", \$u, \$u;
say "set bit count in intersection: \$i_cnt";
say "set bit count in union: \$u_cnt";

Oh, we can also use pack that way, so that it formats it just as unpack wants it.

Thank you, I know understand more the different example I could see elsewhere.

Re: bit array comparison
by rsFalse (Hermit) on Oct 22, 2019 at 14:08 UTC
Hello, Amendil,

It looks like perl's unpack interpreted \$i and \$u as ASCII chars:
```perl -le 'printf "[%b]", ord for 2, 7'
OUTPUT:
```[110010][110111]

Ok. unpack expects a string, but from other example I thought it would work.

My ref were: https://www.perlmonks.org/?node_id=1015564 and https://www.oreilly.com/library/view/mastering-perl/9780596527242/ch16.html

I will bench the perf of the following.

```use common::sense;

my \$a = 0;
my \$b = 0;

\$a += 1 << 0;
\$a += 1 << 1;

\$b += 1 << 1;
\$b += 1 << 2;

my \$i = \$a & \$b;
my \$u = \$a | \$b;

my \$i_cnt = () = sprintf("%b", \$i) =~ /1/g;
my \$u_cnt = () = sprintf("%b", \$u) =~ /1/g;

printf "a is %b  %d\n", \$a, \$a;
printf "b is %b  %d\n", \$b, \$b;
printf "intersection is %b  %d\n", \$i, \$i;
printf "union is %b  %d\n", \$u, \$u;
say "set bit count in intersection: \$i_cnt";
say "set bit count in union: \$u_cnt";
```my \$i_cnt = () = sprintf("%b", \$i) =~ /1/g;
```my \$i_cnt = sprintf('%b', \$i) =~ tr/1//;