in reply to bit array comparison

So, I had some fun with this one. There are two things you can optimize for here: space, and time. Let's look at both.


You seem to want to optimize for space, so let's get that out of the way first with Devel::Size:

@set1 is an array of three keywords.

Sizes: \@set1: 112 bytes, 364 bytes total set_to_bits(@set1): 24 bytes, 24 bytes total set_to_string(@set1): 44 bytes, 44 bytes total

This size doesn't matter unless you are storing the list of keywords for all rows at once, rather than processing one row at a time, or at least just storing the Jaccard similarity (which as you know is a real number 0 < n < 1).


I've considered three options: bitwise operations, string operations, and CPAN module Set::Similarity::Jaccard. Of those, the results are interesting:

Performance: Rate $jac->sim sim only jaccard_string jaccard_bits $jac->sim 63480/s -- -21% -62% -80% sim only 79868/s 26% -- -53% -75% jaccard_hash 168300/s 165% 111% -- -47% jaccard_bits 318317/s 401% 299% 89% --

"sim only" uses pre-computed arguments to $jac->similarity, so that number is pretty much the maximum speed one could expect from Set::Similarity::Jaccard. "$jac->sim" computes the string representation of both sets every time, so it's a little slower.

"jaccard_hash" and "jaccard_bits" are my own implementations which work directly with hashes or bitwise representations of each set, respectively.

Whether any of this is worth it or not is a question I can't answer for you. This code could still be optimized further, but I've already had all the fun I have time for today. :-)

Full code (~100 lines) is below the <readmore>. Error checking is up to you.

use 5.026; use strict; use warnings; use Benchmark qw/cmpthese/; use Devel::Size qw/size total_size/; use Set::Similarity::Jaccard; my @all = qw< alpha bravo charlie delta echo foxtrot >; my %mask; $mask{ $all[$_] } = 2 << $_ for 0..$#all; my %first; $first{$_} = substr $_, 0, 1 for @all; # For set_to_string my @set1 = qw< alpha charlie delta >; my @set2 = qw< alpha delta echo foxtrot >; # Convert a list to bits. e.g. qw<alpha charlie echo foxtrot> : 0b1010 +11 sub set_to_bits { my $bits = 0; $bits |= $mask{$_} for @_; $bits; } # Convert a list to a string representation. Note we can't just work w +ith the # full keywords here, as Set::Similarity::Jaccard would look at the ke +ywords # character-wise, which would give an incorrect result. # e.g.: qw<alpha charlie echo> : 'ace' sub set_to_string { join '', map { $first{$_} } @_; } # Compute Jaccard similarity of two sets, using hashes only sub jaccard_hash { my ($set1, $set2) = @_; my %set1 = map { $_ => 1 } @$set1; my %set2 = map { $_ => 1 } @$set2; my $int_count = 0; $int_count += 1 for grep { $set1{$_} && $set2{$_} } keys %set1; my $uni_count = 0; $uni_count += 1 for grep { $set1{$_} || $set2{$_} } @all; $uni_count ? $int_count/$uni_count : 1; } # Compute Jaccard similarity of two sets by converting to bits first ( +inlined) sub jaccard_bits { my ($set1, $set2) = @_; my ($bits1, $bits2) = (0,0); # Inline set_to_bits for speed $bits1 |= $mask{$_} for @$set1; $bits2 |= $mask{$_} for @$set2; # Calculate intersection and union my $int = $bits1 & $bits2; my $uni = $bits1 | $bits2; # Compute Hamming weights of $int and $uni to get # of bits set (i +nlined) my $ic = $int; $ic -= (($ic >> 1) & 0x55555555); $ic = ($ic & 0x33333333) + (($ic >> 2) & 0x33333333); $ic = ((($ic + ($ic >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24; my $uc = $uni; $uc -= (($uc >> 1) & 0x55555555); $uc = ($uc & 0x33333333) + (($uc >> 2) & 0x33333333); $uc = ((($uc + ($uc >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24; $uc ? $ic/$uc : 1; } # Verification my $jac = Set::Similarity::Jaccard->new; my $sim = $jac->similarity(set_to_string(@set1), set_to_string(@set2)) +; say "CPAN Jaccard: $sim"; say "Bits Jaccard: " . jaccard_bits(\@set1, \@set2); say "Hash Jaccard: " . jaccard_hash(\@set1, \@set2); printf "\nSizes:\n"; printf " %20s: %4d bytes, %4d bytes total\n", $_, eval "size($_)", eval "total_size($_)" for qw< \@set1 set_to_bits(@set1) set_to_string(@set1) >; printf "\nPerformance:\n"; my $str1 = set_to_string(@set1); my $str2 = set_to_string(@set2); my $r; cmpthese(-5, { 'similarity only' => sub { $r = Set::Similarity::Jaccard->new->similarity($str1, $str2) }, '$jac->similarity' => sub { $r = Set::Similarity::Jaccard->new->similarity(set_to_string(@set1) +, set_to_string(@set2) +) }, 'jaccard_bits' => sub { $r = jaccard_bits(\@set1, \@set2) }, 'jaccard_hash' => sub { $r = jaccard_hash(\@set1, \@set2) }, });
use strict; use warnings; omitted for brevity.

Replies are listed 'Best First'.
Re^2: bit array comparison
by Amendil (Novice) on Oct 22, 2019 at 17:46 UTC

    Very interesting!

    To be honest, I'm not really interested in space.
    The tsv I'm playing with has about 500k lines, and I can already run my algorithm with everything in ram.

    It's just that my naive implementation is slow, and I want to see how fast it can be. I know that I lost a lot of performance when I started comparing list of strings.
    I just expect bit array to be much much faster where applicable. To confirm that, I'll be able to compare with other similar columns that cannot be turned into bit array.

    Thank you for making me learn a bit more about Perl today.


    To reproduce what you did for Jaccard module, calling set_to_string first, I also did the same for your implementation.
    I makes sense in my actual context as I first read the whole file, cleaning rows, etc. so creating the bit array will be part of the process.

    What's interesting is that it doubled the performance hash: 340k/s -> 420k/s, bits: 800k/s -> 1.7m/s
    Also, using `my $ic = unpack '%b*', pack 'N', $int;` provides slightly better performance (+1% consistently) that your implementation with Hamming weighs.
    Now, I need to read a bit about Hamming weighs. Working with bits is always interesting, but it feels kind of a cipher.

      Oh yes, there's still more room for optimization if you need it. If you can't get the speed you need out of pure Perl, you can also use Inline::C or XS for maximum speed. If the Pure perl solution is good enough though, I'd stick with that. I love benchmarking; results still find a way to surprise me.