Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

So, I had some fun with this one. There are two things you can optimize for here: space, and time. Let's look at both.

Space

You seem to want to optimize for space, so let's get that out of the way first with Devel::Size:

@set1 is an array of three keywords.

Sizes: \@set1: 112 bytes, 364 bytes total set_to_bits(@set1): 24 bytes, 24 bytes total set_to_string(@set1): 44 bytes, 44 bytes total

This size doesn't matter unless you are storing the list of keywords for all rows at once, rather than processing one row at a time, or at least just storing the Jaccard similarity (which as you know is a real number 0 < n < 1).

Time

I've considered three options: bitwise operations, string operations, and CPAN module Set::Similarity::Jaccard. Of those, the results are interesting:

Performance: Rate $jac->sim sim only jaccard_string jaccard_bits $jac->sim 63480/s -- -21% -62% -80% sim only 79868/s 26% -- -53% -75% jaccard_hash 168300/s 165% 111% -- -47% jaccard_bits 318317/s 401% 299% 89% --

"sim only" uses pre-computed arguments to $jac->similarity, so that number is pretty much the maximum speed one could expect from Set::Similarity::Jaccard. "$jac->sim" computes the string representation of both sets every time, so it's a little slower.

"jaccard_hash" and "jaccard_bits" are my own implementations which work directly with hashes or bitwise representations of each set, respectively.

Whether any of this is worth it or not is a question I can't answer for you. This code could still be optimized further, but I've already had all the fun I have time for today. :-)

Full code (~100 lines) is below the <readmore>. Error checking is up to you.

use 5.026; use strict; use warnings; use Benchmark qw/cmpthese/; use Devel::Size qw/size total_size/; use Set::Similarity::Jaccard; my @all = qw< alpha bravo charlie delta echo foxtrot >; my %mask; $mask{ $all[$_] } = 2 << $_ for 0..$#all; my %first; $first{$_} = substr $_, 0, 1 for @all; # For set_to_string my @set1 = qw< alpha charlie delta >; my @set2 = qw< alpha delta echo foxtrot >; # Convert a list to bits. e.g. qw<alpha charlie echo foxtrot> : 0b1010 +11 sub set_to_bits { my $bits = 0; $bits |= $mask{$_} for @_; $bits; } # Convert a list to a string representation. Note we can't just work w +ith the # full keywords here, as Set::Similarity::Jaccard would look at the ke +ywords # character-wise, which would give an incorrect result. # e.g.: qw<alpha charlie echo> : 'ace' sub set_to_string { join '', map { $first{$_} } @_; } # Compute Jaccard similarity of two sets, using hashes only sub jaccard_hash { my ($set1, $set2) = @_; my %set1 = map { $_ => 1 } @$set1; my %set2 = map { $_ => 1 } @$set2; my $int_count = 0; $int_count += 1 for grep { $set1{$_} && $set2{$_} } keys %set1; my $uni_count = 0; $uni_count += 1 for grep { $set1{$_} || $set2{$_} } @all; $uni_count ? $int_count/$uni_count : 1; } # Compute Jaccard similarity of two sets by converting to bits first ( +inlined) sub jaccard_bits { my ($set1, $set2) = @_; my ($bits1, $bits2) = (0,0); # Inline set_to_bits for speed $bits1 |= $mask{$_} for @$set1; $bits2 |= $mask{$_} for @$set2; # Calculate intersection and union my $int = $bits1 & $bits2; my $uni = $bits1 | $bits2; # Compute Hamming weights of $int and $uni to get # of bits set (i +nlined) my $ic = $int; $ic -= (($ic >> 1) & 0x55555555); $ic = ($ic & 0x33333333) + (($ic >> 2) & 0x33333333); $ic = ((($ic + ($ic >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24; my $uc = $uni; $uc -= (($uc >> 1) & 0x55555555); $uc = ($uc & 0x33333333) + (($uc >> 2) & 0x33333333); $uc = ((($uc + ($uc >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24; $uc ? $ic/$uc : 1; } # Verification my $jac = Set::Similarity::Jaccard->new; my $sim = $jac->similarity(set_to_string(@set1), set_to_string(@set2)) +; say "CPAN Jaccard: $sim"; say "Bits Jaccard: " . jaccard_bits(\@set1, \@set2); say "Hash Jaccard: " . jaccard_hash(\@set1, \@set2); printf "\nSizes:\n"; printf " %20s: %4d bytes, %4d bytes total\n", $_, eval "size($_)", eval "total_size($_)" for qw< \@set1 set_to_bits(@set1) set_to_string(@set1) >; printf "\nPerformance:\n"; my $str1 = set_to_string(@set1); my $str2 = set_to_string(@set2); my $r; cmpthese(-5, { 'similarity only' => sub { $r = Set::Similarity::Jaccard->new->similarity($str1, $str2) }, '$jac->similarity' => sub { $r = Set::Similarity::Jaccard->new->similarity(set_to_string(@set1) +, set_to_string(@set2) +) }, 'jaccard_bits' => sub { $r = jaccard_bits(\@set1, \@set2) }, 'jaccard_hash' => sub { $r = jaccard_hash(\@set1, \@set2) }, });
use strict; use warnings; omitted for brevity.

In reply to Re: bit array comparison by rjt
in thread bit array comparison by Amendil

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chanting in the Monastery: (8)
    As of 2021-04-13 16:51 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      No recent polls found

      Notices?