http://www.perlmonks.org?node_id=500994


in reply to Hamming Distance Between 2 Strings - Fast(est) Way?

This code uses an XOR to compare the two strings. This is case sensitive and works when the two strings are the same length.
sub hd { return ($_[0] ^ $_[1]) =~ tr/\001-\255//; }

Replies are listed 'Best First'.
Re^2: Hamming Distance Between 2 Strings - Fast(est) Way?
by alpapan (Initiate) on Apr 22, 2011 at 23:47 UTC
    Hello I'm confused with the benchmark by Edward. The XOR code seems to be 3 times faster than the 'mine' using the following benchmark (shouldn't benchmark code always be published?)
    #!/usr/bin/perl -w use strict; my $s1 = 'AAAAA'; my $s2 = 'ATCAA'; for (my $i=0;$i<600000;$i++){ #choose one of the two methods #hd($s1,$s2); # real 0m1.401s hd2($s1,$s2); # real 0m0.405s } sub hd{ my ($k,$l) = @_; my $len = length ($k); my $num_mismatch = 0; for (my $i=0; $i<$len; $i++) { ++$num_mismatch if substr($k, $i, 1) ne substr($l, $i, 1); } return $num_mismatch; } sub hd2 { return ($_[0] ^ $_[1]) =~ tr/\001-\255//; }
      Duh! I misread Edwards' benchmark. missed that the benchmark was /s (instead of s) - shouldn't I be reading more carefully? :-)
        Here is something along the lines of what monkfan had probably written
        #!/usr/bin/perl -- use strict; use warnings; use Benchmark qw(cmpthese); our $data; print "\n$]\n"; { my $s1 = 'AAAAA'; my $s2 = 'ATCAA'; print join ' ', RoyJohnson500332($s1,$s2)," "; print join ' ', monkfan500235($s1,$s2)," "; print join ' ', BrowserUk500244($s1,$s2)," "; print join ' ', inman500994($s1,$s2)," "; print "\n"; } for my $range ( 5, 1_000 , 2_000 , 10_000 , 100_000 ){ $data = join '',map { ( qw' T A C G ' )[ $_ % 4 ] } 0 .. $range; my $s1 = $data.'AAAAA'; my $s2 = $data.'ATCAA'; print "## Length $range ", "##" x 11, "\n"; cmpthese (-3, { Mine => sub { monkfan500235($s1,$s2); return }, BUk => sub { BrowserUk500244($s1,$s2); return }, inman => sub { inman500994($s1,$s2); return }, RJ => sub { RoyJohnson500332($s1,$s2); return }, }); print "\n"; } sub monkfan500235 { my ($k,$l) = @_; my $len = length ($k); my $num_mismatch = 0; for (my $i=0; $i<$len; $i++) { ++$num_mismatch if substr($k, $i, 1) ne substr($l, $i, 1); } return $num_mismatch; } sub BrowserUk500244 { length( $_[ 0 ] ) - ( ( $_[ 0 ] ^ $_[ 1 ] ) =~ t +r[\0][\0] ) } sub RoyJohnson500332 { my ($k, $l) = @_; my $diff = $k ^ $l; my $num_mismatch = $diff =~ tr/\0//c; } sub inman500994 { return ($_[0] ^ $_[1]) =~ tr/\001-\255//; } __END__ 5.012002 2 2 2 2 ## Length 5 ###################### Rate Mine RJ BUk inman Mine 127809/s -- -78% -85% -87% RJ 580454/s 354% -- -33% -43% BUk 872298/s 582% 50% -- -14% inman 1012310/s 692% 74% 16% -- ## Length 1000 ###################### Rate Mine RJ BUk inman Mine 1862/s -- -99% -99% -99% RJ 176683/s 9389% -- -18% -21% BUk 216727/s 11540% 23% -- -4% inman 224899/s 11979% 27% 4% -- ## Length 2000 ###################### Rate Mine RJ BUk inman Mine 934/s -- -99% -99% -99% RJ 100661/s 10677% -- -17% -19% BUk 120656/s 12818% 20% -- -2% inman 123544/s 13127% 23% 2% -- ## Length 10000 ###################### Rate Mine RJ inman BUk Mine 187/s -- -99% -99% -99% RJ 22959/s 12181% -- -16% -18% inman 27435/s 14575% 19% -- -2% BUk 27976/s 14865% 22% 2% -- ## Length 100000 ###################### Rate Mine RJ inman BUk Mine 18.7/s -- -99% -99% -99% RJ 2085/s 11031% -- -21% -22% inman 2651/s 14054% 27% -- -1% BUk 2667/s 14136% 28% 1% --
      Sorry this is probably a really dumb question...but.. what is the ^ character doing in this subfunction? I know in regex it means match the first... and its used as a way to negate character class when between brackets...but I cant find its use like shown above.

        ^ in this case is bitwise-OR.

        When you bitwise-OR two strings, where the bytes in the two strings are the same, the byte in the resultant string will be chr(0); when they are different, the results will be non-chr(0). This allows you to count the number of differences in two strings very quickly:

        $seq1 = 'ACGTACGTACGTACGT';; $seq2 = 'TCGATCGATCGATCGA';; print ( $seq1 ^ $seq2 );; § §§ §§ §§ § print length( $seq1) - ( $seq1 ^ $seq2 ) =~ tr[\0][\0];; 8

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.