in reply to Re: simple string comparison for efficiency
in thread simple string comparison for efficiency

Using XOR and a lookup for this in C code it kinda pointless.

The reason using the XOR the technique works relatively quickly in Perl, is because accessing strings character by character in Perl is slow as it involves a function call (substr) for every character. And subroutine calls (even built-ins; they're both opcodes), are pretty slow in Perl.

My bitwise XOR + tr technique allows you to compare all the bytes in the string using a single opcode. And do so in a way that leaves behind a detectable signature at non-matching byte positions (which other comparision opcodes do not do), which allows you to find where the differences are. Or just count the differences with another single opcode.

But in C, accessing the bytes of a string is cheap. So, directly comparing the bytes against each other, and each against 'N', in a compound test with short-circuiting boolean operators:

if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' ) {

will be quicker than xoring the bytes and then checking the results against a lookup table:

if (c_map[ str1[i] ^ str2[i] ]) {

A simple C implementation:

int seqDiff ( SV *s1, SV *s2 ) { STRLEN bytes1, bytes2; char *p1, *p2; int i; p1 = SvPV( s1, bytes1 ); p2 = SvPV( s2, bytes2 ); if( bytes1 != bytes2 ) return 0; for( i = 0; i < bytes1; ++i ) { if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' ) { return 0; } } return 1; }

runs substantially faster than your xor + lookup especially on short strings:

C:\test>SeqDiffDC_mark_1 comparing strings of length 240003 Rate bitmask inline buk bitmask 50.8/s -- -92% -95% inline 634/s 1146% -- -34% buk 967/s 1802% 53% -- benchmark results expected C:\test>SeqDiffDC_mark_1 -LEN=10 comparing strings of length 243 Rate bitmask inline buk bitmask 25730/s -- -65% -93% inline 72894/s 183% -- -81% buk 393465/s 1429% 440% -- benchmark results expected

However, we can do even better, especially on the long strings. On 32 & 64-bit cpus, accessing and manipulating byte registers is relatively slow compared to manipulating natural register-sized entities. By comparing the strings in 4 or 8-byte chunks, and only comparing bytes when a mismatch is detected, you can achieve further substantial reductions in the number of comparisons made. Using 32-bit compares:

int seqDiff32 ( SV *s1, SV *s2 ) { STRLEN bytes1, bytes2; char *p1, *p2; STRLEN dwords; STRLEN lastBytes; U32 *up1, *up2; int i; p1 = SvPV( s1, bytes1 ); p2 = SvPV( s2, bytes2 ); if( bytes1 != bytes2 ) return 0; up1 = (U32*)p1; up2 = (U32*)p2; dwords = bytes1 >> 2; lastBytes = bytes1 % 4; p1 += dwords << 2; p2 += dwords << 2; while( dwords-- ) { if( *up1 != *up2 ) { for( i = 0; i < 4; ++i ) { char c1 = ( (char*)up1 )[ i ]; char c2 = ( (char*)up2 )[ i ]; if( c1 != c2 && c1 != 'N' && c2 != 'N' ) { return 0; } } } up1++; up2++; } for( i = 0; i < lastBytes; ++i ) { char c1 = ( (char*)p1 )[ i ]; char c2 = ( (char*)p2 )[ i ]; if( c1 != c2 && c1 != 'N' && c2 != 'N' ) { return 0; } } return 1; }
C:\test>SeqDiffDC_mark_1 -LEN=1 comparing strings of length 27 Rate bitmask inline buk64 buk buk32 bitmask 43724/s -- -42% -93% -93% -93% inline 75272/s 72% -- -88% -88% -88% buk64 613799/s 1304% 715% -- -1% -6% buk 620545/s 1319% 724% 1% -- -5% buk32 651674/s 1390% 766% 6% 5% -- C:\test>SeqDiffDC_mark_1 -LEN=10 comparing strings of length 243 Rate bitmask inline buk buk64 buk32 bitmask 25730/s -- -65% -93% -94% -95% inline 72894/s 183% -- -81% -84% -86% buk 393465/s 1429% 440% -- -12% -22% buk64 447289/s 1638% 514% 14% -- -11% buk32 504399/s 1860% 592% 28% 13% -- benchmark results expected comparing strings of length 240003 Rate bitmask inline buk buk64 buk32 bitmask 50.9/s -- -92% -95% -97% -98% inline 629/s 1136% -- -35% -59% -75% buk 962/s 1790% 53% -- -38% -62% buk64 1545/s 2935% 146% 61% -- -39% buk32 2539/s 4889% 304% 164% 64% -- benchmark results expected

I also tried a 64-bit version which intuition says should be faster still but that proved not to be so. AT first that confused me, but looking more closely at you test data I realised that the reason is that no single 8-byte compare will succeed--as your test strings have differences in every 8-byte unit: 'ATGCATCGATGN' vs. 'ATGCATNGATGN--and the 64-bit version has to compare 8 bytes for each failure rather than 4 with the 32-bit version. If I modify the strings so that they are substantially similar:

my $seq1 = 'ATGCATCGATGN' x $LEN; my $seq2 = 'ATGCATCGATGN' x $LEN;

then the 64-bit version wins hands down as expected:

C:\test>SeqDiffDC_mark_1 -LEN=1 comparing strings of length 27 Rate bitmask inline buk buk32 buk64 bitmask 43145/s -- -43% -93% -94% -94% inline 76220/s 77% -- -88% -89% -89% buk 651920/s 1411% 755% -- -2% -2% buk32 664615/s 1440% 772% 2% -- -1% buk64 668000/s 1448% 776% 2% 1% -- benchmark results expected C:\test>SeqDiffDC_mark_1 -LEN=10 comparing strings of length 243 Rate bitmask inline buk buk32 buk64 bitmask 25665/s -- -65% -94% -96% -96% inline 73346/s 186% -- -84% -88% -88% buk 450664/s 1656% 514% -- -23% -29% buk32 587556/s 2189% 701% 30% -- -7% buk64 632327/s 2364% 762% 40% 8% -- benchmark results expected C:\test>SeqDiffDC_mark_1 -LEN=100 comparing strings of length 2403 Rate bitmask inline buk buk32 buk64 bitmask 4723/s -- -87% -96% -99% -99% inline 37224/s 688% -- -66% -88% -91% buk 108924/s 2206% 193% -- -66% -75% buk32 321561/s 6709% 764% 195% -- -26% buk64 432770/s 9063% 1063% 297% 35% -- benchmark results expected C:\test>SeqDiffDC_mark_1 -LEN=1000 comparing strings of length 24003 Rate bitmask inline buk buk32 buk64 bitmask 506/s -- -91% -96% -99% -99% inline 5834/s 1054% -- -54% -90% -94% buk 12793/s 2431% 119% -- -77% -86% buk32 55949/s 10966% 859% 337% -- -41% buk64 94726/s 18637% 1524% 640% 69% -- benchmark results expected C:\test>SeqDiffDC_mark_1 -LEN=10000 comparing strings of length 240003 Rate bitmask inline buk buk32 buk64 bitmask 51.3/s -- -92% -96% -99% -100% inline 635/s 1139% -- -51% -90% -94% buk 1294/s 2425% 104% -- -79% -88% buk32 6087/s 11774% 858% 370% -- -45% buk64 11162/s 21673% 1657% 762% 83% -- benchmark results expected

/msg me if you want the modified version of your benchmark.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^3: simple string comparison for efficiency
by AnomalousMonk (Bishop) on Jun 03, 2009 at 09:42 UTC
    Most interesting...

    I realize I had naively (and unquestioningly) assumed that the look-up approach would allow one to effectively combine multiple operations into one, thus saving time. But now I see that
        p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N'
    will (assuming most characters are equal) boil down to something close to two indexed accesses on average, whereas
        c_map[ str1[i] ^ str2[i] ]
    will always involve three accesses. (I assume the costs of the  != and  ^ operations are about the same.)

    And that doesn't begin to address the fact that the look-up approach will not scale to register-wide operations!

    ... and enlightening.