in reply to simple string comparison for efficiency

I was interested in the problem presented and wanted to play around with Inline C code in Strawberry Perl, so I came up with the code I have posted on AnomalousMonk's scratchpad.

As one would expect from a C-coded solution, it is faster than GrandFather's pure-Perl bitwise string-differencing approach in Re: simple string comparison for efficiency: by about 1000% for strings of 24,000 bases and about 350% for strings of 240,000 bases (on my laptop; YMMV).

It uses an approach that, by happy accident, works for the base characters A, C, T and G and the don't-care character N. It can be extended a bit for other 'bases' (I know other characters are used to represent ambiguous bases), but sooner or later it will break down. There is a similar (as yet untested) approach I can think of that could reliably handle a much wider range of 'bases', but it would require a separate (though, I think, fairly fast) character-substitution encoding step to convert input files to a form amenable to differencing; a rough sketch of one such encoding follows.
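To give a rough idea of what I mean by an encoding step (this is only a guess at a workable scheme, not the code on my scratchpad): one could map each IUPAC base code to a 4-bit mask in a per-character substitution pass, after which two positions are compatible exactly when the byte-wise AND of their masks is non-zero.

use strict;
use warnings;

# Hypothetical encoding: A=1, C=2, G=4, T=8; ambiguity codes OR their bits
# together, N = all four.  After encoding, Perl's string-wise '&' plus tr///
# can count incompatible positions in a couple of opcodes.
my %mask = (
    A => 1,  C => 2,  G => 4,  T => 8,
    R => 5,  Y => 10, S => 6,  W => 9,  K => 12, M => 3,
    B => 14, D => 13, H => 11, V => 7,  N => 15,
);

sub encode {                     # the character-substitution encoding step
    my ($seq) = @_;
    return join '', map { chr( $mask{$_} // 0 ) } split //, uc $seq;
}

my $e1 = encode( 'ATGCATCGATGC' );
my $e2 = encode( 'ATGCATNGATGA' );

my $and          = $e1 & $e2;          # byte-wise AND of the two encodings
my $incompatible = $and =~ tr/\0//;    # a NUL byte means no base in common
print "$incompatible incompatible position(s)\n";   # prints 1 (final C vs A)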

Please take a look at the material on my scratchpad and, if you think it is of interest, I can post it in this thread or in a new thread as you think best.


Re^2: simple string comparison for efficiency
by BrowserUk (Pope) on Jun 03, 2009 at 07:16 UTC

    Using XOR and a lookup for this in C code is kinda pointless.

    The reason the XOR technique works relatively quickly in Perl is that accessing strings character by character in Perl is slow: it involves a function call (substr) for every character, and subroutine calls (even built-ins; they're both opcodes) are pretty slow in Perl.

    My bitwise XOR + tr technique allows you to compare all the bytes in the strings using a single opcode, and to do so in a way that leaves behind a detectable signature at non-matching byte positions (which other comparison opcodes do not), so you can find where the differences are, or simply count them with another single opcode.
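    For concreteness, the bare bones of that idea look something like this in Perl (just a sketch; it ignores the 'N' don't-care handling that the real code has to deal with):

    use strict;
    use warnings;

    my $s1 = 'ATGCATCGATGN';
    my $s2 = 'ATGCATNGATGN';

    my $diff  = $s1 ^ $s2;            # XOR every byte pair in one opcode;
                                      # matching positions become "\0"
    my $count = $diff =~ tr/\0//c;    # count the non-NUL bytes in one opcode

    my @where;                        # locate the mismatching positions
    push @where, pos( $diff ) - 1 while $diff =~ m/[^\0]/g;

    print "$count difference(s) at offset(s): @where\n";   # 1 difference at offset 6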

    But in C, accessing the bytes of a string is cheap. So, directly comparing the bytes against each other, and each against 'N', in a compound test with short-circuiting boolean operators:

    if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' ) {

    will be quicker than xoring the bytes and then checking the results against a lookup table:

    if (c_map[ str1[i] ^ str2[i] ]) {

    A simple C implementation:

    int seqDiff ( SV *s1, SV *s2 ) {
        STRLEN bytes1, bytes2;
        char *p1, *p2;
        int i;

        p1 = SvPV( s1, bytes1 );
        p2 = SvPV( s2, bytes2 );

        if( bytes1 != bytes2 ) return 0;

        for( i = 0; i < bytes1; ++i ) {
            if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' ) {
                return 0;
            }
        }
        return 1;
    }

    runs substantially faster than your XOR + lookup, especially on short strings:

    C:\test>SeqDiffDC_mark_1
    comparing strings of length 240003
              Rate bitmask  inline     buk
    bitmask 50.8/s      --    -92%    -95%
    inline   634/s   1146%      --    -34%
    buk      967/s   1802%     53%      --
    benchmark results expected

    C:\test>SeqDiffDC_mark_1 -LEN=10
    comparing strings of length 243
                Rate bitmask  inline     buk
    bitmask  25730/s      --    -65%    -93%
    inline   72894/s    183%      --    -81%
    buk     393465/s   1429%    440%      --
    benchmark results expected

    However, we can do even better, especially on the long strings. On 32- and 64-bit CPUs, accessing and manipulating byte registers is relatively slow compared to manipulating natural register-sized entities. By comparing the strings in 4- or 8-byte chunks, and only comparing individual bytes when a mismatch is detected, you can achieve further substantial reductions in the number of comparisons made. Using 32-bit compares:

    int seqDiff32 ( SV *s1, SV *s2 ) {
        STRLEN bytes1, bytes2;
        char *p1, *p2;
        STRLEN dwords;
        STRLEN lastBytes;
        U32 *up1, *up2;
        int i;

        p1 = SvPV( s1, bytes1 );
        p2 = SvPV( s2, bytes2 );

        if( bytes1 != bytes2 ) return 0;

        up1 = (U32*)p1;
        up2 = (U32*)p2;

        dwords    = bytes1 >> 2;            /* number of whole 32-bit chunks */
        lastBytes = bytes1 % 4;             /* 0..3 trailing bytes           */

        p1 += dwords << 2;                  /* byte pointers now address the */
        p2 += dwords << 2;                  /* trailing bytes, if any        */

        /* Compare 4 bytes at a time; drop to byte-level checks (and the
           'N' don't-care test) only when a 32-bit chunk mismatches.      */
        while( dwords-- ) {
            if( *up1 != *up2 ) {
                for( i = 0; i < 4; ++i ) {
                    char c1 = ( (char*)up1 )[ i ];
                    char c2 = ( (char*)up2 )[ i ];
                    if( c1 != c2 && c1 != 'N' && c2 != 'N' ) {
                        return 0;
                    }
                }
            }
            up1++;
            up2++;
        }

        /* Handle any trailing bytes byte-by-byte. */
        for( i = 0; i < lastBytes; ++i ) {
            char c1 = ( (char*)p1 )[ i ];
            char c2 = ( (char*)p2 )[ i ];
            if( c1 != c2 && c1 != 'N' && c2 != 'N' ) {
                return 0;
            }
        }
        return 1;
    }
    C:\test>SeqDiffDC_mark_1 -LEN=1
    comparing strings of length 27
                Rate bitmask  inline   buk64     buk   buk32
    bitmask  43724/s      --    -42%    -93%    -93%    -93%
    inline   75272/s     72%      --    -88%    -88%    -88%
    buk64   613799/s   1304%    715%      --     -1%     -6%
    buk     620545/s   1319%    724%      1%      --     -5%
    buk32   651674/s   1390%    766%      6%      5%      --

    C:\test>SeqDiffDC_mark_1 -LEN=10
    comparing strings of length 243
                Rate bitmask  inline     buk   buk64   buk32
    bitmask  25730/s      --    -65%    -93%    -94%    -95%
    inline   72894/s    183%      --    -81%    -84%    -86%
    buk     393465/s   1429%    440%      --    -12%    -22%
    buk64   447289/s   1638%    514%     14%      --    -11%
    buk32   504399/s   1860%    592%     28%     13%      --
    benchmark results expected

    comparing strings of length 240003
               Rate bitmask  inline     buk   buk64   buk32
    bitmask  50.9/s      --    -92%    -95%    -97%    -98%
    inline    629/s   1136%      --    -35%    -59%    -75%
    buk       962/s   1790%     53%      --    -38%    -62%
    buk64    1545/s   2935%    146%     61%      --    -39%
    buk32    2539/s   4889%    304%    164%     64%      --
    benchmark results expected

    I also tried a 64-bit version, which intuition says should be faster still, but that proved not to be so. At first that confused me, but looking more closely at your test data I realised the reason: no single 8-byte compare will ever succeed, because your test strings have differences in every 8-byte unit ('ATGCATCGATGN' vs. 'ATGCATNGATGN'), so the 64-bit version has to fall back to comparing 8 bytes for each failure rather than the 4 of the 32-bit version. If I modify the strings so that they are substantially similar:

    my $seq1 = 'ATGCATCGATGN' x $LEN;
    my $seq2 = 'ATGCATCGATGN' x $LEN;

    then the 64-bit version wins hands down as expected:

    C:\test>SeqDiffDC_mark_1 -LEN=1
    comparing strings of length 27
                Rate bitmask  inline     buk   buk32   buk64
    bitmask  43145/s      --    -43%    -93%    -94%    -94%
    inline   76220/s     77%      --    -88%    -89%    -89%
    buk     651920/s   1411%    755%      --     -2%     -2%
    buk32   664615/s   1440%    772%      2%      --     -1%
    buk64   668000/s   1448%    776%      2%      1%      --
    benchmark results expected

    C:\test>SeqDiffDC_mark_1 -LEN=10
    comparing strings of length 243
                Rate bitmask  inline     buk   buk32   buk64
    bitmask  25665/s      --    -65%    -94%    -96%    -96%
    inline   73346/s    186%      --    -84%    -88%    -88%
    buk     450664/s   1656%    514%      --    -23%    -29%
    buk32   587556/s   2189%    701%     30%      --     -7%
    buk64   632327/s   2364%    762%     40%      8%      --
    benchmark results expected

    C:\test>SeqDiffDC_mark_1 -LEN=100
    comparing strings of length 2403
                Rate bitmask  inline     buk   buk32   buk64
    bitmask   4723/s      --    -87%    -96%    -99%    -99%
    inline   37224/s    688%      --    -66%    -88%    -91%
    buk     108924/s   2206%    193%      --    -66%    -75%
    buk32   321561/s   6709%    764%    195%      --    -26%
    buk64   432770/s   9063%   1063%    297%     35%      --
    benchmark results expected

    C:\test>SeqDiffDC_mark_1 -LEN=1000
    comparing strings of length 24003
               Rate bitmask  inline     buk   buk32   buk64
    bitmask   506/s      --    -91%    -96%    -99%    -99%
    inline   5834/s   1054%      --    -54%    -90%    -94%
    buk     12793/s   2431%    119%      --    -77%    -86%
    buk32   55949/s  10966%    859%    337%      --    -41%
    buk64   94726/s  18637%   1524%    640%     69%      --
    benchmark results expected

    C:\test>SeqDiffDC_mark_1 -LEN=10000
    comparing strings of length 240003
               Rate bitmask  inline     buk   buk32   buk64
    bitmask  51.3/s      --    -92%    -96%    -99%   -100%
    inline    635/s   1139%      --    -51%    -90%    -94%
    buk      1294/s   2425%    104%      --    -79%    -88%
    buk32    6087/s  11774%    858%    370%      --    -45%
    buk64   11162/s  21673%   1657%    762%     83%      --
    benchmark results expected

    /msg me if you want the modified version of your benchmark.
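    (In case it helps anyone reproduce this: a minimal Inline::C harness for the simple seqDiff() above looks roughly like the following. This is a sketch of a stock Inline::C setup, not the benchmark script itself.)

    use strict;
    use warnings;

    # Bind the C function via Inline::C; Inline compiles and caches it on
    # first run, then maps the SV* arguments automatically.
    use Inline C => q{
        int seqDiff ( SV *s1, SV *s2 ) {
            STRLEN bytes1, bytes2, i;
            char *p1 = SvPV( s1, bytes1 );
            char *p2 = SvPV( s2, bytes2 );

            if( bytes1 != bytes2 ) return 0;

            for( i = 0; i < bytes1; ++i ) {
                if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' )
                    return 0;
            }
            return 1;
        }
    };

    print seqDiff( 'ATGCATCGATGN', 'ATGCATNGATGN' )
        ? "match (N treated as don't-care)\n"
        : "differ\n";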


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Most interesting...

      I realize I had naively (and unquestioningly) assumed that the look-up approach would allow one to effectively combine multiple operations into one, thus saving time. But now I see that
          p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N'
      will (assuming most characters are equal) boil down to something close to two indexed accesses on average, whereas
          c_map[ str1[i] ^ str2[i] ]
      will always involve three accesses. (I assume the costs of the != and ^ operations are about the same.)

      And that doesn't begin to address the fact that the look-up approach will not scale to register-wide operations!

      ... and enlightening.