http://www.perlmonks.org?node_id=766743

CaptainF has asked for the wisdom of the Perl Monks concerning the following question:

Here is an easy one for all of you gurus. There is a premium on efficiency, as I need to perform approximately 1 x 10^8 of the following comparisons. I need to evaluate whether a pair of strings of the same length are identical after excluding all positions that have an 'N' in one or both strings. Both strings will always consist of the character set ATGCN. For example, the comparison of $a and $b would meet my criterion:

$a = 'ATGNCNC';
$b = 'ATGACNN';

But $c and $d would not, because the first character differs at a position that does not contain an 'N' in either string:

$c = 'ATGNCNC';
$d = 'TTGNNNC';

Again, there is an extreme premium on efficiency here. Any thoughts?

Re: simple string comparison for efficiency
by GrandFather (Saint) on May 28, 2009 at 21:05 UTC

    Tell us about the larger picture. There is likely to be room for improving whatever you are doing at present, or at least for us to target solutions to whatever the bigger problem is. Most gains in performance are not obtained by micro-optimizations such as you are asking for. Some tricks that will give huge performance gains in some situations will cripple the code in other situations.

    That said, you can build masks and then XOR the strings to get something like the fast match you are looking for, provided the strings to be matched are fairly long. Consider:

    use strict;
    use warnings;

    my $strA = 'ATGNCNC';
    my $strB = 'ATGACNN';
    my $strC = 'TTGACNN';

    print $strA, (match ($strA, $strB) ? ' eq' : ' ne'), " $strB\n";
    print $strA, (match ($strA, $strC) ? ' eq' : ' ne'), " $strC\n";

    sub match {
        my ($mask1, $mask2) = @_;
        my ($str1, $str2)   = @_;

        $mask1 =~ tr/NATGC/\0\xFF/;    # N => "\0"; A, T, G, C => "\xFF"
        $mask2 =~ tr/NATGC/\0\xFF/;
        $mask1 &= $mask2;              # "\0" wherever either string has an 'N'
        $str1 ^= $str2;                # "\0" wherever the strings match
        $str1 &= $mask1;               # discard differences at 'N' positions
        return $str1 !~ /[^\x00]/;     # true if no differing byte survives
    }

    Prints:

    ATGNCNC eq ATGACNN
    ATGNCNC ne TTGACNN

    If you can cache the masks (say, if you are matching all strings against all others), then you get a greater gain.
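
    For example, caching might look something like the following minimal sketch (assuming @seqs holds your same-length ATGCN strings; the variable names here are illustrative):

    my @masks = map { (my $m = $_) =~ tr/NATGC/\0\xFF/; $m } @seqs;   # one mask per string, built once

    for my $i (0 .. $#seqs - 1) {
        for my $j ($i + 1 .. $#seqs) {
            my $diff = $seqs[$i] ^ $seqs[$j];
            $diff &= $masks[$i] & $masks[$j];    # reuse the cached masks
            print "$i matches $j\n" if $diff !~ /[^\x00]/;
        }
    }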


    True laziness is hard work
      GrandFather, your solution using bitwise operators would not have occurred to me, but it was exactly what I needed. It solved the problem several orders of magnitude faster than my own. Is there a simple way to extract from your code the number of string positions where one or both strings had an 'N'?

        Add the line:

        my $nCount = ($mask1 =~ tr/N//) + ($mask2 =~ tr/N//);

        as the third line of sub match.
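
        For reference, a minimal sketch of where that line lands. Note that it counts a position twice when both strings have an 'N' there; if you want positions where one or both strings had an 'N' counted once, you can count the NUL bytes in the combined mask instead:

        sub match {
            my ($mask1, $mask2) = @_;
            my ($str1, $str2)   = @_;
            my $nCount = ($mask1 =~ tr/N//) + ($mask2 =~ tr/N//);   # total 'N's in both strings

            $mask1 =~ tr/NATGC/\0\xFF/;
            $mask2 =~ tr/NATGC/\0\xFF/;
            $mask1 &= $mask2;
            my $nPositions = $mask1 =~ tr/\0//;   # positions with an 'N' in either string

            $str1 ^= $str2;
            $str1 &= $mask1;
            return $str1 !~ /[^\x00]/;   # $nCount/$nPositions would need returning or storing
        }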


        True laziness is hard work
Re: simple string comparison for efficiency
by Corion (Patriarch) on May 28, 2009 at 20:47 UTC

    Searching this site for recommendations of BioPerl and/or Fasta will likely give you lots of results. For example, Lower-casing Substrings and Iterating Two Files together has a problem very similar to yours, and also a solution. Of course, if the "extreme premium on efficiency" you cite is backed up by an extreme premium of money, you can achieve better results.

      Searching this site for recommendations of BioPerl...

      If you are looking for an efficient solution, stay away from BioPerl.

      citromatik

Re: simple string comparison for efficiency
by roboticus (Chancellor) on May 29, 2009 at 14:59 UTC
    CaptainF:

    I think the first thing I'd do is break the problem into a smaller set of problems, to reduce the number of comparisons to make. If you have 10,000 strings to compare, a brute-force approach has you making about 1x10^8 comparisons (roughly N^2). But if you can break your files into M roughly equal groups, you've already reduced the comparisons to roughly M * (N/M)^2 = N^2/M. For M=32, that's about 3.13x10^6 comparisons, a reduction of more than an order of magnitude. Breaking your data set into multiple small groups will give you a big time savings.

    (Yes, it's hand-wavy and I made no attempt to be rigorous. Obviously a degenerate case wouldn't do you any good, and some files will appear in multiple groups, of course. But I'm guessing that your data won't fall into the worst case. If you want rigor, please send a check for the appropriate amount, and I'll take care of it--Heh, heh!)
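
    A quick back-of-envelope check of those numbers (just the arithmetic above, as a sketch):

    my $N = 10_000;    # number of strings
    my $M = 32;        # number of roughly equal groups

    printf "brute force: %.3g comparisons\n", $N ** 2;              # ~1e8
    printf "grouped:     %.3g comparisons\n", $M * ($N / $M) ** 2;  # ~3.13e6 (= N^2/M)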

    An approach something like the following might be able to break your list of files into plenty of groups:

    # Input: @FileNames : a list of candidate files
    my %Candidates;

    # Step 1: Build a hash to group potentially similar strings
    {
        local $/ = \32;    # read the first 32 bytes of each file
        for my $curFile (@FileNames) {
            open my $inf, '<', $curFile or die "open '$curFile': $!";
            my $key = <$inf>;
            close $inf or die "close '$curFile': $!";

            # list of files matching the first 32 chars
            push @{ $Candidates{$key}{LIST} }, $curFile;

            # regex for matching similar groups ('N' matches any character)
            (my $tmp = $key) =~ s/N/./g;
            $Candidates{$key}{REGEX} = qr/^$tmp$/;
        }
    }

    # Step 2: Find compatible groups
    for my $key (keys %Candidates) {
        $Candidates{$key}{SIMILAR} = [
            grep { /$Candidates{$key}{REGEX}/ && $_ ne $key } keys %Candidates
        ];
    }

    # Step 3: Remove unmatchable items
    my %UnmatchableKeys;
    for my $key (keys %Candidates) {
        next if @{ $Candidates{$key}{SIMILAR} };
        next if @{ $Candidates{$key}{LIST} } > 1;
        $UnmatchableKeys{$key} = 1;
    }

    # Step 4: Generate lists of items needing to be compared
    for my $key (keys %Candidates) {
        next if exists $UnmatchableKeys{$key};
        my @similarFiles = @{ $Candidates{$key}{LIST} };
        for my $other (@{ $Candidates{$key}{SIMILAR} }) {
            push @similarFiles, @{ $Candidates{$other}{LIST} };
        }
        print "GROUP: ", join(", ", sort @similarFiles), "\n";
    }

    (Note: Horribly untested, and I probably botched a good bit of the syntax.)

    There are some improvements you can make, such as removing identical groups, etc. You can also apply this recursively, taking the next N bytes and further partitioning your sets until you run out of text.

    Another approach you might take is figuring out a coding scheme where you can compute a code for each string and build a hash of similar codes to find similar items. But I couldn't think of a suitable scheme that would work with wildcarded 'N' characters--hence the brute force approach.

    ...roboticus
Re: simple string comparison for efficiency
by AnomalousMonk (Archbishop) on Jun 03, 2009 at 02:45 UTC
    I was interested by the problem presented and wanted to play around with Inline C code in Strawberry Perl, so I came up with the code I have posted on AnomalousMonk's scratchpad.

    As one would expect from a C-coded solution, it is faster than the pure-Perl bitwise string differencing approach of GrandFather in Re: simple string comparison for efficiency: about 1000% faster for strings of 24,000 bases, and about 350% faster for 240,000 bases (on my laptop: YMMV).

    It uses an approach that, by happy accident, works for base characters A, C, T and G, and the don't-care character N. It can be extended a bit for other 'bases' (I know other characters are used to represent common base sequences), but sooner or later it will break down. There is a similar (and as yet untested) approach that I can think of that could reliably handle a much wider range of 'bases', but it would require a separate (but I think fairly fast), character-substitution encoding step to convert input files to a form amenable to differencing.

    Please take a look at the material on my scratchpad and, if you think it is of interest, I can post it in this thread or in a new thread as you think best.

      Using XOR and a lookup for this in C code is kinda pointless.

      The reason the XOR technique works relatively quickly in Perl is that accessing strings character by character in Perl is slow, as it involves a function call (substr) for every character. And subroutine calls (even built-ins; both are opcodes) are pretty slow in Perl.

      My bitwise XOR + tr technique allows you to compare all the bytes in the string using a single opcode, and it does so in a way that leaves behind a detectable signature at non-matching byte positions (which other comparison opcodes do not), allowing you to find where the differences are, or just count the differences with another single opcode.
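
      For instance (a minimal sketch; $str1 and $str2 stand for any two equal-length byte strings):

      my $str1 = 'ATGCA';
      my $str2 = 'ATGGA';

      my $diff   = $str1 ^ $str2;       # one opcode: "\0" at every matching position
      my $nDiffs = $diff =~ tr/\0//c;   # one opcode: count the non-NUL (differing) bytes
      print "$nDiffs\n";                # prints 1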

      But in C, accessing the bytes of a string is cheap. So, directly comparing the bytes against each other, and each against 'N', in a compound test with short-circuiting boolean operators:

      if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' ) {

      will be quicker than xoring the bytes and then checking the results against a lookup table:

      if (c_map[ str1[i] ^ str2[i] ]) {

      A simple C implementation:

      int seqDiff ( SV *s1, SV *s2 ) {
          STRLEN bytes1, bytes2;
          char *p1, *p2;
          int i;

          p1 = SvPV( s1, bytes1 );
          p2 = SvPV( s2, bytes2 );
          if( bytes1 != bytes2 ) return 0;

          for( i = 0; i < bytes1; ++i ) {
              if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' ) {
                  return 0;
              }
          }
          return 1;
      }

      runs substantially faster than your xor + lookup, especially on short strings:

      C:\test>SeqDiffDC_mark_1
      comparing strings of length 240003
                  Rate bitmask inline    buk
      bitmask   50.8/s      --   -92%   -95%
      inline     634/s   1146%     --   -34%
      buk        967/s   1802%    53%     --
      benchmark results expected

      C:\test>SeqDiffDC_mark_1 -LEN=10
      comparing strings of length 243
                  Rate bitmask inline    buk
      bitmask  25730/s      --   -65%   -93%
      inline   72894/s    183%     --   -81%
      buk     393465/s   1429%   440%     --
      benchmark results expected

      However, we can do even better, especially on the long strings. On 32- and 64-bit CPUs, accessing and manipulating byte registers is relatively slow compared to manipulating natural register-sized entities. By comparing the strings in 4- or 8-byte chunks, and only comparing individual bytes when a mismatch is detected, you can achieve further substantial reductions in the number of comparisons made. Using 32-bit compares:

      int seqDiff32 ( SV *s1, SV *s2 ) {
          STRLEN bytes1, bytes2;
          char *p1, *p2;
          STRLEN dwords;
          STRLEN lastBytes;
          U32 *up1, *up2;
          int i;

          p1 = SvPV( s1, bytes1 );
          p2 = SvPV( s2, bytes2 );
          if( bytes1 != bytes2 ) return 0;

          up1 = (U32*)p1;
          up2 = (U32*)p2;
          dwords = bytes1 >> 2;
          lastBytes = bytes1 % 4;
          p1 += dwords << 2;
          p2 += dwords << 2;

          while( dwords-- ) {
              if( *up1 != *up2 ) {
                  /* compare bytes only when the 32-bit compare fails */
                  for( i = 0; i < 4; ++i ) {
                      char c1 = ( (char*)up1 )[ i ];
                      char c2 = ( (char*)up2 )[ i ];
                      if( c1 != c2 && c1 != 'N' && c2 != 'N' ) {
                          return 0;
                      }
                  }
              }
              up1++;
              up2++;
          }

          /* handle the trailing bytes, if any */
          for( i = 0; i < lastBytes; ++i ) {
              char c1 = ( (char*)p1 )[ i ];
              char c2 = ( (char*)p2 )[ i ];
              if( c1 != c2 && c1 != 'N' && c2 != 'N' ) {
                  return 0;
              }
          }
          return 1;
      }
      C:\test>SeqDiffDC_mark_1 -LEN=1
      comparing strings of length 27
                  Rate bitmask inline  buk64    buk  buk32
      bitmask  43724/s      --   -42%   -93%   -93%   -93%
      inline   75272/s     72%     --   -88%   -88%   -88%
      buk64   613799/s   1304%   715%     --    -1%    -6%
      buk     620545/s   1319%   724%     1%     --    -5%
      buk32   651674/s   1390%   766%     6%     5%     --

      C:\test>SeqDiffDC_mark_1 -LEN=10
      comparing strings of length 243
                  Rate bitmask inline    buk  buk64  buk32
      bitmask  25730/s      --   -65%   -93%   -94%   -95%
      inline   72894/s    183%     --   -81%   -84%   -86%
      buk     393465/s   1429%   440%     --   -12%   -22%
      buk64   447289/s   1638%   514%    14%     --   -11%
      buk32   504399/s   1860%   592%    28%    13%     --
      benchmark results expected

      comparing strings of length 240003
                  Rate bitmask inline    buk  buk64  buk32
      bitmask   50.9/s      --   -92%   -95%   -97%   -98%
      inline     629/s   1136%     --   -35%   -59%   -75%
      buk        962/s   1790%    53%     --   -38%   -62%
      buk64     1545/s   2935%   146%    61%     --   -39%
      buk32     2539/s   4889%   304%   164%    64%     --
      benchmark results expected

      I also tried a 64-bit version, which intuition says should be faster still, but that proved not to be so. At first that confused me, but looking more closely at your test data I realised the reason: no single 8-byte compare will ever succeed, because your test strings have differences in every 8-byte unit ('ATGCATCGATGN' vs. 'ATGCATNGATGN'), so the 64-bit version has to compare 8 bytes for each failure rather than 4 with the 32-bit version. If I modify the strings so that they are substantially similar:

      my $seq1 = 'ATGCATCGATGN' x $LEN;
      my $seq2 = 'ATGCATCGATGN' x $LEN;

      then the 64-bit version wins hands down as expected:

      C:\test>SeqDiffDC_mark_1 -LEN=1
      comparing strings of length 27
                  Rate bitmask inline    buk  buk32  buk64
      bitmask  43145/s      --   -43%   -93%   -94%   -94%
      inline   76220/s     77%     --   -88%   -89%   -89%
      buk     651920/s   1411%   755%     --    -2%    -2%
      buk32   664615/s   1440%   772%     2%     --    -1%
      buk64   668000/s   1448%   776%     2%     1%     --
      benchmark results expected

      C:\test>SeqDiffDC_mark_1 -LEN=10
      comparing strings of length 243
                  Rate bitmask inline    buk  buk32  buk64
      bitmask  25665/s      --   -65%   -94%   -96%   -96%
      inline   73346/s    186%     --   -84%   -88%   -88%
      buk     450664/s   1656%   514%     --   -23%   -29%
      buk32   587556/s   2189%   701%    30%     --    -7%
      buk64   632327/s   2364%   762%    40%     8%     --
      benchmark results expected

      C:\test>SeqDiffDC_mark_1 -LEN=100
      comparing strings of length 2403
                  Rate bitmask inline    buk  buk32  buk64
      bitmask   4723/s      --   -87%   -96%   -99%   -99%
      inline   37224/s    688%     --   -66%   -88%   -91%
      buk     108924/s   2206%   193%     --   -66%   -75%
      buk32   321561/s   6709%   764%   195%     --   -26%
      buk64   432770/s   9063%  1063%   297%    35%     --
      benchmark results expected

      C:\test>SeqDiffDC_mark_1 -LEN=1000
      comparing strings of length 24003
                  Rate bitmask inline    buk  buk32  buk64
      bitmask    506/s      --   -91%   -96%   -99%   -99%
      inline    5834/s   1054%     --   -54%   -90%   -94%
      buk      12793/s   2431%   119%     --   -77%   -86%
      buk32    55949/s  10966%   859%   337%     --   -41%
      buk64    94726/s  18637%  1524%   640%    69%     --
      benchmark results expected

      C:\test>SeqDiffDC_mark_1 -LEN=10000
      comparing strings of length 240003
                  Rate bitmask inline    buk  buk32  buk64
      bitmask   51.3/s      --   -92%   -96%   -99%  -100%
      inline     635/s   1139%     --   -51%   -90%   -94%
      buk       1294/s   2425%   104%     --   -79%   -88%
      buk32     6087/s  11774%   858%   370%     --   -45%
      buk64    11162/s  21673%  1657%   762%    83%     --
      benchmark results expected

      /msg me if you want the modified version of your benchmark.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Most interesting...

        I realize I had naively (and unquestioningly) assumed that the look-up approach would allow one to effectively combine multiple operations into one, thus saving time. But now I see that
            p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N'
        will (assuming most characters are equal) boil down to something close to two indexed accesses on average, whereas
            c_map[ str1[i] ^ str2[i] ]
        will always involve three accesses. (I assume the costs of the != and ^ operations are about the same.)

        And that doesn't begin to address the fact that the look-up approach will not scale to register-wide operations!

        ... and enlightening.