CaptainF has asked for the wisdom of the Perl Monks concerning the following question:

Here is an easy one for all of you gurus. There is a premium on efficiency, as I need to perform the following comparison approximately 1 x 10^8 times. I need to evaluate whether a pair of strings of the same length are identical after excluding all positions that have an 'N' in one or both strings. Both strings will always consist of the character set ATGCN. For example, the comparison of $a and $b would meet my criterion: $a = 'ATGNCNC'; $b = 'ATGACNN'; But $c and $d would not: $c = 'ATGNCNC'; $d = 'TTGNNNC'; (because the first character differs at a position that does not contain an 'N' in either string). Again, there is an extreme premium on efficiency here. Any thoughts?

Replies are listed 'Best First'.
Re: simple string comparison for efficiency
by GrandFather (Saint) on May 28, 2009 at 21:05 UTC

Tell us about the larger picture. There is likely to be room for improving whatever you are doing at present, or at least for us to target solutions to whatever the bigger problem is. Most gains in performance are not obtained by micro-optimizations such as you are asking for. Some tricks that will give huge performance gains in some situations will cripple the code in other situations.

That said, you can build masks then xor the strings to give something like the fast match you are looking for where the strings to be matched are fairly long. Consider:

```
use strict;
use warnings;

my $strA = 'ATGNCNC';
my $strB = 'ATGACNN';
my $strC = 'TTGACNN';

print $strA, (match ($strA, $strB) ? ' eq' : ' ne'), " $strB\n";
print $strA, (match ($strA, $strC) ? ' eq' : ' ne'), " $strC\n";

sub match {
    my ($str1, $str2) = @_;
    my $mask1 = $str1;
    my $mask2 = $str2;

    # Build masks: "\x00" at each 'N' position, "\xFF" everywhere else
    $mask1 =~ tr/N/\x00/;
    $mask1 =~ tr/\x00/\xFF/c;
    $mask2 =~ tr/N/\x00/;
    $mask2 =~ tr/\x00/\xFF/c;

    # XOR gives "\x00" wherever the characters match;
    # then mask out positions where either string had an 'N'
    $str1 ^= $str2;
    $str1 &= $mask1 & $mask2;

    return $str1 !~ /[^\x00]/;
}
```

Prints:

```
ATGNCNC eq ATGACNN
ATGNCNC ne TTGACNN
```

If you can cache the masks (say you were matching all strings against all others for example) then you get a greater gain.
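The caching idea can be sketched in C as well (the function names here are my own invention, not from the thread): each string's mask is built exactly once, so in an all-against-all comparison the per-pair work reduces to a single XOR-and-mask pass.

```c
#include <stdlib.h>

/* Precompute a per-string mask: 0x00 at 'N' positions, 0xFF elsewhere.
 * In an all-pairs comparison each mask is built once and reused for
 * every pairing instead of being rebuilt per pair.
 * (Error handling omitted for brevity.) */
unsigned char *build_mask(const char *s, size_t len) {
    unsigned char *m = malloc(len);
    for (size_t i = 0; i < len; ++i)
        m[i] = (s[i] == 'N') ? 0x00 : 0xFF;
    return m;
}

/* Strings match if, at every position kept by both masks,
 * the XOR of the characters is zero. */
int masked_match(const char *a, const unsigned char *ma,
                 const char *b, const unsigned char *mb, size_t len) {
    for (size_t i = 0; i < len; ++i)
        if ((a[i] ^ b[i]) & ma[i] & mb[i])
            return 0;
    return 1;
}
```

With N strings, the masks cost O(N) work total while the comparisons cost O(N^2), so the saving grows with the number of strings each mask is reused against.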

True laziness is hard work
GrandFather, your solution using bitwise operators would not have occurred to me, but was exactly what I needed. It solved the problem several orders of magnitude faster than my solution. Is there a simple way to extract from your code the number of string positions where one or both strings had an 'N'?

```
    my $nCount = ($mask1 =~ tr/N//) + ($mask2 =~ tr/N//);
```

placed near the top of sub match, while $mask1 and $mask2 still contain their 'N' characters.

True laziness is hard work
Re: simple string comparison for efficiency
by Corion (Pope) on May 28, 2009 at 20:47 UTC

Searching this site for recommendations of BioPerl and/or Fasta will likely give you lots of results. For example, Lower-casing Substrings and Iterating Two Files together has a problem very similar to yours and also a solution. Of course, if the "extreme premium on efficiency" you cite will be backed up by an extreme premium of money, you can achieve better results.

Searching this site for recommendations of BioPerl...

If you are looking for an efficient solution, stay away from BioPerl

citromatik

Re: simple string comparison for efficiency
by roboticus (Chancellor) on May 29, 2009 at 14:59 UTC
CaptainF:

I think the first thing I'd do is break the problem into a smaller set of problems to reduce the number of comparisons to make. If you have 10,000 strings to compare, a brute-force approach has you making about 1x10^8 comparisons (roughly N^2). But if you can break your files into M roughly equal groups, you've already reduced the comparisons to roughly M * (N/M)^2. For M=32, that's about 3.13x10^6 comparisons, a reduction of nearly two orders of magnitude. Breaking your data set into multiple small groups will give you a big time savings.

(Yes, it's hand-wavy and I made no attempt to be rigorous. Obviously a degenerate case wouldn't do you any good, and some files will appear in multiple groups, of course. But I'm guessing that your data won't fall into the worst case. If you want rigor, please send a check for the appropriate amount, and I'll take care of it--Heh, heh!)

An approach something like the following might be able to break your list of files into plenty of groups:

```
Input:
    @FileNames : A list of candidate files

# Step 1: Build a hash to group potentially similar strings
my %Candidates;
{
    local $/ = \32;    # read the first 32 bytes of each file
    for my $curFile (@FileNames) {
        open my $inf, '<', $curFile or die "open '$curFile': $!";
        my $key = <$inf>;
        close $inf or die "close '$curFile': $!";

        # list of files matching the first 32 chars
        push @{ $Candidates{$key}{LIST} }, $curFile;

        # regex for matching similar groups ('N' is a wildcard)
        (my $tmp = $key) =~ s/N/./g;
        $Candidates{$key}{REGEX} = qr/^$tmp\z/;
    }
}

# Step 2: Merge compatible groups
for my $key (keys %Candidates) {
    $Candidates{$key}{SIMILAR} = [
        grep { $_ =~ $Candidates{$key}{REGEX} and $_ ne $key }
            keys %Candidates
    ];
}

# Step 3: Remove unmatchable items
my %UnmatchableKeys;
for my $key (keys %Candidates) {
    next if @{ $Candidates{$key}{SIMILAR} };
    next if @{ $Candidates{$key}{LIST} } > 1;
    $UnmatchableKeys{$key} = 1;
}

# Step 4: Generate lists of items needing to be compared
for my $key (keys %Candidates) {
    next if exists $UnmatchableKeys{$key};
    my @similarFiles = @{ $Candidates{$key}{LIST} };
    for my $other (@{ $Candidates{$key}{SIMILAR} }) {
        push @similarFiles, @{ $Candidates{$other}{LIST} };
    }
    print "GROUP: ", join(", ", sort @similarFiles), "\n";
}
```

(Note: Horribly untested, and I probably botched a good bit of the syntax.)

There are some improvements you can make, such as removing identical groups, etc. You can also apply this recursively, taking the next block of bytes and further partitioning your sets until you run out of text.

Another approach you might take is figuring out a coding scheme where you can compute a code for each string and build a hash of similar codes to find similar items. But I couldn't think of a suitable scheme that would work with wildcarded 'N' characters--hence the brute force approach.

...roboticus
Re: simple string comparison for efficiency
by AnomalousMonk (Bishop) on Jun 03, 2009 at 02:45 UTC
I was interested by the problem presented and wanted to play around with Inline C code in Strawberry Perl, so I came up with the code I have posted on AnomalousMonk's scratchpad.

As one would expect from a C-coded solution, it is faster than the Pure Perl bitwise string differencing approach of GrandFather in Re: simple string comparison for efficiency: about 1000% for strings of 24,000 bases, about 350% for 240,000 bases (on my laptop: YMMV).

It uses an approach that, by happy accident, works for base characters A, C, T and G, and the don't-care character N. It can be extended a bit for other 'bases' (I know other characters are used to represent common base sequences), but sooner or later it will break down. There is a similar (and as yet untested) approach that I can think of that could reliably handle a much wider range of 'bases', but it would require a separate (but I think fairly fast), character-substitution encoding step to convert input files to a form amenable to differencing.

Please take a look at the material on my scratchpad and, if you think it is of interest, I can post it in this thread or in a new thread as you think best.

Using XOR and a lookup table for this in C code is kinda pointless.

The reason the XOR technique works relatively quickly in Perl is that accessing strings character by character in Perl is slow: it involves a function call (substr) for every character. And subroutine calls (even built-ins; they're both opcodes) are pretty slow in Perl.

My bitwise XOR + tr technique allows you to compare all the bytes in the string using a single opcode. And it does so in a way that leaves behind a detectable signature at non-matching byte positions (which other comparison opcodes do not), which allows you to find where the differences are. Or just count the differences with another single opcode.

But in C, accessing the bytes of a string is cheap. So, directly comparing the bytes against each other, and each against 'N', in a compound test with short-circuiting boolean operators:

```
if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' ) {
```

will be quicker than xoring the bytes and then checking the results against a lookup table:

```
if (c_map[ str1[i] ^ str2[i] ]) {
```
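The thread never shows how c_map would be built. One plausible construction (it relies on the same "happy accident" AnomalousMonk mentions: for the alphabet ATGCN, no XOR of two distinct non-N bases collides with any base^'N' value) would be:

```c
#include <string.h>

/* One possible c_map construction: index by the XOR of two bytes;
 * an entry of 1 flags a real mismatch, 0 means "equal or don't-care".
 * This works because the base pairs give A^C=0x02, A^G=0x06, A^T=0x15,
 * C^G=0x04, C^T=0x17, G^T=0x13, none of which collides with the
 * don't-care values A^N=0x0F, C^N=0x0D, G^N=0x09, T^N=0x1A. */
static char c_map[256];

void init_c_map(void) {
    const char *bases = "ACGT";
    memset(c_map, 1, sizeof c_map);
    c_map[0] = 0;                  /* identical bytes, including N^N */
    for (int i = 0; i < 4; ++i)
        c_map[bases[i] ^ 'N'] = 0; /* 'N' matches anything */
}

/* 1 if the strings match after ignoring 'N' positions, else 0 */
int seqMatch(const char *str1, const char *str2, size_t len) {
    for (size_t i = 0; i < len; ++i)
        if (c_map[(unsigned char)(str1[i] ^ str2[i])])
            return 0;
    return 1;
}
```

This is a sketch of the lookup approach being argued against here, included only so the comparison with the direct test is concrete.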

A simple C implementation:

```
int seqDiff ( SV *s1, SV *s2 ) {
    STRLEN bytes1, bytes2;
    char *p1, *p2;
    int i;

    p1 = SvPV( s1, bytes1 );
    p2 = SvPV( s2, bytes2 );
    if( bytes1 != bytes2 ) return 0;

    for( i = 0; i < bytes1; ++i ) {
        if( p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N' ) {
            return 0;
        }
    }
    return 1;
}
```

runs substantially faster than your xor + lookup, especially on short strings:

```
C:\test>SeqDiffDC_mark_1
comparing strings of length 240003
inline   634/s   1146%      --    -34%
buk      967/s   1802%     53%      --
```

```
C:\test>SeqDiffDC_mark_1 -LEN=10
comparing strings of length 243
inline   72894/s    183%      --    -81%
buk     393465/s   1429%    440%      --
```

However, we can do even better, especially on the long strings. On 32 & 64-bit cpus, accessing and manipulating byte registers is relatively slow compared to manipulating natural register-sized entities. By comparing the strings in 4 or 8-byte chunks, and only comparing bytes when a mismatch is detected, you can achieve further substantial reductions in the number of comparisons made. Using 32-bit compares:

```
int seqDiff32 ( SV *s1, SV *s2 ) {
    STRLEN bytes1, bytes2;
    char *p1, *p2;
    STRLEN dwords;
    STRLEN lastBytes;
    U32 *up1, *up2;
    int i;

    p1 = SvPV( s1, bytes1 );
    p2 = SvPV( s2, bytes2 );
    if( bytes1 != bytes2 ) return 0;

    up1 = (U32*)p1;
    up2 = (U32*)p2;
    dwords = bytes1 >> 2;
    lastBytes = bytes1 % 4;
    p1 += dwords << 2;
    p2 += dwords << 2;

    while( dwords-- ) {
        if( *up1 != *up2 ) {
            for( i = 0; i < 4; ++i ) {
                char c1 = ( (char*)up1 )[ i ];
                char c2 = ( (char*)up2 )[ i ];
                if( c1 != c2 && c1 != 'N' && c2 != 'N' ) {
                    return 0;
                }
            }
        }
        up1++; up2++;
    }

    for( i = 0; i < lastBytes; ++i ) {
        char c1 = p1[ i ];
        char c2 = p2[ i ];
        if( c1 != c2 && c1 != 'N' && c2 != 'N' ) {
            return 0;
        }
    }
    return 1;
}
```
```
C:\test>SeqDiffDC_mark_1 -LEN=1
comparing strings of length 27
Rate bitmask  inline   buk64     buk   buk32
bitmask  43724/s      --    -42%    -93%    -93%    -93%
inline   75272/s     72%      --    -88%    -88%    -88%
buk64   613799/s   1304%    715%      --     -1%     -6%
buk     620545/s   1319%    724%      1%      --     -5%
buk32   651674/s   1390%    766%      6%      5%      --
```

```
C:\test>SeqDiffDC_mark_1 -LEN=10
comparing strings of length 243
Rate bitmask  inline     buk   buk64   buk32
bitmask  25730/s      --    -65%    -93%    -94%    -95%
inline   72894/s    183%      --    -81%    -84%    -86%
buk     393465/s   1429%    440%      --    -12%    -22%
buk64   447289/s   1638%    514%     14%      --    -11%
buk32   504399/s   1860%    592%     28%     13%      --
```

```
comparing strings of length 240003
Rate bitmask  inline     buk   buk64   buk32
bitmask 50.9/s      --    -92%    -95%    -97%    -98%
inline   629/s   1136%      --    -35%    -59%    -75%
buk      962/s   1790%     53%      --    -38%    -62%
buk64   1545/s   2935%    146%     61%      --    -39%
buk32   2539/s   4889%    304%    164%     64%      --
```

I also tried a 64-bit version which intuition says should be faster still, but that proved not to be so. At first that confused me, but looking more closely at your test data I realised the reason: no single 8-byte compare will succeed--as your test strings have differences in every 8-byte unit: 'ATGCATCGATGN' vs. 'ATGCATNGATGN'--and the 64-bit version has to compare 8 bytes for each failure rather than 4 with the 32-bit version. If I modify the strings so that they are substantially similar:

```
my $seq1 = 'ATGCATCGATGN' x $LEN;
my $seq2 = 'ATGCATCGATGN' x $LEN;
```

then the 64-bit version wins hands down as expected:

```
C:\test>SeqDiffDC_mark_1 -LEN=1
comparing strings of length 27
Rate bitmask  inline     buk   buk32   buk64
bitmask  43145/s      --    -43%    -93%    -94%    -94%
inline   76220/s     77%      --    -88%    -89%    -89%
buk     651920/s   1411%    755%      --     -2%     -2%
buk32   664615/s   1440%    772%      2%      --     -1%
buk64   668000/s   1448%    776%      2%      1%      --
```

```
C:\test>SeqDiffDC_mark_1 -LEN=10
comparing strings of length 243
Rate bitmask  inline     buk   buk32   buk64
bitmask  25665/s      --    -65%    -94%    -96%    -96%
inline   73346/s    186%      --    -84%    -88%    -88%
buk     450664/s   1656%    514%      --    -23%    -29%
buk32   587556/s   2189%    701%     30%      --     -7%
buk64   632327/s   2364%    762%     40%      8%      --
```

```
C:\test>SeqDiffDC_mark_1 -LEN=100
comparing strings of length 2403
Rate bitmask  inline     buk   buk32   buk64
bitmask   4723/s      --    -87%    -96%    -99%    -99%
inline   37224/s    688%      --    -66%    -88%    -91%
buk     108924/s   2206%    193%      --    -66%    -75%
buk32   321561/s   6709%    764%    195%      --    -26%
buk64   432770/s   9063%   1063%    297%     35%      --
```

```
C:\test>SeqDiffDC_mark_1 -LEN=1000
comparing strings of length 24003
Rate bitmask  inline     buk   buk32   buk64
bitmask   506/s      --    -91%    -96%    -99%    -99%
inline   5834/s   1054%      --    -54%    -90%    -94%
buk     12793/s   2431%    119%      --    -77%    -86%
buk32   55949/s  10966%    859%    337%      --    -41%
buk64   94726/s  18637%   1524%    640%     69%      --
```

```
C:\test>SeqDiffDC_mark_1 -LEN=10000
comparing strings of length 240003
Rate bitmask  inline     buk   buk32   buk64
bitmask  51.3/s      --    -92%    -96%    -99%   -100%
inline    635/s   1139%      --    -51%    -90%    -94%
buk      1294/s   2425%    104%      --    -79%    -88%
buk32    6087/s  11774%    858%    370%      --    -45%
buk64   11162/s  21673%   1657%    762%     83%      --
```

/msg me if you want the modified version of your benchmark.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Most interesting...

I realize I had naively (and unquestioningly) assumed that the look-up approach would allow one to effectively combine multiple operations into one, thus saving time. But now I see that
p1[ i ] != p2[ i ] && p1[ i ] != 'N' && p2[ i ] != 'N'
will (assuming most characters are equal) boil down to something close to two indexed accesses on average, whereas
c_map[ str1[i] ^ str2[i] ]
will always involve three accesses. (I assume the costs of the != and ^ operations are about the same.)

And that doesn't begin to address the fact that the look-up approach will not scale to register-wide operations!

... and enlightening.