Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Similarity of strings

by professa (Beadle)
on May 15, 2002 at 14:31 UTC ( #166724=perlquestion: print w/ replies, xml ) Need Help??
professa has asked for the wisdom of the Perl Monks concerning the following question:

Hi dear monks!

I'm in the need of checking 2 strings for degree of identity, and right now I'm doing this by checking char by char via 'substr()':

for (my $i = 0; $i < $len; $i++) { $score += ((substr($ref_seq, $i, 1)) eq (substr($test_seq, $i, 1))) +; } return ($score / $len);

That's pretty straightforward, but also pretty slow. The strings are two aligned aminoacid sequences like this one:
string1: "EKFCNSVVDLYSNLSKPFLDIVLYIFKLTSAIGAQGPA----SMMAYLVV" string2: "ERFCRQLSSMASKLI-------ISPFTLVYYTYQCFQSTGWPVSIFGYFI"

Is there another way to get the percent of identity of such strings like in my simple code above but more efficient and faster?
I found a hint to 'String::Approx' here on Perlmonks, but I don't really know if that's what I need, since it talks about distances and I don't know if that really is comparable to my percentage identity.

Cheerz, Micha

Comment on Similarity of strings
Select or Download Code
Re: Similarity of strings
by Beatnik (Parson) on May 15, 2002 at 14:48 UTC
    You can always try Text::Soundex but you should split your strings up in smaller bits for it to be effective...
    use Text::Soundex; if (soundex($string1) eq soundex($string2)) { print "Bingo!" } else { +print "bummer"; }
    String::Approx seems a good solution too... perhaps a better one than Text::Soundex *sniff*

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Similarity of strings
by Sidhekin (Priest) on May 15, 2002 at 14:48 UTC

    substr() is pretty fast, but I guess this will be faster. I think :-)

    return map {tr/\0/\0/ / $len} $ref_seq ^ $test_seq;

    Update: If you are using the return value in scalar context, the above will yield 1, of course. Sorry about that.

    Try this:

    my($score) = map {tr/\0// / $len} $ref_seq ^ $test_seq; return $score;

    ... or even (might be faster with many short strings):

    return +(map{tr/\0// / $len} $ref_seq ^ $test_seq)[0];

    The Sidhekin
    print "Just another Perl ${\(trickster and hacker)},"

      This code does not work for me, I get 100% similarity for each strings compared. But I'm not bright enough at the moment to see where's the bug here...

      Micha

Re: Similarity of strings
by ariels (Curate) on May 15, 2002 at 14:51 UTC

    Use the XOR!

    my $x = "EKFCNSVVDLYSNLSKPFLDIVLYIFKLTSAIGAQGPA----SMMAYLVV"; my $y = "ERFCRQLSSMASKLI-------ISPFTLVYYTYQCFQSTGWPVSIFGYFI"; my $z = $x^$y; my $score = ($z =~ tr/\0//); return $score/length($x)

      XOR seems to be the fastest method! It needs ~3 seconds instead of ~7 seconds using SUBSTR. Well, more than twice the speed, cool! ;-)

      Thanx, Micha

Re: Similarity of strings
by VSarkiss (Monsignor) on May 15, 2002 at 14:52 UTC

    I'm not sure if this is exactly what you're looking for, but you can find longest common subsequences with Algorithm::Diff. With that, you could find the lengths of the differing sequences and divide that by the total length. (If you have trouble understanding the documentation for Algorithm::Diff, I wrote a module review which may help.

    If the code snippet you have above is an accurate description of what you're trying to calculate, it may be faster (though more memory-intensive) to split up the strings into arrays and compare an element at a time, rather than calling substr over and over. Something like this: (note, this is untested)

    my @ref_elems = split //, $ref_seq; my @test_elems = split //, $test_seq; my $score = 0; for (my $i = 0; $i < $len; $i++) { $score += $ref_elems[$i] eq $test_elems[$i]; }
    Once you have the sequences in arrays, you can use all kinds of nifty techniques like mapcar, which can traverse both arrays in one neat statement. The top of that node has a very clear explanation of how to use it.

    HTH

      I tested splitting the strings up into arrays and timed the two methods (simply via 'time script.pl').
      The split-method takes ~13 seconds to finish, the substr-method only ~7 seconds.
      The advantages of having the data ready in arrays doesn't count for me, I just need the percentage of similarity, and as fast as possible. ;-)
      I'll try out the rest of the suggested methods here and report which does best.

      Thanx, Micha

Re: Similarity of strings
by jmcnamara (Monsignor) on May 15, 2002 at 15:13 UTC

    I'd guess that the xor method is probably quicker but you could benchmark these as well:
    # Method 1 my @a1 = split //, $ref_seq; my @a2 = split //, $test_seq; $score += ($a1[$_] eq $a2[$_]) for 0 .. $#a1; # Method 2, Destructive $score += (chop $ref_seq eq chop $test_seq) while $ref_seq;

    --
    John.

      chop is a bit faster than substr, it needs ~5 seconds instead of 7 seconds.

      Thanx, Micha

Some of the above suggestions, benchmarked (Re: Similarity of strings)
by cmilfo (Hermit) on May 15, 2002 at 19:08 UTC
    I tried to gather some of the ideas above and merge them into one benchmarked thing. Also, I benchmarked replacing the split // and chop $var while $var with unpack. Anyway, hope you enjoy.

    #!/usr/bin/perl use Benchmark qw(timethese cmpthese); my $string1 = 'This is a test of the emergency system!'; my $string2 = 'This is a test of the emergency broadcast system!'; my $template1 = "a1" x (length $string1); my $template2 = "a1" x (length $string2); print &xor_o, "\n", &split_o, "\n", &unpack_o, "\n", &chop_o, "\n", &chop_o2, "\n"; my $results = timethese( 100000, { XOR => \&xor_o, SPLIT => \&split_o, UNPACK => \&unpack_o, CHOP => \&chop_o, CHOP2 => \&chop_o2 } ); cmpthese($results); sub xor_o { my $xor = $string1^$string2; return ($xor =~ tr/\0//)/length($string1); } sub split_o { my @string1 = split //, $string1; my @string2 = split //, $string2; my $score = 0; my $length = scalar @string1; for (my $i = 0; $i < $length; $i++) { $score += ($string1[$i] eq $string2[$i]); } return $score/$length; } sub unpack_o { my @string1 = (unpack $template1, $string1); my @string2 = (unpack $template2, $string2); my $score = 0; my $length = scalar @string1; for (my $i = 0; $i < $length; $i++) { $score += ($string1[$i] eq $string2[$i]); } return $score/$length; } sub chop_o { my @string1 = (); my @string2 = (); my $score = 0; my $rstring1 = scalar reverse $string1; my $rstring2 = scalar reverse $string2; push @string1, (chop $rstring1) while $rstring1; push @string2, (chop $rstring2) while $rstring2; my $length = scalar @string1; for (my $i = 0; $i < $length; $i++) { $score += ($string1[$i] eq $string2[$i]); } return $score/$length; } sub chop_o2 { my $str1 = $string1; my $str2 = $string2; my $length = length $string1; my $score; $score += (chop $str1 eq chop $str2) while $str1; return $score/$length; } Benchmark: timing 100000 iterations of CHOP, CHOP2, SPLIT, UNPACK, XOR +... CHOP: 43 wallclock secs (37.29 usr + 0.05 sys = 37.34 CPU) @ 26 +78.09/s (n=100000) CHOP2: 9 wallclock secs ( 7.34 usr + 0.03 sys = 7.37 CPU) @ 13 +568.52/s (n=100000) SPLIT: 47 wallclock secs (38.76 usr + 0.07 sys = 38.83 CPU) @ 25 +75.33/s (n=100000) UNPACK: 34 wallclock secs (29.51 usr + 0.03 sys = 29.54 CPU) @ 33 +85.24/s (n=100000) XOR: 2 wallclock secs ( 0.92 usr + 0.00 sys = 0.92 CPU) @ 10 +8695.65/s (n=100000) Rate SPLIT CHOP UNPACK CHOP2 XOR SPLIT 2575/s -- -4% -24% -81% -98% CHOP 2678/s 4% -- -21% -80% -98% UNPACK 3385/s 31% 26% -- -75% -97% CHOP2 13569/s 427% 407% 301% -- -88% XOR 108696/s 4121% 3959% 3111% 701% --
    Update: My apologies for the length @array lines above. That's what I get for yanking and putting. :) Thanks to those who caught it. New benchmarks are now shown.

    Update2: I've also added the chop2 implemented in the comment below.

      I'm glad that someone benchmarked this.

      However, you were a little bit unfair to the chop method. :-) The scalar reverse and array assignments aren't necessary. The following is 5 times faster (although still 5 times slower than the xor method):

      sub chop2 { my $str1 = $string1; my $str2 = $string2; my $length = length $string1; my $score; $score += (chop $str1 eq chop $str2) while $str1; return $score/$length; }

      Update: Albannach points out that because the strings in this test are not of equal length, the reverse is required. My code was based on the original sample data.

      Also, it is worth adding that the speed of the xor method is less dependent on the string length than the other methods.

      --
      John.

Re: Similarity of strings
by Anonymous Monk on May 15, 2002 at 22:49 UTC
      Yep, I used BLAST, followed my mview (http://mathbio.nimr.mrc.ac.uk/~nbrown/mview/) to prepare an alignment from the BLAST-report.
      But MVIEW doesn't provide the similarity stuff (but does provide e-value, score, and some other things) and I don't have the guts to parse the complete BLAST-report (which can be very big sometimes) just to get the similarity.
      I prefer to check only those sequences which are of interest to me directly.

      Cheerz, Micha

        Perhaps bioperl has a BLAST parser you could use? (I've never used their parsers, so I cannot comment further.)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://166724]
Approved by broquaint
Front-paged by rob_au
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (4)
As of 2014-10-26 01:51 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (149 votes), past polls