Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re: Fast Identification Of String Difference

by AnomalousMonk (Abbot)
on Jan 17, 2011 at 03:31 UTC ( #882593=note: print w/ replies, xml ) Need Help??


in reply to Fast Identification Of String Difference

The 'classic' Perlish approach to this type of problem involves bitwise string boolean operations. The string  $diff generated by the bitwise-xor of characters in original sequence strings can be used to produce masks that can then be used to extract the differing sub-string sequences from the original strings.

use warnings; use strict; my $s1 = 'ACTGGACGTATGCA'; my $s2 = 'AGTG-ACGC-CGCA'; my $diff = $s1 ^ $s2; my @dpos; push @dpos, [ $-[1], $+[1] - $-[1] ] while $diff =~ m{ ([^\x00]+) }xmsg; print qq{diff at offset $_->[0], length $_->[1] \n} for @dpos; (my $mask = $diff) =~ tr{\x00}{\xff}c; $s1 &= $mask; $s2 &= $mask; my $differences = qr{ [^\x00]+ }xms; @dpos = (); while ($s1 =~ m{ ($differences) }xmsg) { # this code produces same result # my @diff_data = ($1); # $s2 =~ m{ ($differences) }xmsg; # push @diff_data, $1, $-[1]; # push @dpos, \@diff_data; push @dpos, [ $1, do { $s2 =~ m{ ($differences) }xmsg && $1, $-[1] } ] ; } print qq{@$_ \n} for @dpos;

Output:

diff at offset 1, length 1 diff at offset 4, length 1 diff at offset 8, length 3 C G 1 G - 4 TAT C-C 8

See @- and @+ in perlvar, also Bitwise Or and Exclusive Or and Bitwise And in perlop.

BrowserUk is very good on this general topic.

Update: Added better code example, doc links. And thanks to ELISHEVA.

Update: Fixed @- link above. What was I thinking?


Comment on Re: Fast Identification Of String Difference
Select or Download Code
Re^2: Fast Identification Of String Difference
by johngg (Abbot) on Jan 17, 2011 at 10:52 UTC

    Using a look-ahead with pos can save having to do the sums.

    knoppix@Microknoppix:~$ perl -E ' > $s1 = q{ACTGGACGTATGCA}; > $s2 = q{AGTG-ACGC-CGCA}; > $m = $s1 ^ $s2; > push @pos, pos( $m ) while $m =~ m{(?=([^\x00]))}g; > $m =~ tr{[\x01-\xfe]}{\xff}; > $c1 = $s1 & $m; > @p1 = $c1 =~ m{([^\x00])}g; > $c2 = $s2 & $m; > @p2 = $c2 =~ m{([^\x00])}g; > printf qq{%-2s%-2s%6d\n}, $p1[ $_ ], $p2[ $_ ], $pos[ $_ ] > for 0 .. $#pos;' C G 1 G - 4 T C 8 A - 9 T C 10 knoppix@Microknoppix:~$

    I hope this is of interest.

    Cheers,

    JohnGG

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://882593]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (13)
As of 2015-07-02 16:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (44 votes), past polls