Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: Fast Identification Of String Difference

by AnomalousMonk (Archbishop)
on Jan 17, 2011 at 03:31 UTC ( [id://882593]=note: print w/replies, xml ) Need Help??


in reply to Fast Identification Of String Difference

The 'classic' Perlish approach to this type of problem involves bitwise string boolean operations. The string  $diff generated by the bitwise-xor of characters in original sequence strings can be used to produce masks that can then be used to extract the differing sub-string sequences from the original strings.

use warnings; use strict; my $s1 = 'ACTGGACGTATGCA'; my $s2 = 'AGTG-ACGC-CGCA'; my $diff = $s1 ^ $s2; my @dpos; push @dpos, [ $-[1], $+[1] - $-[1] ] while $diff =~ m{ ([^\x00]+) }xmsg; print qq{diff at offset $_->[0], length $_->[1] \n} for @dpos; (my $mask = $diff) =~ tr{\x00}{\xff}c; $s1 &= $mask; $s2 &= $mask; my $differences = qr{ [^\x00]+ }xms; @dpos = (); while ($s1 =~ m{ ($differences) }xmsg) { # this code produces same result # my @diff_data = ($1); # $s2 =~ m{ ($differences) }xmsg; # push @diff_data, $1, $-[1]; # push @dpos, \@diff_data; push @dpos, [ $1, do { $s2 =~ m{ ($differences) }xmsg && $1, $-[1] } ] ; } print qq{@$_ \n} for @dpos;

Output:

diff at offset 1, length 1 diff at offset 4, length 1 diff at offset 8, length 3 C G 1 G - 4 TAT C-C 8

See @- and @+ in perlvar, also Bitwise Or and Exclusive Or and Bitwise And in perlop.

BrowserUk is very good on this general topic.

Update: Added better code example, doc links. And thanks to ELISHEVA.

Update: Fixed @- link above. What was I thinking?

Replies are listed 'Best First'.
Re^2: Fast Identification Of String Difference
by johngg (Canon) on Jan 17, 2011 at 10:52 UTC

    Using a look-ahead with pos can save having to do the sums.

    knoppix@Microknoppix:~$ perl -E ' > $s1 = q{ACTGGACGTATGCA}; > $s2 = q{AGTG-ACGC-CGCA}; > $m = $s1 ^ $s2; > push @pos, pos( $m ) while $m =~ m{(?=([^\x00]))}g; > $m =~ tr{[\x01-\xfe]}{\xff}; > $c1 = $s1 & $m; > @p1 = $c1 =~ m{([^\x00])}g; > $c2 = $s2 & $m; > @p2 = $c2 =~ m{([^\x00])}g; > printf qq{%-2s%-2s%6d\n}, $p1[ $_ ], $p2[ $_ ], $pos[ $_ ] > for 0 .. $#pos;' C G 1 G - 4 T C 8 A - 9 T C 10 knoppix@Microknoppix:~$

    I hope this is of interest.

    Cheers,

    JohnGG

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://882593]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2024-09-18 17:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    The PerlMonks site front end has:





    Results (25 votes). Check out past polls.

    Notices?
    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.