http://www.perlmonks.org?node_id=371877


in reply to similar string matching

Needs lots more testcases and no promises about speed, but this seems to work for the cases you outlined.

Results: (underscores show the offset; lowercase left partial match; uppercase right partial match.

P:\test>371799 Searching: MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA for : TTTTTTTTFTTTTTTTTTTTT MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA _____________TTTTTTTTFTTTTTTTTTTTT Searching: AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA for : TTTTFTTTATTTTTTDTTTTT AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA ______________ttttfttttttttttttttTT Searching: AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA for : TTGTTTTTTTTTTTTTTTTTM AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA _____________ttgttttttt-----TTTTTTTTTTM Searching: TTTTTTTTTTTTTTTTTTTT for : AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA ____________TTTTTTTTTTTTTTTTTTTT Searching: AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA for : TTTTTTTTGGGNNGGGEEGGGEGGGGGGTTTTTTTTT AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA ___________ttttttttgggggggggggggggGGGGGGTTTTTTTTT

Code

#! perl -slw use strict; use List::Util qw[ reduce ]; # PAIR: ##Removed after testing while( my $haystack = <DATA> ) { $haystack =~ s[^\s*(\S+?)\b.*$][$1]s; ## Strips comments ( my $needle = <DATA> ) =~ s[^\s*(\S+?)\b.*$][$1]s; print "\nSearching: $haystack\nfor\t: $needle\n"; ( $haystack, $needle ) = ( $needle, $haystack ) if length( $haystack ) < length( $needle ); ## Reversed ## Try for a straight match while( $haystack =~ m[$needle]g ) { printf "$haystack\n%s%-*s\n\n" , '_' x (pos( $haystack ) - length( $needle )) , length( $haystack ) , $needle; ## next PAIR; ## Removed after testing } ## No joy, so try for a left-right partial. my $re_left = reduce{ "$a$b?" } split '', $needle; my $re_right = reduce{ "$a?$b" } split '', $needle; my $re = qr[(($re_left)(.*?)($re_right))]; while( $haystack =~ m[$re]g ) { printf "$haystack\n%s%-*s%s%*s\n\n" , '_' x ( pos( $haystack ) - length( $1 ) ) , length( $2 ) , lc($2) , '-' x length( $3 ) , length( $4 ) , $4; # next PAIR; ## Removed after testing } } __DATA__ MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA # 1. sequen +ce TTTTTTTTFTTTTTTTTTTTT # 2. sequen +ce AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA # 1. sequen +ce TTTTFTTTATTTTTTDTTTTT # 2. sequen +ce AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA # 1. sequen +ce TTGTTTTTTTTTTTTTTTTTM # 2. sequen +ce TTTTTTTTTTTTTTTTTTTT # 1. sequen +ce AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA # 2. sequen +ce AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA # 1.sequenc +e TTTTTTTTGGGNNGGGEEGGGEGGGGGGTTTTTTTTT # 2. Sequen +ce

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon