Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

similar string matching

by Anonymous Monk
on Jul 05, 2004 at 07:25 UTC ( #371799=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have at my work a little tricky task. It is a string matching problem. I want to compare two amino acid sequences (here in Letter code, single char represents one amino acid) to find at which positions one sequence lies in the other! example?

MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA # 1. sequence TTTTTTTTFTTTTTTTTTTTT # 2. sequence result is: 2. lies at position 14 to 34 in 1.
simple? new examples
SUBSTITUTION AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA # 1. sequence TTTTFTTTATTTTTTDTTTTT # 2. sequence DELETION AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA # 1. sequence TTGTTTTTTTTTTTTTTTTTM # 2. sequence REVERSE TTTTTTTTTTTTTTTTTTTT # 1. sequence AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA # 2. sequence PERFECT MATCHING ONLY AT BEGIN AND END OF 2. SEQUENCE AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA # 1.sequence TTTTTTTTGGGNNGGGEEGGGEGGGGGGTTTTTTTTT # 2. Sequence
I tried with the regexp and the module String::Approx and aslice with the option 'minimal_distance', but I don't like the return values for this module.

Any hints how to do "the best way"?

Murcia

edit (broquaint): dropped <pre> tags aand added formatting

Comment on similar string matching
Select or Download Code
Re: similar string matching
by chiburashka on Jul 05, 2004 at 07:55 UTC
    from what i understand :
    $x = "MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA"; $y = "TTTTTTTTFTTTTTTTTTTTT"; if ($x =~ /$y/o) {print "the second amino acid is in the first";} if ($y =~ /$x/o) {print "the first amino acid is in the second";}
      That is the simplest case!
Re: similar string matching
by hv (Parson) on Jul 05, 2004 at 12:12 UTC

    I don't know about "best way" - I suspect String::Approx is about as good as we have right now. Here's an approach that may give some food for thought though:

    # assuming $left sequence is the longer one my $len = length($right); my @matches = map { (substr($left, $_, $len) ^ $right) =~ tr/\0/\0/ } 0 .. length($left) - $len;

    This takes advantage of the fact that the exclusive-or of two characters is zero iff the two characters are the same, to give a list of the number of matching characters when you line $right up against each position of $left.

    You may also want to take a look at the recent thread non-exact regexp matches, particularly the references to Text::Levenshtein and List::Compare - the author there appears to be trying to solve a harder problem, so you may find what you want falling out trivially from one of the proposed solutions there.

    Hugo

Re: similar string matching
by BrowserUk (Pope) on Jul 05, 2004 at 13:41 UTC

    Needs lots more testcases and no promises about speed, but this seems to work for the cases you outlined.

    Results: (underscores show the offset; lowercase left partial match; uppercase right partial match.

    P:\test>371799 Searching: MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA for : TTTTTTTTFTTTTTTTTTTTT MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA _____________TTTTTTTTFTTTTTTTTTTTT Searching: AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA for : TTTTFTTTATTTTTTDTTTTT AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA ______________ttttfttttttttttttttTT Searching: AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA for : TTGTTTTTTTTTTTTTTTTTM AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA _____________ttgttttttt-----TTTTTTTTTTM Searching: TTTTTTTTTTTTTTTTTTTT for : AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA ____________TTTTTTTTTTTTTTTTTTTT Searching: AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA for : TTTTTTTTGGGNNGGGEEGGGEGGGGGGTTTTTTTTT AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA ___________ttttttttgggggggggggggggGGGGGGTTTTTTTTT

    Code

Re: similar string matching
by educated_foo (Vicar) on Jul 05, 2004 at 14:05 UTC
    What you want is the unanchored version of String::Approx. In other words, what it does, except with zero cost for insertions at the start or end of the small sequence. The C code of S::A is a bit hairy, so modifying that might not be the best way. A google search for "local alignments" should turn up lots of good references -- bioinformatics is well-documented online if you know the right keywords.
Re: similar string matching
by dakkar (Hermit) on Jul 05, 2004 at 16:08 UTC

    The algorithms are quite well established, but they're not obvious. If you try to roll your own, you'll quite probably end up with very slow matching. Look at http://www.bioperl.org/ and you'll find all the information you need.

    The short version: use BLAST or derived methods.

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)

      Nice hint but what if I want to find single amino acid sequences? like cystine bridges? Murcia

        Ehm.. I'm not a biologist, I only know someone... so you almost lost me there.

        The BLAST and related algorithms do exactly what you asked for: they find all the possible matches between two sequences, ranking them by 'edit distance', i.e. the number of operations needed to obtain a perfect match.

        This is really all I know, for details ask an expert ;-)

        -- 
                dakkar - Mobilis in mobile
        

        Most of my code is tested...

        Perl is strongly typed, it just has very few types (Dan)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://371799]
Approved by Happy-the-monk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (7)
As of 2014-09-01 20:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (17 votes), past polls