Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

similar string matching

by Anonymous Monk
on Jul 05, 2004 at 07:25 UTC ( #371799=perlquestion: print w/ replies, xml ) Need Help??
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have at my work a little tricky task. It is a string matching problem. I want to compare two amino acid sequences (here in Letter code, single char represents one amino acid) to find at which positions one sequence lies in the other! example?

MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA # 1. sequence TTTTTTTTFTTTTTTTTTTTT # 2. sequence result is: 2. lies at position 14 to 34 in 1.
simple? new examples
SUBSTITUTION AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA # 1. sequence TTTTFTTTATTTTTTDTTTTT # 2. sequence DELETION AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA # 1. sequence TTGTTTTTTTTTTTTTTTTTM # 2. sequence REVERSE TTTTTTTTTTTTTTTTTTTT # 1. sequence AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA # 2. sequence PERFECT MATCHING ONLY AT BEGIN AND END OF 2. SEQUENCE AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA # 1.sequence TTTTTTTTGGGNNGGGEEGGGEGGGGGGTTTTTTTTT # 2. Sequence
I tried with the regexp and the module String::Approx and aslice with the option 'minimal_distance', but I don't like the return values for this module.

Any hints how to do "the best way"?

Murcia

edit (broquaint): dropped <pre> tags aand added formatting

Comment on similar string matching
Select or Download Code
Re: similar string matching
by chiburashka on Jul 05, 2004 at 07:55 UTC
    from what i understand :
    $x = "MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA"; $y = "TTTTTTTTFTTTTTTTTTTTT"; if ($x =~ /$y/o) {print "the second amino acid is in the first";} if ($y =~ /$x/o) {print "the first amino acid is in the second";}
      That is the simplest case!
Re: similar string matching
by hv (Parson) on Jul 05, 2004 at 12:12 UTC

    I don't know about "best way" - I suspect String::Approx is about as good as we have right now. Here's an approach that may give some food for thought though:

    # assuming $left sequence is the longer one my $len = length($right); my @matches = map { (substr($left, $_, $len) ^ $right) =~ tr/\0/\0/ } 0 .. length($left) - $len;

    This takes advantage of the fact that the exclusive-or of two characters is zero iff the two characters are the same, to give a list of the number of matching characters when you line $right up against each position of $left.

    You may also want to take a look at the recent thread non-exact regexp matches, particularly the references to Text::Levenshtein and List::Compare - the author there appears to be trying to solve a harder problem, so you may find what you want falling out trivially from one of the proposed solutions there.

    Hugo

Re: similar string matching
by BrowserUk (Pope) on Jul 05, 2004 at 13:41 UTC

    Needs lots more testcases and no promises about speed, but this seems to work for the cases you outlined.

    Results: (underscores show the offset; lowercase left partial match; uppercase right partial match.

    P:\test>371799 Searching: MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA for : TTTTTTTTFTTTTTTTTTTTT MAAGAAAAFAAAATTTTTTTTFTTTTTTTTTTTTAAAAEAAAARAAAAAA _____________TTTTTTTTFTTTTTTTTTTTT Searching: AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA for : TTTTFTTTATTTTTTDTTTTT AAAAEAAAARGAAATTTTFTTTTTTTTTTTTTTTTAAAAAAAAILVAAAAAAAA ______________ttttfttttttttttttttTT Searching: AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA for : TTGTTTTTTTTTTTTTTTTTM AAAAAAAAAAAAATTGTTTTTTTXXXXXTTTTTTTTTTMAAAAAAAAAAAAAAAA _____________ttgttttttt-----TTTTTTTTTTM Searching: TTTTTTTTTTTTTTTTTTTT for : AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA AAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAAAA ____________TTTTTTTTTTTTTTTTTTTT Searching: AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA for : TTTTTTTTGGGNNGGGEEGGGEGGGGGGTTTTTTTTT AAAAAAAAAAATTTTTTTTGGGGGGGGGGGGGGGGGGGGGTTTTTTTTTAAAAAAA ___________ttttttttgggggggggggggggGGGGGGTTTTTTTTT

    Code

Re: similar string matching
by educated_foo (Vicar) on Jul 05, 2004 at 14:05 UTC
    What you want is the unanchored version of String::Approx. In other words, what it does, except with zero cost for insertions at the start or end of the small sequence. The C code of S::A is a bit hairy, so modifying that might not be the best way. A google search for "local alignments" should turn up lots of good references -- bioinformatics is well-documented online if you know the right keywords.
Re: similar string matching
by dakkar (Hermit) on Jul 05, 2004 at 16:08 UTC

    The algorithms are quite well established, but they're not obvious. If you try to roll your own, you'll quite probably end up with very slow matching. Look at http://www.bioperl.org/ and you'll find all the information you need.

    The short version: use BLAST or derived methods.

    -- 
            dakkar - Mobilis in mobile
    

    Most of my code is tested...

    Perl is strongly typed, it just has very few types (Dan)

      Nice hint but what if I want to find single amino acid sequences? like cystine bridges? Murcia

        Ehm.. I'm not a biologist, I only know someone... so you almost lost me there.

        The BLAST and related algorithms do exactly what you asked for: they find all the possible matches between two sequences, ranking them by 'edit distance', i.e. the number of operations needed to obtain a perfect match.

        This is really all I know, for details ask an expert ;-)

        -- 
                dakkar - Mobilis in mobile
        

        Most of my code is tested...

        Perl is strongly typed, it just has very few types (Dan)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://371799]
Approved by Happy-the-monk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (8)
As of 2014-09-21 17:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (173 votes), past polls