Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Maximal Parsimony Problem

by neversaint (Deacon)
on Sep 03, 2008 at 13:54 UTC ( #708758=perlquestion: print w/replies, xml ) Need Help??

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,

Given a set of align sequences I want to compute the minimum number of substitution for each column. The example is below:
CGGCGGAAAACTGTCCTCCGTGC mouse CGACGGAACATTCTCCTCCGCGC rat CGACGGAATATTCCCCTCCGTGC human CGACGGAAGACTCTCCTCCGTGC chimp 00100000302011000000100 -> number of subst per site (max parsimony)
Is there a way to compute the value above? My current code does not do the job:
use Data::Dumper; use List::MoreUtils qw(uniq); # The related phylogenetic in Newick format tree is: my $tree = '(mouse,rat,(human,chimp))'; my $sites = [ 'CGGCGGAAAACTGTCCTCCGTGC', # mouse 'CGACGGAACATTCTCCTCCGCGC', # rat 'CGACGGAATATTCCCCTCCGTGC', # human 'CGACGGAAGACTCTCCTCCGTGC', # chimp ]; my @val = my_parsimony($sites); print Dumper \@val; sub my_parsimony { my $tfbs = shift; my $mlen = length($tfbs->[0]); my $sum_min = 0; my @mincol; foreach my $pos ( 0 .. $mlen-1 ) { my @colbp = (); foreach my $site ( @{$tfbs} ) { my $bp = substr($site,$pos,1); push @colbp, $bp; } # this heuristic seems to be faulty # Column 11 it predicts 1 instead of 2 my $min_mm = scalar( uniq(@colbp) ) - 1; push @mincol, $min_mm; } return @mincol; }
Actually I want to mimic Fitch's classical parsimony algorithm. But cannot fathom how to implement it.


Update:Problem resolved. Thanks everybody. psini+++.

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Maximal Parsimony Problem
by moritz (Cardinal) on Sep 03, 2008 at 14:05 UTC
    # this heuristic seems to be faulty my $min_mm = scalar( uniq(@colbp) ) - 1;

    To me this seems perfectly right. If you want to speed up things a bit more (this is only a micro optimization) you can use a hash in the first place:

    my %chars; foreach my $site ( @{$tfbs} ) { my $bp = substr($site,$pos,1); $chars{$bp} = 1; } # this heuristic seems to be faulty my $min_mm = keys %chars - 1; push @mincol, $min_mm;

    Instead I think that your sample output is wrong:

    # is: 00100000302011000000100 # should be: 00100000301011000000100

    If not, could you please explain how to get the 2 there?

      Hi Moritz,
      It is based on phylogenetic tree, the substitution comparison from lower nodes must be compared to higher level.

      E.g. For column 11, the bases for row 3 is compared to row 1. Same result will also be given if we compare row 3 to row 2. Both row 1 and row 2 has higher level than row 3 row 4 in phylogenetic tree.

      ---
      neversaint and everlastingly indebted.......
        I still don't quite understand. We have
        human: C chimp: T mouse: T rat: C

        So which comparisons exactly lead to to a number of 2? With which rules?

        As a poor perl programmer I'm not familiar with the phylogenetic tree, so if that's relevant here please explain in which relation these four examples appear in that tree.

        Have you changed your example?
        A C T G
        How does that become 3? And the column moritz quoted above used to bee:
        C T T T
Re: Maximal Parsimony Problem
by BioNrd (Monk) on Sep 03, 2008 at 16:40 UTC
    While I can not give you a direct answer to your problem. I can tell you that there are other options out there that you should take a look at.

    If you do want to work on the code yourself bioperl may be a good place to look. I myself have not worked with the tools there, but I get the sense others have done similar things to what you are trying to do. So there maybe something there that is helpful for you.

    There are also numerous programs that grappled with Fitch's algorithm (and other algorithms for that matter), that might avoid the trouble of reinventing the wheel from scratch. For some helpful links to programs you might want to check out the Workshop on Molecular Evolution.

    Sorry I can't help more than that.
    Bio.

    ---- Even a blind squirrel finds a nut sometimes.
Re: Maximal Parsimony Problem
by psini (Deacon) on Sep 03, 2008 at 15:26 UTC

    I think that the idea is that given the tree:

    $tree = '(mouse,rat,(human,chimp))'

    the values in column 11 should be treated like this: (T,C,(C,T)) and so the value of 2 results from T ne C ne (C,T); whereas for instance column 3 leads to (G,A,(A,A)) and G ne A eq (A,A).

    This is just a wild guess but, if I understand the logic right, (T,C,(A,G))=>3 and (T,C,(C,A))=>2. If it is so, the algorithm should not be so simple when more complex trees are involved.

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Re: Maximal Parsimony Problem
by SFLEX (Chaplain) on Sep 03, 2008 at 15:58 UTC
Re: Maximal Parsimony Problem
by psini (Deacon) on Sep 03, 2008 at 16:26 UTC

    Just a simple question: am I wrong or, from a cladistic point of view, any node of the tree should have exactly zero or two descendant, one of which is a terminal node?

    For if this is the case the problem becomes quite a trivial one, but your example doesn't match my hypothesis.

    Update: forget the second condition, it's bullshit.

    But I think that the first one holds, for you can have ((rat,mouse),(human,chimp)) or ((rat,(mouse,(human,chimp)) or (mouse,(rat,(human,chimp)) but it make no sense (rat,mouse,(human,chimp)), particularly if you are checking the variation score independently on each base. For the total score to make sense you should assume one more common ancestor or the score for base X will not be consistent with that of base Y.

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

      Dear psini,

      Thank you so much for your thoughtful replies.

      You are right, that the phylogenetic tree is always a binary tree.
      I guess the problem I gave you is the tree "representation". Newick's format gives a general-ordered tree.
      And it seems that we always can convert this general-ordered tree into a binary tree.


      ---
      neversaint and everlastingly indebted.......
Re: Maximal Parsimony Problem
by dwm042 (Priest) on Sep 03, 2008 at 19:09 UTC
    neversaint, the problem here is that you're giving us genetic data and you're not giving us the genetic tree. Fitch's algorithm solves the small parsimony problem. The small parsimony problem requires a genetic tree as input. With the data we have been given in this node, you're asking us to solve the large parsimony problem, which is much much harder, and is also NP complete.

      Dear dwm042,

      You are right. It is a small parsimony problem.
      BTW, from the start I have already included the "genetic tree input" in my sample code.

      ---
      neversaint and everlastingly indebted.......
Re: Maximal Parsimony Problem
by psini (Deacon) on Sep 03, 2008 at 20:55 UTC

    Assuming (as in my previous post) that the phylogenetic tree is a binary tree, the following code should work.

    use strict; use warnings; use Data::Dumper; my $sites={'mouse'=>'CGGCGGAAAACTGTCCTCCGTGC', 'rat'=> 'CGACGGAACATTCTCCTCCGCGC', 'human'=>'CGACGGAATATTCCCCTCCGTGC', 'chimp'=>'CGACGGAAGACTCTCCTCCGTGC'}; my $tree = [['mouse','rat'],['human','chimp']]; # Sanity checks my $len=-1; treeCheck($tree,$sites,\$len); my @val; $val[$_]=parsimony(extractLocus($tree,$sites,$_))->[0] for (0..$len-1) +; print join('',@val)."\n"; sub treeCheck { my ($subTree,$sites,$len)=@_; if (ref($subTree) eq 'ARRAY') { die "Not a binary tree" if @$subTree != 2; treeCheck($subTree->[0],$sites,$len); treeCheck($subTree->[1],$sites,$len); } else { die "Invalid char" if $sites->{$subTree} !~ /^[ATCG]+$/; if ($$len<0) { $$len=length($sites->{$subTree}); } else { die "Invalid length" if length($sites->{$subTree}) != $$len; } } } sub extractLocus { my ($subTree,$sites,$index)=@_; if (ref($subTree) eq 'ARRAY') { return [extractLocus($subTree->[0],$sites,$index),extractLocus($su +bTree->[1],$sites,$index)]; } else { my $base=substr($sites->{$subTree},$index,1); $base =~ tr/ATCG/1248/; return $base; } } sub parsimony { my $locusTree=shift; if (ref($locusTree) eq 'ARRAY') { my $left=parsimony($locusTree->[0]); my $right=parsimony($locusTree->[1]); my $count=$left->[0]+$right->[0]; $count++ if ($left->[1] & $right->[1])==0; my $bases=$left->[1] | $right->[1]; return [$count,$bases] } else { return[0,$locusTree]; } }

    I have significantly changed the input data structures, so $tree is "really" a tree (an AoA..oA) containing the names of the species and $sites is an hash using the species as key. It works with your original data but changing the structure of the tree gives (as to be expected) different results:

    With $tree = [['mouse','rat'],['human','chimp']]; gives 00100000302011000000100 With $tree = ['mouse',['rat',['human','chimp']]]; gives 00100000301011000000100

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Re: Maximal Parsimony Problem
by bioinformatics (Friar) on Sep 03, 2008 at 20:19 UTC
    So, if you use the fixes from the posts above, you should be able to output a minimum change in each base from the transcription factor binding site. In the top up phase it looks like you are clustering the individual bases to determine the sequence variability and relations in the sequence itself. By recursively clustering these, you should be able to output a second value that serves as a "score" for the sequence variation, which then allows you to cluster again in the top down version to determine the genotypic relation of the various species. A simple way to cluster them would be to use a sub to recursively find the difference of the "values" your getting for each base associated with each species, pushing them into an array, and them using them to create a second value which is then used to determine the larger phylogenetic tree. It would be more efficient to use a matrix to sort and output those values so that the changes would be more fluid and dynamic and use less memory, etc. Bioperl has a couple modules that are already made to determine the distances in the various nodes, so you could integrate these into your program to get the larger phylogenetic output you desire.

    Check out Module:Bio::Tree::Node, as well as Module:Bio::Align::DNAStatistics, which if I recall can do something of what your trying to do above. Good luck!

    Bioinformatics
Re: Maximal Parsimony Problem
by GrandFather (Sage) on Sep 03, 2008 at 21:57 UTC

    I have no understanding of the problem domain so the following code is not a solution to your actual problem, but is an alternative way of finding changes between adjacent strings in a list and counting those changes by column.

    use strict; use warnings; my $sites = [ 'CGGCGGAAAACTGTCCTCCGTGC', # mouse 'CGACGGAACATTCTCCTCCGCGC', # rat 'CGACGGAATATTCCCCTCCGTGC', # human 'CGACGGAAGACTCTCCTCCGTGC', # chimp ]; my @val = my_parsimony ($sites); print @val; sub my_parsimony { my ($sites) = @_; my $lastSite = $sites->[0]; my @mincol = (0) x length $lastSite; for my $nextSite (@$sites) { my $changes = $lastSite ^ $nextSite; $lastSite = $nextSite; next unless $changes =~ s/[^\0]/+/g; my $pos = 0; ++$mincol[$pos++] while -1 != ($pos = index $changes, '+', $po +s); } return @mincol; }

    Prints:

    00100000302012000000200

    Perl reduces RSI - it saves typing
Re: Maximal Parsimony Problem
by SFLEX (Chaplain) on Sep 03, 2008 at 16:58 UTC
    What I think is happening is since uniq() will return a 1 if all in that column is the same, but what i think the bug is.
    If all in that column is the same you want it to return a zero (scalar( uniq(@colbp) ) - 1;), but the problem with that is if there all different for that column with 4 rows you would want a 4 returned?


    edited spelling.. =p

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://708758]
Approved by lamp
Front-paged by monkfan
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (8)
As of 2020-02-29 10:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What numbers are you going to focus on primarily in 2020?










    Results (128 votes). Check out past polls.

    Notices?