Just another Perl shrine PerlMonks

### supertree construction in perl

 on Oct 07, 2012 at 20:06 UTC Need Help??
zing has asked for the wisdom of the Perl Monks concerning the following question:

Hi all , I had this problem of creating a supertree by recursive Alfred Aho's algorithm.So I divided the problem into its the key concepts and dealt with them one by one. I'll first put up my code I built with help of people on this forum and then later on I'll explain my problem.So here's what I've build till now
```# perl code.pl data
# ___data____
#b c a
#a c d
#d e b
#This program read the triplets from file named "data" and returns the
#supertree.
#### NOTE ::: SuperTree part hasnt been incorporated yet.
use strict;
use warnings;
use Data::Dumper;
use Graph;
use Data::Dump qw/ pp /;
####READ IN THE INPUT DATA ########
my @triplets; # Get all the triplets
while (<>) {
push @triplets, [ split ];
}

#Make a deep copy of @triplets
my @triplet_deep_copy = map { [@\$_] } @triplets;

#####AUXILIARY GRAPH   G(L) #######
# In order to generate the G(L)  first of all extract first two column
+s of  @triplets into another matrix
my @auxiliary_edges=@triplets;
splice(@\$_, 2, 1)
foreach @auxiliary_edges;
print "----EDGE LIST TO BUILD AUXILIARY GRAPH-----\n";
print Dumper \@auxiliary_edges;

##### CONNECTED COMPONENTS ##########
my \$auxiliary_graph = Graph->new( undirected => 1 );

my @from;
my @to;
for (my \$p = 0; \$p <= 2; \$p++) {
\$from[\$p]=\$triplets[\$p][0];
}

for (my \$q = 0; \$q <= 2; \$q++) {
\$to[\$q]=\$triplets[\$q][1];
}

for (my \$r = 0; \$r <= 2; \$r++) {
}

my @subgraphs = \$auxiliary_graph->connected_components; # Subgraphs
my \$V = \$auxiliary_graph->vertices; # Number of taxa
my \$connected_components=scalar @subgraphs; #Get the number of connect
+ed components

###### FINDING THE TRIPLETS WHICH ARE SUBSET(OR INDUCED BY) OF EACH OF
+ THE CONNECTED COMPONENTS######
Main(@auxiliary_edges);
exit(0);
sub induced {
my \$trip = shift;
my @matches;
for my \$QT ( @_ ) {
for my \$triplet ( @\$trip ) {
my %seen;        # my %Pie;
undef @seen{@\$QT};
delete @seen{@\$triplet};
if ( keys( %seen ) <= ( @\$QT - @\$triplet ) ) {
push @matches, \$triplet;
}
} ## end for my \$triplet ( @\$trip )
} ## end for my \$QT ( @_ )
return @matches;
}## end sub induced

sub Main {
my \$tree = Graph->new( undirected => 1 );
for my \$one (@subgraphs) {
my @matches = induced( \@triplet_deep_copy, \$one );
print "\nTriplet induced by subgraph ", pp( \$one => { MATCHES =>
+ \@matches } ), "\n\n";
}
}
So this is what I have written till now. Now let me explain my problem.
```___INPUT(set of triplets)____
b c a
a c d
d e b

Set of species/vertices=a,b,c,d,e

Now build the auxiliary graph by selecting first two vertices of each of the triplets,i.e.
```b c
a c
d e
The auxiliary graph thus generated will be  a-c-b  d-e The number of connected components in this auxiliary graph (q)=2 (viz. a-c-b and d-e) The algorithm I need to implement is this:-
```TreeConstruct(S)
1. Let L be the set of species appear in S. Build G(L)
2. Let C1 , C2 , . . . , Cq be the set of connected components in G(L)
3. If q >1, then
• For i = 1, 2, . . . , q, let Si be the set of triplets in S labeled
+by the set
of leaves in Ci .
• Let Ti = TreeConstruct(Si )
• Let T be a tree formed by connecting all Ti with the same parent nod
+e.
Return T.
4. If q=1 and C1 contains exactly one leaf, return the leaf; else retu
+rn fail.
The progression will be like this:-
```1. Initially we have q=2 (a-c-b & d-e). So introduce an internal verte
+x (u) and make these connected components child of u.
u=> a-c-b;
d-e;
2. Select component 1 = a-c-b. Check all lines from INPUT which are a
+subset of this component1.First line of INPUT i.e. "b c a" is a subse
+t of component1.
3."b c a" now becomes the INPUT for the program and it is recursed aga
+in with this INPUT(Now for input "b c a" the auxiliary graph will be
+"b-c" & "a",i.e. two connected components,thus q=2 ...)
Final output(SUPERTREE) for the given input should be like this
```u  => u => d
=> e
=> u => a
=> u => b
=> c

TRIPLETS(input) and SUPERTREE(output) look like these

http://ars.sciencedirect.com/content/image/1-s2.0-S0166218X10000983-gr1.jpg

The picture link above has the exact triplets for my problem and the exact supertree(output) expected

Replies are listed 'Best First'.
Re: supertree construction in perl
by BrowserUk (Pope) on Oct 07, 2012 at 20:31 UTC

You keep asking this question, but until you can describe your problem in English -- rather than dissertationease -- you're unlikely to get a solution.

You have not yet described what "a triple" is or signifies; nor provided links to a viable description of "Alfred Aho's algorithm".

And you keep introducing specialist terms -- induced subgraphs, supertrees -- that will only make sense to speciologists.

To get help with this, you will need to step back from the specialisation and explain the problem in terms that programmers will understand.

I'm reasonably adept at reading between the BS that these types of papers are couched in, and I did gain some insights into the nature of the problem you are trying to tackle, but in terms of trying to turn the (so-called) algorithm descriptions in those papers into a practical implementation. I'm none the wiser.)

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
BrowserUK thanks for letting me know the trouble Im causing. So here's it is sir

You just need to read the first 4 pages.Its a very quick read (not a lengthy research paper) and the most relevant and elaborate explanation on the algorithm and the terms. Hope it helps

TRIPLETS(input) and SUPERTREE(output) look like these

http://ars.sciencedirect.com/content/image/1-s2.0-S0166218X10000983-gr1.jpg

The picture link above has the exact triplets for my problem and the exact supertree(output) expected

Hi BrowserUk, Did you find any solution to my problem. Please help me on this. I have dealt with all the aspects of the algorithm in isolation but am not able to put it all together(as is required by the algorithm). Help me on this.

Sorry, but as I said, I find the "algorithm description" in the paper you cited totally incomprehensible. It could be an alchemist's recipe or Archimedes' exclamation when stubbed his toe getting out of the bath for all the sense I can make of it. Constructed to impress rather than convey information. Further, I personally think that (much of) the literature on the subject is wrong, but I haven't reached the point of proof yet.

I did come across this description of Aho's algorithm which at a cursory glance seems to be infinitely clearer:

The following procedure is an implementation of the algorithm presented by Aho et al. (1981) with modifications for dealing with incompatibilities.
1. A subset of the orthologous-repeats table is created, in which only “relevant” rows (organisms) are considered (initially all rows, since all organisms are being considered). Within this subset of rows, only those columns in which at least two rows have a 1 and one row has a 0 are considered.
2. Using this subset of the original repeat occurrence table, a graph is created by iterating through the columns. If two rows both have a 1 in a given column, an edge of weight 1 is created between the two corresponding organisms. If an edge already exists between those two organisms, its weight is incremented by 1.
3. Multiple connected components are sought within the graph. If the graph contains a single connected component, weak edges must be eliminated. This is accomplished by removing edges, beginning with those of weight 1 and incrementally removing edges of greater weight, until multiple connected components arise.
4. Steps 1-3 are recursively applied to each connected component containing more than two organisms. The “relevant” rows in each run are the organisms within the connected component.

Consider the above example illustrated in Table 3B. The phylogenetic inference of column i is supported by column h, and column k is supported by column l. Thus, in the shared-repeat graph, edges (A, B) and (C, D) have weight 2, while the edge (A, C) has weight 1. Removing the minimum weight edge is akin to removing column j, which has the least support.

I haven't pursued it because I'd rather finish or abandon my own experiments before I get familiar with someone else's method; but maybe it will help you to bring your code together. Good luck.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: supertree construction in perl
by thomas895 (Chaplain) on Oct 08, 2012 at 07:24 UTC

As for your code, I think you need to choose better variable names. I have great difficulty reading code that just looks like random characters(people often ask me, "thomas, why do you use Perl?"). I hope this isn't supposed to end up as an obfuscation. You will eventually forget the meaning of those names, and then your code becomes a confusing mess. Avoid "a" and "b" aa variable names unless that is in context with sort.
DEBUG() seems to be an empty subroutine. Is that what you intended?

P.S.: Ignore the first part if you want. I'm tired, which causes my writeups to become long and ramble-y. I'll probably remove it tomorrow, after thinking "why would I write that?"

~Thomas~
confess( "I offer no guarantees on my code." );
Hi Thomas, Yes you are very true. So I have edited the code properly this time and introduces proper variable names and indentation.
```#This program read the triplets from file named "data" and returns the
#supertree.
# ___DATA(triplets)____
#b c a
#a c d
#d e b
#### NOTE ::: SuperTree part hasnt been incorporated yet.
use strict;
use warnings;
use Data::Dumper;
use Graph;
use Data::Dump qw/ pp /;
####READ IN THE INPUT DATA ########
my @triplets; # Get all the triplets
while (<>) {
push @triplets, [ split ];
}

#Make a deep copy of @triplets
my @triplet_deep_copy = map { [@\$_] } @triplets;

#####AUXILIARY GRAPH   G(L) #######
# In order to generate the G(L)  first of all extract first two column
+s of  @triplets into another matrix
my @auxiliary_edges=@triplets;
splice(@\$_, 2, 1)
foreach @auxiliary_edges;
print "----EDGE LIST TO BUILD AUXILIARY GRAPH-----\n";
print Dumper \@auxiliary_edges;

##### CONNECTED COMPONENTS ##########
my \$auxiliary_graph = Graph->new( undirected => 1 );

my @from;
my @to;
for (my \$p = 0; \$p <= 2; \$p++) {
\$from[\$p]=\$triplets[\$p][0];
}

for (my \$q = 0; \$q <= 2; \$q++) {
\$to[\$q]=\$triplets[\$q][1];
}

for (my \$r = 0; \$r <= 2; \$r++) {
}

my @subgraphs = \$auxiliary_graph->connected_components; # Subgraphs
my \$V = \$auxiliary_graph->vertices; # Number of taxa
my \$connected_components=scalar @subgraphs; #Get the number of connect
+ed components

###### FINDING THE TRIPLETS WHICH ARE SUBSET(OR INDUCED BY) OF EACH OF
+ THE CONNECTED COMPONENTS######
Main(@auxiliary_edges);
exit(0);
sub induced {
my \$trip = shift;
my @matches;
for my \$QT ( @_ ) {
for my \$triplet ( @\$trip ) {
my %seen;        # my %Pie;
undef @seen{@\$QT};
delete @seen{@\$triplet};
if ( keys( %seen ) <= ( @\$QT - @\$triplet ) ) {
push @matches, \$triplet;
}
} ## end for my \$triplet ( @\$trip )
} ## end for my \$QT ( @_ )
return @matches;
}## end sub induced

sub Main {
my \$tree = Graph->new( undirected => 1 );
for my \$one (@subgraphs) {
my @matches = induced( \@triplet_deep_copy, \$one );
print "\nTriplet induced by subgraph ", pp( \$one => { MATCHES =>
+ \@matches } ), "\n\n";
}
}
Re: supertree construction in perl
by pvaldes (Chaplain) on Oct 08, 2012 at 20:07 UTC

I wonder what you want to do with the supertree. It's the plot your only goal or you want to measure something or make some index?

How many nodes max do you need/expect to plot?

Valdes, Right now plotting is the only goal. Right now 50 nodes would be more than enough. The maximum I'd be going is 100 nodes.
Re: supertree construction in perl
by pvaldes (Chaplain) on Oct 09, 2012 at 09:00 UTC

If you want to play with the nodes, calculate bootstrap values etc, start with bioperl

But if you want only to plot a nice tree with few nodes and non interactive, you have at least two superb options outside perl. Each time I see the plot that you want to make, something in my mind is whispering at me: this is clearly either a graphviz job...

```use graphviz;
my \$plot = GraphViz->new(
layout   => 'dot',
directed =>  0,
rankdir  => 'TB'
);
# dot = directed plot as output,
# better than 'neato' for drawing trees

... or either a work for the ever elegant latex

use LaTeX::Driver;

(\usepackage{tikz}). Both can produce the plot that you want, after perl massage and pass the data in an accurate way. Of these, Latex is probably the only that can plot _exactly_ what you show in the picture, but don't underestimate graphviz, it is a solid and very powerful program

And if you have a big structure with many thousands or millions of nodes to plot, R is your man. Graphviz is not designed for this kind of situation.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://997708]
Approved by Corion
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (3)
As of 2018-01-20 05:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
How did you see in the new year?

Results (226 votes). Check out past polls.

Notices?