citromatik has asked for the wisdom of the Perl Monks concerning the following question:
Hi all
I'm trying to write a script that parses recursively simple newick binary trees. For example, having:
(A,(B,C))
I want to obtain a data structure that represents the binary tree, like:
$VAR1 = {
'LEFT' => {
'VALUE' => 'A'
},
'RIGHT' => {
'LEFT' => {
'VALUE' => 'B'
},
'RIGHT' => {
'VALUE' => 'C'
}
}
};
I end up with this small script:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $leaf = qr/([A-Z]+)/;
my $node = qr/(\(.+\))/;
my $leaf_leaf = qr/^\($leaf,$leaf\)$/;
my $node_leaf = qr/^\($node,$leaf\)$/;
my $leaf_node = qr/^\($leaf,$node\)$/;
my $node_node = qr/^\($node,$node\)$/;
my $nt = "((A,(B,((C,(D,E)),(F,G)))),H)";
my $tree = newick ($nt,undef);
print Dumper $tree;
sub newick {
my ($t,$tree) = @_;
if ($t =~ /$leaf_node/){
my ($leaf,$node) = ($1,$2);
ins_Leaf_Node ($tree,$leaf, newick ($node,$tree));
} elsif ($t =~ /$node_leaf/){
my ($node,$leaf) = ($1,$2);
ins_Node_Leaf ($tree,newick ($node,$tree),$leaf);
} elsif ($t =~ /$node_node/){
my ($node1,$node2) = ($1,$2);
ins_Node_Node ($tree,newick ($node1,$tree), newick ($node2,$tree))
+;
} elsif ($t =~ /$leaf_leaf/){
my ($leaf1,$leaf2) = ($1,$2);
ins_Leaf_Leaf ($tree,$leaf1,$leaf2);
} else {
die "Unrecognized tree branch $t\n";
}
}
sub create_leaf {
my ($value) = @_;
return {
'VALUE' => $value,
};
}
sub ins_Node_Node {
my ($tree,$left,$right) = @_;
$tree->{LEFT} = $left;
$tree->{RIGHT} = $right;
return $tree;
}
sub ins_Node_Leaf {
my ($tree,$left,$right) = @_;
$tree->{LEFT} = $left;
$tree->{RIGHT} = create_leaf ($right);
return $tree;
}
sub ins_Leaf_Node {
my ($tree,$left,$right) = @_;
$tree->{LEFT} = create_leaf ($left);
$tree->{RIGHT} = $right;
return $tree;
}
sub ins_Leaf_Leaf {
my ($tree,$left,$right) = @_;
my $nodeR = create_leaf ($right);
my $nodeL = create_leaf ($left);
$tree->{LEFT} = $nodeL;
$tree->{RIGHT} = $nodeR;
return $tree;
}
But I have the intuition that there must be a simpler way (or at least less verbose) to solve this.
Any further discussion on how to solve this problem would be welcome
citromatik
Re: Parsing newick trees
by kyle (Abbot) on Oct 17, 2008 at 15:46 UTC
|
This is really quick and dirty. Basically, if we can assume that the format is very strict (values don't have any quoting, and commas and parentheses are always in sane places), you can just do some replacements on the input and then eval out a result.
use Data::Dumper;
my $tree_in = '(A,(B,C))';
#my $tree_in = "((A,(B,((C,(D,E)),(F,G)))),H)";
$tree_in =~ s/([^,()]+)/\{VALUE=>"$1"\}/g;
$tree_in =~ s/\(/\{LEFT=>/g;
$tree_in =~ s/,/,RIGHT=>/g;
$tree_in =~ s/\)/\}/g;
print $tree_in, "\n";
my $tref = eval $tree_in;
print Dumper $tref;
__END__
{LEFT=>{VALUE=>"A"},RIGHT=>{LEFT=>{VALUE=>"B"},RIGHT=>{VALUE=>"C"}}}
$VAR1 = {
'LEFT' => {
'VALUE' => 'A'
},
'RIGHT' => {
'LEFT' => {
'VALUE' => 'B'
},
'RIGHT' => {
'VALUE' => 'C'
}
}
};
As I say, it's quick and dirty, there's no input validation, and the door's wide open to hostile inputs (think, "run arbitrary Perl code"). It wouldn't be too hard to put some rules on the front end to make some of these problems go away, but in the end you may be happier with a more complete parsing solution. | [reply] [d/l] |
Re: Parsing newick trees
by jdporter (Chancellor) on Oct 17, 2008 at 18:47 UTC
|
Cool - We get a chance to use the
"seldom used" ForceContent=>1 and ContentKey=>'keyname' options of XMLin. :-)
use XML::Simple;
sub parse_newick {
local $_ = shift;
s<[(]>[<LEFT>]g; s<[,]>[</LEFT><RIGHT>]g; s<[)]>[</RIGHT>]g;
XMLin "<opt>$_</opt>", ForceContent => 1, ContentKey => 'VALUE';
}
use Data::Dumper;
print Dumper parse_newick( '(A,(B,C))' );
Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
| [reply] [d/l] [select] |
Re: Parsing newick trees
by BrowserUk (Patriarch) on Oct 17, 2008 at 18:17 UTC
|
A non-eval solution. This constructs a AoA. I'd define constants: use constant { LEFT => 0, RIGHT => 1 }; rather than use a HoH for this, but converting the code to produce a HoH is trivial. As coded, it does require an absence of spaces, so strip them first if they might be present:
Updated with a slightly less verbose version
#! perl -slw
use strict;
use Data::Dumper;
sub newickFromString {
my $in = shift;
return $in unless $in =~ tr[(),][(),];
my( $depth, $p ) = 0;
for ( 0 .. length( $in ) -1 ) { ## find split point
my $c = substr $in, $_, 1;
++$depth if $c eq '(';
--$depth if $c eq ')';
$p = $_ and last if $c eq ',' and $depth == 1;
}
return [
newickFromString( substr $in, 1, $p-1 ),
newickFromString( substr $in, $p+1, -1 )
];
}
my $input = "((Alpha,(Bravo,((Charlie,(Delta,Echo)),(Foxtrot,Golf)))),
+Hotel)";
my $newick = newickFromString( $input );
print Dumper $newick;
__END__
c:\test>junk8
$VAR1 = [
[
'Alpha',
[
'Bravo',
[
[
'Charlie',
[
'Delta',
'Echo'
]
],
[
'Foxtrot',
'Golf'
]
]
]
],
'Hotel'
];
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
Rather than doing all the substr's, couldn't you just do
$in = s/^\((.*)\)$/$1/;
my @splat = split(/,/, $in);
my ( $depth, $left );
while ( my $c = shift @splat ) {
$left .= ( $left ? ',' : '' ) . $c;
++$depth if index($c, '(') >= $[;
--$depth if index($c, ')') >= $[;
last if $depth == 0;
}
my $right = join(',', @splat);
and then recurse on $left and $right? (Maybe searching each time for the parentheses is as hard as substr'ing, though.)
UPDATE: Per BrowserUK's response, the answer seems to be “Yes, but why?” I always thought that substr's were very expensive, but apparently not. Hurrah for premature optimisation! | [reply] [d/l] [select] |
|
| [reply] |
Re: Parsing newick trees
by dorko (Prior) on Oct 17, 2008 at 15:43 UTC
|
I've never heard of newick trees before, but CPAN seems to know about newick trees.
Cheers,
Brent
| [reply] |
|
| [reply] |
|
Well, thanks for the tip. I have done this search before. The modules you (and I) found in CPAN are part of the Bioperl suite, and surely because of that, the code you can find there is far more obfuscated (well, maybe this is not exactly the word) than mine.
citromatik
| [reply] |
|
I think the phrase (instead of 'obfuscated') you want is 'general, debugged, maintained, comprehensive and written by someone else so I don't need to bother'.
Probably Bio::Phylo::IO (the module you would actually use) doesn't use exactly the structure you currently use (or think you would like to use) for the internal representation of the trees. But it is likely that the structure the module uses is more generally useful and that the work you want to perform on the structure is already supported among the Bioperl modules.
Update: It is true I've not used Bioperl modules. I looked at the docs for a few before I answered and on the face of it they seemed at reasonably documented. I was reacting to what appeared to me to be a "modules are too complicated" and NIH attitude in the OP's node. My apologies.
Perl reduces RSI - it saves typing
| [reply] |
|
|
Re: Parsing newick trees
by casiano (Pilgrim) on Oct 17, 2008 at 19:02 UTC
|
Is not the structure you want but the method new of Parse::Eyapp::Node builds a similar structure:
$ perl -wde 0
main::(-e:1): 0
DB<1> use Parse::Eyapp::Node
DB<2> x Parse::Eyapp::Node->new('A(B,C)')
0 A=HASH(0x860b980)
'children' => ARRAY(0x861c874)
0 B=HASH(0x8628568)
'children' => ARRAY(0x83284ec)
empty array
1 C=HASH(0x861c928)
'children' => ARRAY(0x852d8a8)
empty array
1 B=HASH(0x8628568)
-> REUSED_ADDRESS
2 C=HASH(0x861c928)
-> REUSED_ADDRESS
| [reply] [d/l] |
Re: Parsing newick trees
by AnomalousMonk (Archbishop) on Oct 20, 2008 at 12:56 UTC
|
Here's a regex-only approach packaged as a module. (I can post .t and demo files if anyone is interested.) Don't know if it will be of any use, but it was an interesting problem - many thanks.
| [reply] [d/l] |
|
|