Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Parsing newick trees

by citromatik (Curate)
on Oct 17, 2008 at 15:27 UTC ( #717769=perlquestion: print w/replies, xml ) Need Help??

citromatik has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I'm trying to write a script that parses recursively simple newick binary trees. For example, having:

(A,(B,C))

I want to obtain a data structure that represents the binary tree, like:

$VAR1 = { 'LEFT' => { 'VALUE' => 'A' }, 'RIGHT' => { 'LEFT' => { 'VALUE' => 'B' }, 'RIGHT' => { 'VALUE' => 'C' } } };

I end up with this small script:

#!/usr/bin/perl use strict; use warnings; use Data::Dumper; my $leaf = qr/([A-Z]+)/; my $node = qr/(\(.+\))/; my $leaf_leaf = qr/^\($leaf,$leaf\)$/; my $node_leaf = qr/^\($node,$leaf\)$/; my $leaf_node = qr/^\($leaf,$node\)$/; my $node_node = qr/^\($node,$node\)$/; my $nt = "((A,(B,((C,(D,E)),(F,G)))),H)"; my $tree = newick ($nt,undef); print Dumper $tree; sub newick { my ($t,$tree) = @_; if ($t =~ /$leaf_node/){ my ($leaf,$node) = ($1,$2); ins_Leaf_Node ($tree,$leaf, newick ($node,$tree)); } elsif ($t =~ /$node_leaf/){ my ($node,$leaf) = ($1,$2); ins_Node_Leaf ($tree,newick ($node,$tree),$leaf); } elsif ($t =~ /$node_node/){ my ($node1,$node2) = ($1,$2); ins_Node_Node ($tree,newick ($node1,$tree), newick ($node2,$tree)) +; } elsif ($t =~ /$leaf_leaf/){ my ($leaf1,$leaf2) = ($1,$2); ins_Leaf_Leaf ($tree,$leaf1,$leaf2); } else { die "Unrecognized tree branch $t\n"; } } sub create_leaf { my ($value) = @_; return { 'VALUE' => $value, }; } sub ins_Node_Node { my ($tree,$left,$right) = @_; $tree->{LEFT} = $left; $tree->{RIGHT} = $right; return $tree; } sub ins_Node_Leaf { my ($tree,$left,$right) = @_; $tree->{LEFT} = $left; $tree->{RIGHT} = create_leaf ($right); return $tree; } sub ins_Leaf_Node { my ($tree,$left,$right) = @_; $tree->{LEFT} = create_leaf ($left); $tree->{RIGHT} = $right; return $tree; } sub ins_Leaf_Leaf { my ($tree,$left,$right) = @_; my $nodeR = create_leaf ($right); my $nodeL = create_leaf ($left); $tree->{LEFT} = $nodeL; $tree->{RIGHT} = $nodeR; return $tree; }

But I have the intuition that there must be a simpler way (or at least less verbose) to solve this.

Any further discussion on how to solve this problem would be welcome

citromatik

Replies are listed 'Best First'.
Re: Parsing newick trees
by kyle (Abbot) on Oct 17, 2008 at 15:46 UTC

    This is really quick and dirty. Basically, if we can assume that the format is very strict (values don't have any quoting, and commas and parentheses are always in sane places), you can just do some replacements on the input and then eval out a result.

    use Data::Dumper; my $tree_in = '(A,(B,C))'; #my $tree_in = "((A,(B,((C,(D,E)),(F,G)))),H)"; $tree_in =~ s/([^,()]+)/\{VALUE=>"$1"\}/g; $tree_in =~ s/\(/\{LEFT=>/g; $tree_in =~ s/,/,RIGHT=>/g; $tree_in =~ s/\)/\}/g; print $tree_in, "\n"; my $tref = eval $tree_in; print Dumper $tref; __END__ {LEFT=>{VALUE=>"A"},RIGHT=>{LEFT=>{VALUE=>"B"},RIGHT=>{VALUE=>"C"}}} $VAR1 = { 'LEFT' => { 'VALUE' => 'A' }, 'RIGHT' => { 'LEFT' => { 'VALUE' => 'B' }, 'RIGHT' => { 'VALUE' => 'C' } } };

    As I say, it's quick and dirty, there's no input validation, and the door's wide open to hostile inputs (think, "run arbitrary Perl code"). It wouldn't be too hard to put some rules on the front end to make some of these problems go away, but in the end you may be happier with a more complete parsing solution.

Re: Parsing newick trees
by jdporter (Chancellor) on Oct 17, 2008 at 18:47 UTC

    Cool - We get a chance to use the "seldom used" ForceContent=>1 and ContentKey=>'keyname' options of XMLin. :-)

    use XML::Simple; sub parse_newick { local $_ = shift; s<[(]>[<LEFT>]g; s<[,]>[</LEFT><RIGHT>]g; s<[)]>[</RIGHT>]g; XMLin "<opt>$_</opt>", ForceContent => 1, ContentKey => 'VALUE'; } use Data::Dumper; print Dumper parse_newick( '(A,(B,C))' );
    Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
Re: Parsing newick trees
by BrowserUk (Patriarch) on Oct 17, 2008 at 18:17 UTC

    A non-eval solution. This constructs a AoA. I'd define constants: use constant { LEFT => 0, RIGHT => 1 }; rather than use a HoH for this, but converting the code to produce a HoH is trivial. As coded, it does require an absence of spaces, so strip them first if they might be present:

    Updated with a slightly less verbose version

    #! perl -slw use strict; use Data::Dumper; sub newickFromString { my $in = shift; return $in unless $in =~ tr[(),][(),]; my( $depth, $p ) = 0; for ( 0 .. length( $in ) -1 ) { ## find split point my $c = substr $in, $_, 1; ++$depth if $c eq '('; --$depth if $c eq ')'; $p = $_ and last if $c eq ',' and $depth == 1; } return [ newickFromString( substr $in, 1, $p-1 ), newickFromString( substr $in, $p+1, -1 ) ]; } my $input = "((Alpha,(Bravo,((Charlie,(Delta,Echo)),(Foxtrot,Golf)))), +Hotel)"; my $newick = newickFromString( $input ); print Dumper $newick; __END__ c:\test>junk8 $VAR1 = [ [ 'Alpha', [ 'Bravo', [ [ 'Charlie', [ 'Delta', 'Echo' ] ], [ 'Foxtrot', 'Golf' ] ] ] ], 'Hotel' ];

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Rather than doing all the substr's, couldn't you just do
      $in = s/^\((.*)\)$/$1/; my @splat = split(/,/, $in); my ( $depth, $left ); while ( my $c = shift @splat ) { $left .= ( $left ? ',' : '' ) . $c; ++$depth if index($c, '(') >= $[; --$depth if index($c, ')') >= $[; last if $depth == 0; } my $right = join(',', @splat);
      and then recurse on $left and $right? (Maybe searching each time for the parentheses is as hard as substr'ing, though.)

      UPDATE: Per BrowserUK's response, the answer seems to be “Yes, but why?” I always thought that substr's were very expensive, but apparently not. Hurrah for premature optimisation!

Re: Parsing newick trees
by dorko (Prior) on Oct 17, 2008 at 15:43 UTC
    I've never heard of newick trees before, but CPAN seems to know about newick trees.

    Cheers,

    Brent

    -- Yeah, I'm a Delt.

      Well, thanks for the tip. I have done this search before. The modules you (and I) found in CPAN are part of the Bioperl suite, and surely because of that, the code you can find there is far more obfuscated (well, maybe this is not exactly the word) than mine.

      citromatik

        I think the phrase (instead of 'obfuscated') you want is 'general, debugged, maintained, comprehensive and written by someone else so I don't need to bother'.

        Probably Bio::Phylo::IO (the module you would actually use) doesn't use exactly the structure you currently use (or think you would like to use) for the internal representation of the trees. But it is likely that the structure the module uses is more generally useful and that the work you want to perform on the structure is already supported among the Bioperl modules.

        Update: It is true I've not used Bioperl modules. I looked at the docs for a few before I answered and on the face of it they seemed at reasonably documented. I was reacting to what appeared to me to be a "modules are too complicated" and NIH attitude in the OP's node. My apologies.


        Perl reduces RSI - it saves typing
Re: Parsing newick trees
by casiano (Pilgrim) on Oct 17, 2008 at 19:02 UTC
    Is not the structure you want but the method new of Parse::Eyapp::Node builds a similar structure:
    $ perl -wde 0 main::(-e:1): 0 DB<1> use Parse::Eyapp::Node DB<2> x Parse::Eyapp::Node->new('A(B,C)') 0 A=HASH(0x860b980) 'children' => ARRAY(0x861c874) 0 B=HASH(0x8628568) 'children' => ARRAY(0x83284ec) empty array 1 C=HASH(0x861c928) 'children' => ARRAY(0x852d8a8) empty array 1 B=HASH(0x8628568) -> REUSED_ADDRESS 2 C=HASH(0x861c928) -> REUSED_ADDRESS
Re: Parsing newick trees
by AnomalousMonk (Archbishop) on Oct 20, 2008 at 12:56 UTC
    Here's a regex-only approach packaged as a module. (I can post .t and demo files if anyone is interested.) Don't know if it will be of any use, but it was an interesting problem - many thanks.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://717769]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (6)
As of 2023-02-08 19:15 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    I prefer not to run the latest version of Perl because:







    Results (43 votes). Check out past polls.

    Notices?