Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw

please help me parse genbank DNA file

by foxy (Initiate)
on Apr 11, 2002 at 14:09 UTC ( #158312=perlquestion: print w/replies, xml ) Need Help??

foxy has asked for the wisdom of the Perl Monks concerning the following question:

i am trying to write a script that will parse a genbank DNA file, but when i run the program it won't run. here's the code;
#! usr/local/bin/perl -w use strict; my $file; my $seq; my @annoation; print "please type in the name of a file\n"; $file = <STDIN>; get_dna (@annotation, $file,$seq); print $seq; sub get_dna { my ($line, $seq) = @_; my $in_sequence = 0; foreach my $line ($file) { if ($line =~ /^\/\/n/) { last; } elsif ($in_sequence) { $$seq .= $line; } elsif ($line =~ /^ORIGIN/) { $seq = 1; } else { push (@annotation, $line); } } $$seq =~ s/[\s0-9]//g; }

Edit kudra, 2002-04-15 Changed title

Replies are listed 'Best First'.
Re: please help me!!
by Fletch (Bishop) on Apr 11, 2002 at 14:21 UTC

    • Including what the error message is will get better replies.
    • perl -wc foo can be enlightening.
    • `@annotation' and `@annoation' are two different things
    • All through get_dna you're treating (or trying to treat at least) $seq as a scalar reference, but you don't pass a scalar reference when you call it
    • m{^//n} is easier to read than something with all the backwhacking of `/'s.
    • In your second elsif you set $seq when you probably meant to set $in_sequence
Re: please help me!!
by PrimeLord (Pilgrim) on Apr 11, 2002 at 14:27 UTC
    Well I see several problems with your script.

    First your missing a / in your shebang statement.

    When you declare the annotation array at the top of the script you have it mispelled. Your warnings should have pointed out to you that @annotation wasn't declared.

    Your also passing your variables into your get_dna sub wrong. @_ is going to return the variables in the same order you passed them so the scalar value of @annotation is getting passed into $line and $file is getting passed into $seq. Atleast I think you didn;t mean to do that.

    A better way to do this would be to just declare your @annotation array inside your get_dna sub and then if you need to access it outside of that sub again just return an array ref.


Re: please help me!!
by tachyon (Chancellor) on Apr 11, 2002 at 15:19 UTC

    You probably want something like this but as you don't describe what this code is supposed to do (and it is somewhat incomprehensible) I am kinda guessing

    #!/usr/local/bin/perl -w use strict; print "Please type in the name of a file\n"; my $file = <STDIN>; chomp $file; my ($annotation, $seq) = get_dna ($file); # convert $annotation array reference to a real array my @annotation = @$annotation; print "Annotation:\n\n", @annotation, "\n\n\n"; print "Sequence:\n\n", $seq, "\n"; sub get_dna { my ($file) = @_; my @annotation = (); my $seq = ''; my $in_sequence = 0; open FILE, $file or die "Can't open $file, Perl says $!\n"; while ( my $line = <FILE> ) { last if $line =~ m|^//\n|; # stop if line has just // on it $in_sequence = 1 if $line =~ m/^ORIGIN/; if ($in_sequence) { $seq .= $line; } else { push @annotation, $line; } } close FILE; # remove all spaces and digits from $seq. add \n to remove newline +s $seq =~ s/[\s0-9]//g; # return @annotation array as a scalar reference and scalar $seq return \@annotation, $seq; }




Re: please help me!!
by robsv (Curate) on Apr 11, 2002 at 18:15 UTC
    Have you looked into Bioperl? It will simplify parsing for you (especially for the sequence itself). Here's a program that gets the sequence and some other basic information:
    #!/usr/local/bin/perl -w use strict; use Bio::SeqIO; my $seqobj; print "please type in the name of a file\n"; my $file = <STDIN>; my $seqio = Bio::SeqIO->new (-format => 'GenBank', -file => $file); while ($seqobj = $seqio->next_seq()) { printf "Sequence: %s\n",$seqobj->seq; # I'm not sure what you need other than the # sequence - here's some examples: printf "Display ID: %s\n",$seqobj->display_id; printf "Description: %s\n",$seqobj->desc; printf "Division: %s\n",$seqobj->division; printf "Accession: %s\n",$seqobj->accession; }
    In your program, you're putting all of the non-sequence lines into @annotation. I'm not sure specifically which information you need (i.e. descriprtion, accession number, etc.), but those are all accessible through the "$seqobj" object. There's some examples in the code above; you'll find many more in the documentation.

    This method also has the advantage of being able to handle multiple GenBank records per file.

    This is just a tiny portion of the functions available with BioPerl - it will also parse BLAST files, perform alignments, etc. If you're interested, you can grab the latest release from CPAN or from BioPerl here. Hope this helps!

    - robsv
(smitz)Re: please help me!!
by smitz (Chaplain) on Apr 11, 2002 at 14:45 UTC
    Wow! I like it...
    Adding to what others have already pointed out:

    Are you forgetting to open the $file?
    You should probably chomp the trailing newline off of your filename before opening it. Not that important though, but it pays to be safe.
    Its only short, foxy, I'd start again, but this time plan it before turning on your computer. Always helps!


Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://158312]
Approved by rob_au
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (6)
As of 2022-05-24 15:15 GMT
Find Nodes?
    Voting Booth?
    Do you prefer to work remotely?

    Results (84 votes). Check out past polls.