please help me parse genbank DNA file

foxy has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: please help me!! by Fletch (Bishop) on Apr 11, 2002 at 14:21 UTC
Including what the error message is will get better replies. `perl -wc foo` can be enlightening. `@annotation' and `@annoation' are two different things All through `get_dna` you're treating (or trying to treat at least) `$seq` as a scalar reference, but you don't pass a scalar reference when you call it `m{^//n}` is easier to read than something with all the backwhacking of `/'s. In your second `elsif` you set `$seq` when you probably meant to set `$in_sequence`	[reply] [d/l] [select]
Re: please help me!! by PrimeLord (Pilgrim) on Apr 11, 2002 at 14:27 UTC
Well I see several problems with your script. First your missing a / in your shebang statement. When you declare the annotation array at the top of the script you have it mispelled. Your warnings should have pointed out to you that @annotation wasn't declared. Your also passing your variables into your get_dna sub wrong. @_ is going to return the variables in the same order you passed them so the scalar value of @annotation is getting passed into $line and $file is getting passed into $seq. Atleast I think you didn;t mean to do that. A better way to do this would be to just declare your @annotation array inside your get_dna sub and then if you need to access it outside of that sub again just return an array ref. HTH. -Prime	[reply]
Re: please help me!! by tachyon (Chancellor) on Apr 11, 2002 at 15:19 UTC
You probably want something like this but as you don't describe what this code is supposed to do (and it is somewhat incomprehensible) I am kinda guessing #!/usr/local/bin/perl -w use strict; print "Please type in the name of a file\n"; my $file = <STDIN>; chomp $file; my ($annotation, $seq) = get_dna ($file); # convert $annotation array reference to a real array my @annotation = @$annotation; print "Annotation:\n\n", @annotation, "\n\n\n"; print "Sequence:\n\n", $seq, "\n"; sub get_dna { my ($file) = @_; my @annotation = (); my $seq = ''; my $in_sequence = 0; open FILE, $file or die "Can't open $file, Perl says $!\n"; while ( my $line = <FILE> ) { last if $line =~ m\|^//\n\|; # stop if line has just // on it $in_sequence = 1 if $line =~ m/^ORIGIN/; if ($in_sequence) { $seq .= $line; } else { push @annotation, $line; } } close FILE; # remove all spaces and digits from $seq. add \n to remove newline +s $seq =~ s/[\s0-9]//g; # return @annotation array as a scalar reference and scalar $seq return \@annotation, $seq; } [download] cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply] [d/l]
Re: please help me!! by robsv (Curate) on Apr 11, 2002 at 18:15 UTC
Have you looked into Bioperl? It will simplify parsing for you (especially for the sequence itself). Here's a program that gets the sequence and some other basic information: #!/usr/local/bin/perl -w use strict; use Bio::SeqIO; my $seqobj; print "please type in the name of a file\n"; my $file = <STDIN>; my $seqio = Bio::SeqIO->new (-format => 'GenBank', -file => $file); while ($seqobj = $seqio->next_seq()) { printf "Sequence: %s\n",$seqobj->seq; # I'm not sure what you need other than the # sequence - here's some examples: printf "Display ID: %s\n",$seqobj->display_id; printf "Description: %s\n",$seqobj->desc; printf "Division: %s\n",$seqobj->division; printf "Accession: %s\n",$seqobj->accession; } [download] In your program, you're putting all of the non-sequence lines into @annotation. I'm not sure specifically which information you need (i.e. descriprtion, accession number, etc.), but those are all accessible through the "$seqobj" object. There's some examples in the code above; you'll find many more in the documentation. This method also has the advantage of being able to handle multiple GenBank records per file. This is just a tiny portion of the functions available with BioPerl - it will also parse BLAST files, perform alignments, etc. If you're interested, you can grab the latest release from CPAN or from BioPerl here. Hope this helps! - robsv	[reply] [d/l]
(smitz)Re: please help me!! by smitz (Chaplain) on Apr 11, 2002 at 14:45 UTC
Wow! I like it... Adding to what others have already pointed out: Are you forgetting to open the `$file`? You should probably `chomp` the trailing newline off of your filename before opening it. Not that important though, but it pays to be safe. Its only short, foxy, I'd start again, but this time plan it before turning on your computer. Always helps! SMiTZ	[reply] [d/l] [select]


Do you know where your variables are?
	PerlMonks