PerlMonks  

Re^2: Spltting Genbank File

by perl_n00b (Acolyte)
on May 29, 2009 at 18:54 UTC ( [id://766924] )


in reply to Re: Spltting Genbank File
in thread Spltting Genbank File

First off, thanks for all the replies, guys! I shouldn't have rushed my post; I left out some much-needed information, so sorry about that.
Here is a sample entry from the GenBank file. The biotype appears in the /note qualifier of the source feature (/note="biotype: B"). I thought I would have to convert everything to lowercase because a couple of the entries weren't uniform and had uppercase letters; is this not needed?
LOCUS       EU099432                 832 bp    DNA     linear   INV 27-APR-2009
DEFINITION  Bemisia tabaci strain 05-06 cytochrome oxidase subunit I-like (COI)
            gene, partial sequence; mitochondrial.
ACCESSION   EU099432
VERSION     EU099432.1  GI:158726330
KEYWORDS    .
SOURCE      mitochondrion Bemisia tabaci
  ORGANISM  Bemisia tabaci
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Paraneoptera; Hemiptera; Sternorrhyncha; Aleyrodiformes;
            Aleyrodoidea; Aleyrodidae; Aleyrodinae; Bemisia.
REFERENCE   1  (bases 1 to 832)
  AUTHORS   Ma,W.H., Li,X.C., Dennehy,T.J., Lei,C.L., Wang,M., Degain,B.A. and
            Nichols,R.L.
  TITLE     Utility of MtCOI polymerase chain reaction-restriction fragment
            length polymorphism in differentiating between Q and B whitefly
            Bemisia tabaci biotypes
  JOURNAL   Insect Sci. 16 (2), 107-114 (2009)
REFERENCE   2  (bases 1 to 832)
  AUTHORS   Ma,W., Li,X., Degain,B. and Dennehy,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (13-AUG-2007) Entomology, The University of Arizona, 1140
            E. 4th St., Tucson, AZ 85721, USA
FEATURES             Location/Qualifiers
     source          1..832
                     /organism="Bemisia tabaci"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /strain="05-06"
                     /db_xref="taxon:7038"
                     /country="USA: Arizona"
                     /note="biotype: B"
     gene            <1..>832
                     /gene="COI"
     misc_feature    <1..>832
                     /gene="COI"
                     /note="similar to cytochrome oxidase subunit I"
ORIGIN
        1 atatgcatgg agtgattttt tggtccccca gaagtaatat ggcagattag tgcattggac
       61 ttgatttgtt tggtcatcca taaggcaaaa ggcacaatag ggcttcgaag gtttattgtt
      121 tgacgtcctc atatattcac agttggaata gatgtagata ctcgagctta tttcacttca
      181 gccactataa ttattgctgt tcccacagga attaaaattt ttagttggct tgctactttg
      241 ggtggaataa agtctaataa attaaggcct cttggccttt gatttacagg atttttattt
      301 ttatttacta taggtgggtt aactggaatt attcttggta attcttctgt agatgtgtgt
      361 ctgcatgaca cttattttgt tgttgcacat tttcattatg ttttatcaat aggaattatt
      421 tttgctattg taggaggagt tatctattga tttccactaa tcttaggttt aaccttaaat
      481 aattatagat tggtgtctca attttatatc atgtttattg gagtaaattt aacttttttt
      541 cctcaccatt ttcttggttt agggggaatg cctcgtcgat attcagatta tgctgattgc
      601 tatctagtat gaaataaaat ttcttctgcg ggaaggattc tgagtattat ttctgttatt
      661 tattttttat ttattgtttt agaatccttt cttcttctgc ggttagtaag atttaagctt
      721 ggtgtaagta ggcatctaga atgaaagatt aataaaccag ctcttaatca cagttttaaa
      781 gagttgtgtt taactttttt tttctaatat ggcagattag ggccccggga aa
//

Here is my updated code that now contains more errors lol.
# Genbank Splitter
# Takes Accession number and biotype as name for new FASTA file
# Contents of new FASTA file is corresponding sequence

use strict;
use warnings;

$/ = "//";

# Constants
my $genfile = 'c:\bemisia_coi.gb';
my $outfile = "$accession_$biotype";
my ($OUT, $IN);

print "Input: $genfile\n";

open my $ifh, "<", $genfile or die "cannot open $genfile: $!\n";
while (my $chunk = <$ifh>){
    last if eof $ifh;
    $chunk = lc $chunk;
    my ($accession) = $chunk =~ /locus\s*([a-z]{8});
    my ($biotype) = $chunk =~ /biotype: ([a-z]{1});
    my ($sequence) = $chunk =~ "/origin(\*+)\/\/\";
    $sequence =~ s/\s|\d//g;
    my $outfile = "${accession}_${biotype}";

    open my $ofh, '>' $outfile or die "cannot open $outfile: $!\n";

    print "Printing to $outfile\n";

    print $ofh, ">$accession $biotype\n^^\n$sequence";
    close $ofh;
}

And here are the errors I get
C:\Users\Owner>perl c:\gen_split2.pl
Bareword found where operator expected at c:\gen_split2.pl line 22, near "my ($biotype) = $chunk =~ /biotype"
  (Might be a runaway multi-line // string starting on line 21)
        (Do you need to predeclare my?)
Backslash found where operator expected at c:\gen_split2.pl line 22, near "biotype\"
Unrecognized escape \s passed through at c:\gen_split2.pl line 23.
Unrecognized escape \d passed through at c:\gen_split2.pl line 23.
Scalar found where operator expected at c:\gen_split2.pl line 25, near "my $outfile = "${accession}"
  (Might be a runaway multi-line "" string starting on line 23)
        (Do you need to predeclare my?)
Bareword found where operator expected at c:\gen_split2.pl line 25, near "${accession}_"
        (Missing operator before _?)
String found where operator expected at c:\gen_split2.pl line 27, near "open my $ofh, '>' $outfile or die ""
  (Might be a runaway multi-line "" string starting on line 25)
        (Missing semicolon on previous line?)
Backslash found where operator expected at c:\gen_split2.pl line 27, near "$!\"
        (Missing operator before \?)
String found where operator expected at c:\gen_split2.pl line 29, near "print ""
  (Might be a runaway multi-line "" string starting on line 27)
        (Missing semicolon on previous line?)
Bareword found where operator expected at c:\gen_split2.pl line 29, near "print "Printing"
        (Do you need to predeclare print?)
Backslash found where operator expected at c:\gen_split2.pl line 29, near "$outfile\"
        (Missing operator before \?)
String found where operator expected at c:\gen_split2.pl line 31, near "print $ofh, ""
  (Might be a runaway multi-line "" string starting on line 29)
        (Missing semicolon on previous line?)
Scalar found where operator expected at c:\gen_split2.pl line 31, near "$accession $biotype"
Global symbol "$accession_" requires explicit package name at c:\gen_split2.pl line 12.
Global symbol "$biotype" requires explicit package name at c:\gen_split2.pl line 12.
Global symbol "$biotype" requires explicit package name at c:\gen_split2.pl line 21.
syntax error at c:\gen_split2.pl line 22, near "my ($biotype) = $chunk =~ /biotype"
Global symbol "$chunk" requires explicit package name at c:\gen_split2.pl line 23.
Global symbol "$sequence" requires explicit package name at c:\gen_split2.pl line 23.
syntax error at c:\gen_split2.pl line 25, near "my $outfile = "${accession}"
Global symbol "$accession" requires explicit package name at c:\gen_split2.pl line 25.
Global symbol "$biotype" requires explicit package name at c:\gen_split2.pl line 25.
Global symbol "$accession" requires explicit package name at c:\gen_split2.pl line 31.
c:\gen_split2.pl has too many errors.

Replies are listed 'Best First'.
Re^3: Spltting Genbank File
by citromatik (Curate) on Jun 01, 2009 at 07:28 UTC

    You have several errors and misconceptions here:

    • my ($accession) = $chunk =~ /locus\s*([a-z]{8}); You haven't closed the regexp (the final slash is missing). It should be /locus\s*([a-z]{8})/;
    • /biotype: ([a-z]{1}); Same problem: the closing slash is missing.
    • open my $ofh, '>' $outfile or die "cannot open $outfile: $!\n"; When you use the 3-argument open, you must provide 3 arguments separated by commas (the comma after '>' is missing).
    • my $outfile = "$accession_$biotype"; You are trying to use the variables $accession and $biotype before declaring them. Bear in mind that both variables change with each record, so you should set $outfile (and open the file for writing) after reading each record, inside the while loop, as I showed in the code I posted. Also bear in mind what toolic told you about writing ${accession}_$biotype, so that Perl does not look for a variable named $accession_.
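Putting the fixes above together, a corrected version might look like the sketch below. It is only a sketch under some assumptions that are not in the original post: the sub names parse_record and split_genbank are made up for illustration, the record separator is taken to be "//" followed by a newline (so a stray "//" inside a record does not split it), matching is done with /i instead of lowercasing the whole record, and the "^^" line in the output is left out. It also captures the LOCUS name with \S+, because [a-z]{8} cannot match a name such as eu099432 that contains digits, and it prints to the filehandle without a comma after $ofh.

```perl
use strict;
use warnings;

# Pull accession, biotype and sequence out of one GenBank record.
# Returns an empty list if any of the three is missing.
sub parse_record {
    my ($chunk) = @_;

    # /i makes each match case-insensitive, so lc() on the record is not needed
    my ($accession) = $chunk =~ /^LOCUS\s+(\S+)/mi;
    my ($biotype)   = $chunk =~ /biotype:\s*(\w+)/i;
    my ($sequence)  = $chunk =~ /^ORIGIN(.*)/msi;
    return unless defined $accession && defined $biotype && defined $sequence;

    # keep letters only: drops whitespace, position numbers and the trailing //
    $sequence =~ s/[^a-z]//gi;
    return ($accession, $biotype, $sequence);
}

# Write one FASTA file per record, named <accession>_<biotype>.
sub split_genbank {
    my ($genfile) = @_;

    # assumption: every record ends with // on a line of its own
    local $/ = "//\n";

    open my $ifh, '<', $genfile or die "cannot open $genfile: $!\n";
    while (my $chunk = <$ifh>) {
        # skip chunks that do not look like a complete record
        my ($accession, $biotype, $sequence) = parse_record($chunk) or next;

        # braces stop Perl from reading this as the variable $accession_
        my $outfile = "${accession}_${biotype}";

        open my $ofh, '>', $outfile or die "cannot open $outfile: $!\n";
        print $ofh ">$accession $biotype\n$sequence\n";    # no comma after $ofh
        close $ofh or die "cannot close $outfile: $!\n";
        print "Printing to $outfile\n";
    }
    close $ifh;
}

# Only runs when a file name is given on the command line
split_genbank($ARGV[0]) if @ARGV;
```

It could be run as perl gen_split2.pl c:\bemisia_coi.gb; for the sample record that should give a file named EU099432_B. Keeping the parsing in its own sub makes it easy to try the regexes on a single record before touching the whole file.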

    citromatik
