Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re: Spltting Genbank File

by citromatik (Curate)
on May 29, 2009 at 07:38 UTC ( #766822=note: print w/ replies, xml ) Need Help??


in reply to Spltting Genbank File

A Genbank file is made of a series of Genbank records separated by a line consisting of "//"

You are reading your file line by line:  while ($line = <$IN>) and splitting each line in different lines (??) my @line = split(/\///,$line). Obviously, this is not what you want.

An easy way to solve this is to read the file record by record by assigning the value "//" to $/ (see perlvar) and then process the record

Another possibility is to read the file line by line, and update the $accession, $biotype and $sequence variables, accordingly. But you can be in a problem if a record doesn't have one of them

Update: Following the first approach:

use strict; use warnings; $/ = "//"; my $genfile = "c:\bemisia_coi.gb"; open my $ifh, "<", $genfile or die "cannot open $genfile: $!\n"; while (my $chunk = <$ifh>){ last if eof $ifh; my ($accession) = $chunk =~ /LOCUS\s*([A-Z]*\d+)/; my ($biotype) = $chunk =~ /BIOTYPE\s*([A-Z])/; my ($sequence) = $chunk =~ /ORIGIN\s*(.*)$/s; $sequence =~ s/\s|\d//g; my $outfile = "${accession}_${biotype}"; open my $ofh, '>' $outfile or die "cannot open $outfile: $!\n"; print $ofh, ">$accession\n$sequence"; close $ofh; }

Additional comments:

  • You don't need to lc your input
  • I've never seen a "biotype" field in a Genbank file (I can't find it in the specs either), but I may be wrong, so I conserved the code that tries to capture that field, instead you can see the molecule type in the LOCUS field (DNA|PROT)
  • A typical "locus name" is made of letters (2 or 3) and numbers, you only test for 8 letters
  • The last record of the file is always empty (because the file ends with a "//") line, that is why I use last if eof $ifh;
  • Don't forget to close each output file when it is not needed, a Genbank file could be very large and you can run out of open filehandles, they are a finite resource on every operating system
  • The "ORIGIN" sub-record is multi-line, so you should add the /s modifier to your regexp (see perlre)
  • If you want to scape the "//" in a regular expression, you should escape both slashes: /\/\//
  • $sequence =~ s/\s//g; $sequence =~ s/\d//g; can be written $sequence =~ s/\s|\d//g;

citromatik


Comment on Re: Spltting Genbank File
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://766822]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (7)
As of 2015-07-03 15:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (53 votes), past polls