Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling

Bioperl or ncbi: parsing refseq files

by roibrodo (Sexton)
on Jun 14, 2010 at 13:51 UTC ( #844602=perlquestion: print w/replies, xml ) Need Help??
roibrodo has asked for the wisdom of the Perl Monks concerning the following question:


I have downloaded some large RefSeq files (Genbank format, i.e. both sequence and annotations; e.g.

I want to get a part of that sequence along with its annotations - exactly as I do online using the "Change region shown" window on the right - then dump this as a (smaller) Genbank file. Very simple... but I couldn't figure out how to do that.



one way of achieving this is using seqret from emboss: e.g. seqret -sformat1 gb -sequence "in_seq.genbank" -sbegin 10000 -send 50000 -feature -outseq out_seq.genbank.genbank -osformat2 gb

Thanks to vinnana from the bioperl community!

Replies are listed 'Best First'.
Re: Bioperl or ncbi: parsing refseq files
by BioLion (Curate) on Jun 14, 2010 at 15:45 UTC

    Hi, your flat genbank files can be handled using the BioPerl suite. See this HowTo for a very detailed guide to IO, which includes genbank parsers (and a lot of others).

    You'll be using the Bio::SeqIO module ( See Yes, even you can use CPAN for a guide on getting modules installed, or ignore if i am patronising you... ) to read in the files, test each sequence feature if it is in your region of interest, and if it is, write it out to a fresh (smaller) file. You can do all this on-the-fly, so your large file shouldn't trouble memory problems...

    Have a read, have a go, and get back to the Monastery (with examples and code) if you are having problems. Hope this helps!

    Update: Typos...

    Just a something something...

      Thanks for the reply.

      I'm not sure I got it right. Even if the feature is not fully in the region of interest, but only partially in it, I want to "truncate" it and take it. I also want the sequence (that appears after all the features) to be outputted. Basically, I want to do exactly what the "change region shown" does on the online version of NCBI.

      I would appreciate a more verbose example, if possible, since this are my first steps with BioPerl.

Re: Bioperl or ncbi: parsing refseq files
by david_lyon (Sexton) on Jun 14, 2010 at 18:33 UTC

    If I was doing this which I have done a million times you want to use one if the emboss packages which can parse out the features and sequences that you are interested in or spit out the features in a table format so you can upload them into a db or parse them on the fly. eg:
    I would avoid using bioperl for this.

      Thanks for the reply.

      I read some of the documentation but did not find an appropriate example.

      Would you be kind enough to let some hints on how to get from a GenBank file to a smaller GenBank file containing some range of the sequence? It will really help getting stated with emboss (I think examples are a great way to start experimenting)


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://844602]
Approved by BioLion
Front-paged by planetscape
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (4)
As of 2018-01-22 07:18 GMT
Find Nodes?
    Voting Booth?
    How did you see in the new year?

    Results (232 votes). Check out past polls.