Separate/extract only annotations from GenBank (gbk) file.

by (Initiate)
on Jun 29, 2011 at 12:25 UTC

Hi monks, I need a simple Linux shell command to extract only the annotations (non-sequence data) from a GenBank file. I don't need the sequences at all. For example:

ORIGIN
        1 tcagaataaa cagacaaccc acagaatgtg agaaaatatt gcaaattat gcatctgaca
       61 aaggtctaat acccagcaat ctataaggaa ctcaaacaaa ttagcaagaa aaaaaatccc
      121 atgaaaaggt agacaaatga catgaataga cacttctcaa aataagatat ataaatagcc
//

I want to delete everything between ORIGIN and //. I just need the annotations. Help, please.

Re: Separate/extract only annotations from GenBank (gbk) file.
by Neighbour (Friar) on Jun 29, 2011 at 12:44 UTC
    Which bit of your question pertains to Perl?
    Also, if you delete everything between ORIGIN and //, you will end up with nothing, so you could just skip the whole data processing and use cat /dev/null instead.

      Oh, I just posted the part I want to delete. There is a whole lot of other data in a GenBank file. OK, let me post a sample:

      LOCUS       NW_927708              12387 bp    DNA     linear   CON 25-OCT-2010
      DEFINITION  Homo sapiens chromosome 2 genomic contig, alternate assembly
                  Hs_Celera 211000035800763, whole genome shotgun sequence.
      ACCESSION   NW_927708
      VERSION     NW_927708.1  GI:88954435
      DBLINK      Project: 16116
      KEYWORDS    WGS.
      SOURCE      Homo sapiens (human)
        ORGANISM  Homo sapiens
                  Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                  Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
                  Catarrhini; Hominidae; Homo.
      REFERENCE   1  (bases 1 to 12387)
        AUTHORS   Istrail,S., Sutton,G.G., Florea,L., Halpern,A.L., Mobarry,C.M.,
                  Lippert,R., Walenz,B., Shatkay,H., Dew,I., Miller,J.R.
        TITLE     Whole-genome shotgun assembly and comparison of human genome
                  assemblies
        JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 101 (7), 1916-1921 (2004)
         PUBMED   14769938
      REFERENCE   2  (bases 1 to 12387)
        AUTHORS   Venter,J.C., Adams,M.D., Myers,E.W., Li,P.W., Mural,R.J.,
                  Sutton,G.G., Smith,H.O., Yandell,M., Evans,C.A., Holt,R.A.
        TITLE     The sequence of the human genome
        JOURNAL   Science 291 (5507), 1304-1351 (2001)
         PUBMED   11181995
      COMMENT     REFSEQ INFORMATION: Features on this sequence have been produced
                  for build 37 version 2 of the NCBI's genome annotation [see
                  documentation]. The reference sequence is identical to
                  CH471348.1.
                  Assembly Name: Hs_Celera
                  The DNA sequence was produced by Celera Genomics. It is included
                  in the NCBI RefSeq collection as an alternative assembly to the
                  one produced by the Genome Reference Consortium. The original
                  whole genome shotgun project has the project accession
                  AADB00000000.2.
      FEATURES             Location/Qualifiers
           source          1..12387
                           /organism="Homo sapiens"
                           /mol_type="genomic DNA"
                           /db_xref="taxon:9606"
                           /chromosome="2"
           gap             7139..7188
                           /estimated_length=50
      ORIGIN
              1 tcagaataaa cagacaaccc acagaatgtg agaaaatatt tgcaaattat gcatctgaca
             61 aaggtctaat acccagcaat ctataaggaa ctcaaacaaa ttagcaagaa aaaaaatccc
            121 atgaaaaggt agacaaatga catgaataga cacttctcaa aataagatat ataaatagcc
            181 acaaacatat gaaaaaataa tcaacatcac taatcatcag gtaaatgcaa attaaaacca
            241 taatgagata ccaccttatc ccagccagaa tggccattat tagaaagtcc aaaaacaata
            301 gatgttggca tggatgtggt gaaaagggaa gagtttacac tgcgggcagg aatgtaaatt
      //

      REGARDING PERL: OK, I just need the substitution pattern to remove this sequence information, leaving everything else in place.

Re: Separate/extract only annotations from GenBank (gbk) file.
by Anonymous Monk on Jun 29, 2011 at 12:56 UTC
      Yeah, thanks. I thought I would get something quick here; it is quite urgent. Anyway, I am searching. Thanks for the links.
Re: Separate/extract only annotations from GenBank (gbk) file.
by duelafn (Vicar) on Jul 01, 2011 at 17:54 UTC
      Hi Dean, the regex you gave is not working :( I have tried replacing ".." with other combinations of dots, but it is still not working. Can you suggest something?

