
Re: Debugging Bioperl warnings for Genebank files that are missing info

by biohisham (Priest)
on Oct 24, 2014 at 00:43 UTC ( #1104822=note )


in reply to Debugging Bioperl warnings for Genebank files that are missing info

Yes, GenBank files can be problematic at times, either because they are incomplete or because BioPerl gets cranky. You have not provided code that I can test, but you may consider the following workaround: work with the FASTA files in conjunction with the feature table provided in the GenBank files.

  • convert the genbank to gff through (genbank2gff3.pl)
  • convert the genbank files to fasta or download the fasta equivalent
  • parse the gff files and extract the CDS features with their coordinate information

    perl -F'\t' -lane 'if($F[2] eq "CDS"){print}' GCA_000153565.1_ASM15356v1_genomic.gbff.gff | cut -f3,4,5,7 > GCA_000153565.1_ASM15356v1_genomic.coordinates.txt

  • extract the subsequences from the fasta files using the coordinates saved in GCA_000153565.1_ASM15356v1_genomic.coordinates.txt

For the last item you may use BioPerl: read the FASTA with Bio::SeqIO and call $seq->subseq($start, $stop), but make sure you take the reverse complement of the sequences on the negative strand.
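The extraction step can be sketched in plain Perl as below. This is only a minimal sketch under a few assumptions: `revcom` and `extract_cds` are names made up for illustration, the coordinates are 1-based and inclusive (as in GFF3), and the sequence is already loaded as a single string. With BioPerl, `$seq->trunc($start, $stop)->revcom` should do the same job for minus-strand features.

```perl
use strict;
use warnings;

# Reverse-complement a DNA string (handles upper/lower case and N).
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;              # scalar context: reverses the string
    $rc =~ tr/ACGTNacgtn/TGCANtgcan/;   # complement each base
    return $rc;
}

# Extract one CDS from $genome using 1-based inclusive coordinates.
# Minus-strand features are returned reverse-complemented.
sub extract_cds {
    my ($genome, $start, $end, $strand) = @_;
    die "bad coordinates $start..$end\n"
        if $start < 1 || $end < $start || $end > length $genome;
    my $sub = substr($genome, $start - 1, $end - $start + 1);
    return $strand eq '-' ? revcom($sub) : $sub;
}

# Toy example: an invented 12-base "genome", not real data.
my $genome = 'AAATGCCCTTTG';
print extract_cds($genome, 3, 8, '+'), "\n";   # prints ATGCCC
print extract_cds($genome, 3, 8, '-'), "\n";   # prints GGGCAT
```

Note that a CDS split over several GFF lines (same ID, several segments) would need its pieces joined in order before translation; the sketch handles a single segment only.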


A 4 year old monk

Replies are listed 'Best First'.
Re^2: Debugging Bioperl warnings for Genebank files that are missing info
by Sosi (Sexton) on Oct 24, 2014 at 14:27 UTC

    Thanks! I can get your first two points by simply downloading the files from where I got the .gbff (i.e. this link has the .gbff, .gff, and .faa files for one of the organisms). I can see that using genbank2gff3.pl is probably easier/faster, but I'm not sure the output will be the same. I believe Bio::SeqIO already takes the reverse complement into account there, but I'll make sure.

    More importantly, if those .gbff don't have the sequences as in my examples, extracting them will not be possible anyway.

    I am wondering why many of these files are wrongly deposited in the first place. And if they are, why isn't this automatically corrected by NCBI itself... I was expecting this to be much easier than it is proving to be, but I guess this isn't the place to rant about this eh :)

    Off topic: If you find an easier way to get the CDS and the protein sequences please let me know. Even if it involves not using Genbank, as long as I can use NCBI's FTP everything is fine...

      If you find an easier way to get the CDS and the protein sequences please let me know. Even if it involves not using Genbank, as long as I can use NCBI's FTP everything is fine...

      "not using Genbank" and "as long as I can use NCBI's FTP" seem a bit contradictory. GenBank is NCBI, no?

      Perhaps UniProt.org has what you want? (You haven't said what it is that you're after...)

      For example, 742726 being your tax_id (taxonomy ID), it delivers the proteins with http://www.uniprot.org/uniprot/?query=taxonomy:742726&sort=score

      I see there is a gff tab for download (easily turned into a URL).

      As far as I know, NCBI, Ensembl, and UniProt exchange sequences and annotation regularly.

        Eh, indeed that is contradictory. I guess, in my mind I was thinking "well, I don't mind using some other approach, as long as I can use NCBI so that all info retrieved is consistent". Now, given the problems that I've found, I guess "consistency" is not the best word to describe NCBI's FTP.

        The idea of using Uniprot is ok for retrieving the proteins. I am now trying to see if there is an easy way to retrieve their genomic sequences because this is the big problem that I am facing now.

        Also, I added a snippet of the kind of output that I would like to get.

      By converting the gb files to gff you're only getting a list of annotation information and coordinates about the genes within that species, so yes, the output will not be the same but that should not be a problem because having a fasta file handy you would parse that gff list for the CDS coordinates and extract the corresponding subsequences from the fasta file.
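      When the FASTA holds more than one record (contigs, plasmids), the coordinates only make sense together with the sequence ID in column 1 of the GFF, so it is worth keeping that column while parsing. A minimal sketch of the parsing step in plain Perl follows; `parse_cds_line` is a hypothetical helper and the sample lines are invented for illustration, not taken from any real assembly.

```perl
use strict;
use warnings;

# Parse one GFF3 line; return a hashref for CDS features, undef otherwise.
# GFF3 columns: seqid, source, type, start, end, score, strand, phase, attributes.
sub parse_cds_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ || $line !~ /\S/;   # skip comments and blank lines
    my @f = split /\t/, $line;
    return unless @f >= 9 && $f[2] eq 'CDS';    # keep CDS rows only
    return { seqid => $f[0], start => $f[3], end => $f[4], strand => $f[6] };
}

# Invented sample lines for demonstration.
my @lines = (
    "##gff-version 3\n",
    "contig1\tGenBank\tgene\t10\t99\t.\t+\t.\tID=gene1\n",
    "contig1\tGenBank\tCDS\t10\t99\t.\t+\t0\tID=cds1;Parent=gene1\n",
    "contig2\tGenBank\tCDS\t5\t64\t.\t-\t0\tID=cds2\n",
);

for my $line (@lines) {
    my $cds = parse_cds_line($line) or next;
    # One tab-separated row per CDS: seqid, start, end, strand.
    print join("\t", @{$cds}{qw(seqid start end strand)}), "\n";
}
```

      Each returned record carries the seqid, so the matching FASTA entry can be looked up by ID before the subsequence is cut out.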

      You'll be surprised by the amount of incorrect deposition to GenBank, but that really depends on the interest in the organism; for instance, human and bacterial entries are better maintained. If the deposited data are part of a published study, then it is on the authors to share their data as a prerequisite to publishing, and many of them don't have dedicated support to ensure the integrity of their data. You can always exclude an empty file if it doesn't pass your inclusion standards.

      EMBL and Ensembl have an FTP facility as well, and it seems they have information about the species you're studying. Check the file new_genomes.txt in this link. Explore it, and if you have more queries you'd be better off directing them at platforms such as www.biostars.org or SEQanswers. As far as my approach is concerned, I guess I have suggested one of the simplest workarounds since I've been repeatedly bitten by GenBank data myself, but that doesn't mean you won't bump into other problems in the other databases though ;)


      A 4 year old monk
