
Re^2: Debugging Bioperl warnings for Genebank files that are missing info

by Sosi (Sexton)
on Oct 24, 2014 at 14:27 UTC


in reply to Re: Debugging Bioperl warnings for Genebank files that are missing info
in thread Debugging Bioperl warnings for Genebank files that are missing info

Thanks! I can get your first two points by simply downloading the files from where I got the .gbff (i.e. this link has the .gbff, .gff, and .faa files for one of the organisms). Using genbank2gff3.pl is probably easier and faster, but I'm not sure the output will be the same. I believe Bio::SeqIO already takes the reverse complement into account there, but I'll make sure.

More importantly, if those .gbff files don't contain the sequences, as in my examples, extracting them will not be possible anyway.

I am wondering why many of these files are wrongly deposited in the first place. And if they are, why isn't this automatically corrected by NCBI itself... I was expecting this to be much easier than it is proving to be. But I guess this isn't the place to rant about this, eh :)

Off topic: If you find an easier way to get the CDS and the protein sequences, please let me know. Even if it involves not using GenBank, as long as I can use NCBI's FTP everything is fine...

Re^3: Debugging Bioperl warnings for Genebank files that are missing info
by erix (Prior) on Oct 25, 2014 at 14:33 UTC

    If you find an easier way to get the CDS and the protein sequences please let me know. Even if it involves not using Genbank, as long as I can use NCBI's FTP everything is fine...

    "not using Genbank" and "as long as I can use NCBI's FTP" seem a bit contradictory. GenBank is NCBI, no?

    Perhaps UniProt.org has what you want? (You haven't said what it is that you're after...)

    For example, 742726 being your tax_id (taxonomy ID), it delivers the proteins with http://www.uniprot.org/uniprot/?query=taxonomy:742726&sort=score

    I see there is a gff tab for download (easily turned into a URL).
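To make the "easily turned into a URL" remark concrete, here is a minimal sketch that only builds the query string; it does not contact UniProt. The base URL, `query` and `sort` come from the link quoted above. The `format` parameter name and the helper `uniprot_taxonomy_url` are assumptions for illustration, and the UniProt site may no longer accept this URL scheme.

```python
from urllib.parse import urlencode

# Base URL as quoted in the post above (2014-era UniProt address).
BASE = "http://www.uniprot.org/uniprot/?"


def uniprot_taxonomy_url(tax_id, fmt=None):
    """Build a UniProt query URL for one taxonomy ID (hypothetical helper)."""
    params = [("query", f"taxonomy:{tax_id}"), ("sort", "score")]
    if fmt is not None:
        # ASSUMPTION: the download tab maps to a "format" query parameter.
        params.append(("format", fmt))
    # Keep ":" unescaped so the result matches the URL shown in the post.
    return BASE + urlencode(params, safe=":")


print(uniprot_taxonomy_url(742726))
print(uniprot_taxonomy_url(742726, fmt="gff"))
```

The first call reproduces the URL given above; the second shows where a download format would be appended if the assumption holds.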

    As far as I know, NCBI, Ensembl, and UniProt exchange sequences and annotation regularly.

      Eh, indeed that is contradictory. I guess in my mind I was thinking "well, I don't mind using some other approach, as long as I can use NCBI so that all info retrieved is consistent". Now, given the problems that I've found, I guess "consistency" is not the best word to describe NCBI's FTP.

      The idea of using UniProt is OK for retrieving the proteins. I am now trying to see if there is an easy way to retrieve their genomic sequences, because this is the big problem that I am facing now.

      Also, I added a snippet of the kind of output that I would like to get.

Re^3: Debugging Bioperl warnings for Genebank files that are missing info
by Anonymous Monk on Oct 25, 2014 at 13:43 UTC

    By converting the gb files to gff you only get a list of annotation information and coordinates for the genes within that species, so yes, the output will not be the same. That should not be a problem, though: with a fasta file handy, you would parse that gff list for the CDS coordinates and extract the corresponding subsequences from the fasta file.
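The gff-plus-fasta approach described above could be sketched roughly as follows. This is plain Python with toy in-memory data rather than BioPerl, and the function names (`read_fasta`, `reverse_complement`, `cds_sequences`) are made up for illustration. It assumes tab-separated GFF rows with 1-based inclusive coordinates, and it treats every CDS row independently, so a CDS split over several rows would still need to be joined.

```python
def read_fasta(lines):
    """Return {sequence_id: sequence} from an iterable of FASTA lines."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]   # ID is the first word of the header
            seqs[name] = []
        elif line and name is not None:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def cds_sequences(gff_lines, genome):
    """Yield (seqid, start, end, strand, subsequence) for every CDS row."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue                     # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue                     # keep only CDS features
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        sub = genome[seqid][start - 1:end]   # 1-based inclusive -> Python slice
        if strand == "-":
            sub = reverse_complement(sub)    # minus-strand features
        yield seqid, start, end, strand, sub


# Toy data, invented for this example.
fasta = [">chr1 toy contig", "ATGAAATAGC", "CTATTTCAT"]
gff = [
    "##gff-version 3",
    "chr1\ttoy\tgene\t1\t9\t.\t+\t.\tID=gene1",
    "chr1\ttoy\tCDS\t1\t9\t.\t+\t0\tID=cds1",
    "chr1\ttoy\tCDS\t11\t19\t.\t-\t0\tID=cds2",
]

genome = read_fasta(fasta)
results = list(cds_sequences(gff, genome))
for row in results:
    print(row)
```

Both toy CDS rows yield the same coding sequence, one read directly from the plus strand and one recovered from the minus strand by reverse complementing.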

    You'll be surprised by the amount of incorrect deposition to GenBank, but that really depends on the interest in the organism; for instance, human and bacterial entries are better maintained. If the deposited data are part of a published study, then it is on the authors to share their data as a prerequisite to publishing, and many of them don't have dedicated support to ensure the integrity of their data. You can always exclude an empty file if it doesn't pass your inclusion standards.
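One possible way to apply such an inclusion standard is to flag GenBank flat-file records that carry no bases after their ORIGIN line. The sketch below is plain Python on toy input, not BioPerl, and `records_without_sequence` is a hypothetical helper. It assumes the usual flat-file layout: a LOCUS line opens a record, sequence lines follow ORIGIN, and `//` closes the record.

```python
def records_without_sequence(lines):
    """Return the LOCUS names of records that contain no sequence bases."""
    missing, locus, in_origin, n_bases = [], None, False, 0
    for line in lines:
        if line.startswith("LOCUS"):
            parts = line.split()
            locus = parts[1] if len(parts) > 1 else "?"
            in_origin, n_bases = False, 0
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: report it if no bases were seen.
            if locus is not None and n_bases == 0:
                missing.append(locus)
            locus, in_origin = None, False
        elif in_origin:
            # Sequence lines look like "        1 atgaaatagc ctatttcat";
            # the first token is the position counter, the rest are bases.
            n_bases += sum(len(tok) for tok in line.split()[1:])
    return missing


# Toy records, invented for this example.
toy = [
    "LOCUS       REC1          19 bp    DNA",
    "ORIGIN",
    "        1 atgaaatagc ctatttcat",
    "//",
    "LOCUS       REC2           0 bp    DNA",
    "FEATURES             Location/Qualifiers",
    "//",
]

print(records_without_sequence(toy))
```

Records flagged this way can be skipped before any feature extraction is attempted, since there is nothing to take a subsequence from.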

    EMBL and Ensembl have FTP facilities as well, and it seems they have information about the species you're studying. Check the file new_genomes.txt in this link. Explore it, and if you have more queries you'd be better off directing them at platforms such as www.biostars.org or seqanswers. As far as my approach is concerned, I guess I have suggested one of the simplest workarounds, since I've been repeatedly bitten by GenBank data myself, but that doesn't mean you won't bump into other problems in the other databases ;)


    A 4 year old monk
