PerlMonks
Re^3: Debugging Bioperl warnings for Genebank files that are missing info
by Anonymous Monk on Oct 25, 2014 at 13:43 UTC ( [id://1104962]=note )

Converting the gb files to GFF gives you only the annotation information and coordinates for the genes in that species, so yes, the output will not be the same. That should not be a problem, though: with a FASTA file handy, you parse the GFF list for the CDS coordinates and extract the corresponding subsequences from the FASTA file.

You'll be surprised by the number of incorrect depositions in GenBank, although that really depends on how much interest there is in the organism; human and bacterial entries, for instance, are better maintained. If the deposited data are part of a published study, it is on the authors to share their data as a prerequisite to publishing, and many of them don't have dedicated support to ensure the integrity of their data. You can always exclude an empty file if it doesn't pass your inclusion standards.

EMBL and Ensembl have an FTP facility as well, and it seems they have information about the species you're studying. Check the file new_genomes.txt in this link. Explore it, and if you have more queries you'd be better off directing them at platforms such as www.biostars.org or seqanswers.

As far as my approach is concerned, I guess I have suggested one of the simplest workarounds, since I've been repeatedly bitten by GenBank data myself, but that doesn't mean you won't bump into other problems in the other databases ;)

A 4 year old monk
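
For what it's worth, the GFF-plus-FASTA route can be sketched in plain Perl, without BioPerl. Treat this as a rough sketch under assumptions, not a finished tool: the sub names (read_fasta, read_cds, extract_cds) are made up for illustration, the first GFF column is assumed to match the first word of the FASTA header, and every CDS row is handled as a standalone segment (a multi-exon CDS would need its rows joined afterwards). GFF coordinates are 1-based and inclusive, hence the "start - 1" in substr.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read FASTA from a file handle into a hash ref: sequence ID => sequence.
sub read_fasta {
    my ($fh) = @_;
    my ( %seq, $id );
    while ( my $line = <$fh> ) {
        $line =~ s/\s+\z//;          # drop newline and trailing whitespace
        next if $line eq '';
        if ( $line =~ /^>(\S+)/ ) {  # header: ID is the first word after '>'
            $id = $1;
            $seq{$id} = '';
        }
        elsif ( defined $id ) {
            $seq{$id} .= $line;
        }
    }
    return \%seq;
}

# Collect the CDS rows of a GFF file handle as a list of hash refs.
sub read_cds {
    my ($fh) = @_;
    my @cds;
    while ( my $line = <$fh> ) {
        last if $line =~ /^##FASTA/;    # GFF3 may embed sequence after this
        next if $line =~ /^#/ or $line !~ /\S/;
        chomp $line;
        my @col = split /\t/, $line;
        next unless @col >= 8 and $col[2] eq 'CDS';
        push @cds, {
            seqid  => $col[0],
            start  => $col[3],
            end    => $col[4],
            strand => $col[6],
        };
    }
    return \@cds;
}

# Cut one CDS segment out of its parent sequence.
# Returns nothing if the sequence is missing or the coordinates do not fit.
sub extract_cds {
    my ( $seqs, $f ) = @_;
    my $dna = $seqs->{ $f->{seqid} };
    return unless defined $dna;
    return if $f->{start} < 1
        or $f->{start} > $f->{end}
        or $f->{end} > length $dna;

    my $sub = substr $dna, $f->{start} - 1, $f->{end} - $f->{start} + 1;
    if ( $f->{strand} eq '-' ) {        # minus strand: reverse complement
        $sub = scalar reverse $sub;
        $sub =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $sub;
}

# Run only when both file names are given on the command line.
if ( @ARGV == 2 ) {
    my ( $gff_file, $fasta_file ) = @ARGV;
    open my $gff, '<', $gff_file   or die "Cannot open $gff_file: $!\n";
    open my $fa,  '<', $fasta_file or die "Cannot open $fasta_file: $!\n";
    my $seqs = read_fasta($fa);
    for my $f ( @{ read_cds($gff) } ) {
        my $sub = extract_cds( $seqs, $f );
        if ( !defined $sub ) {
            warn "Skipping $f->{seqid}:$f->{start}-$f->{end}\n";
            next;
        }
        print ">$f->{seqid}:$f->{start}-$f->{end}($f->{strand})\n$sub\n";
    }
}
```

You would call it with the GFF file first and the FASTA file second, e.g. "perl extract_cds.pl annotations.gff genome.fasta" (all three names are placeholders). Rows whose sequence is missing or whose coordinates run off the end are skipped with a warning, which is the kind of thing that tends to show up with incomplete entries.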