PerlMonks  

i need help to download text data from ensemble for a project

by aaabc (Initiate)
on Feb 28, 2012 at 13:12 UTC ( [id://956651]=perlquestion )

aaabc has asked for the wisdom of the Perl Monks concerning the following question:

Hi all Perl monks. I am a master's student. For a scientific project, I will analyze six primate genomes that are already aligned in ensemble. However, for some reason I can only get 800k bases (characters) at a time, and I need to get all of them (about 3 billion). I learned Perl for small scripting tasks, but I don't know anything about using it with the web. My Perl code for analyzing the text files is almost ready, but I need to download these text files in sequence. Here is my problem:

http://www.ensembl.org/Homo_sapiens/Export/Output/Location/Alignment?align=548;db=core;output=alignment;r=11:4953971-5753968;format=fasta;_format=Text

This link has 5 animals' DNA aligned with human chromosome 11 from 4953971 to 5753970. I need to automate this: next I should get a text file with another 800k characters, from 5753971 to 6553970, and so on.

After I download the text files to my computer, I can do the rest. By the way, I am using Windows 7. Thank you for your help.
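The consecutive 800k-base windows described above can be enumerated with a few lines of Perl. This is only a sketch of the coordinate arithmetic: it fetches nothing, the sub name `windows` is made up for illustration, and the end coordinate in the demo is chosen just to keep the output short (the real chromosome length would have to be supplied).

```perl
use strict;
use warnings;

# Return a list of [start, end] pairs covering $first..$last in
# consecutive windows of at most $size bases (1-based, inclusive).
sub windows {
    my ($first, $last, $size) = @_;
    die "size must be positive\n" if $size < 1;
    my @w;
    for (my $s = $first; $s <= $last; $s += $size) {
        my $e = $s + $size - 1;
        $e = $last if $e > $last;    # the last window may be shorter
        push @w, [$s, $e];
    }
    return @w;
}

# Demo: the first three windows from the question, on chromosome 11.
# The end value 7_353_970 is an arbitrary stopping point for the demo.
for my $w (windows(4_953_971, 7_353_970, 800_000)) {
    printf "11:%d-%d\n", @$w;
}
```

Each printed line has the same `chromosome:start-end` shape as the `r=` parameter in the link above.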


Replies are listed 'Best First'.
Re: i need help to download text data from ensemble for a project
by erix (Prior) on Feb 28, 2012 at 13:45 UTC

    There is an extensive API for the ensembl data, but it's a bit of a learning curve.

    Perhaps this is an easier way to get at their sixway primate alignment:

    Main site http://www.ensembl.org/index.html -> links to http://www.ensembl.org/downloads.html -> links to http://www.ensembl.org/info/data/ftp/index.html There, the EMF link leads to: ftp://ftp.ensembl.org/pub/release-66/emf/ensembl-compara/epo_6_primate/ (chromosome 11 is there, too)

    The emf format is explained in a readme in that directory.

    (If you prefer to get your stuff via the URLs, make sure you leave ample waiting time between retrievals (e.g. sleep 20, or whatever duration ensembl requests; it will be in their API's documentation), or you risk being banned from the ensembl servers.)
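A rough sketch of the FTP route with a pause between retrievals, using the core Net::FTP module. The host and directory are taken from the FTP link above; the sub names, the anonymous login address, the 20-second pause, and especially the `qr/chr11/` file-name pattern are assumptions for illustration, so check the readme and the directory listing before relying on them.

```perl
use strict;
use warnings;
use Net::FTP;

# Host and directory come from the FTP link above.
my $HOST = 'ftp.ensembl.org';
my $DIR  = '/pub/release-66/emf/ensembl-compara/epo_6_primate';

# Keep only the names in a listing that match a pattern.
sub pick_files {
    my ($pattern, @names) = @_;
    return grep { $_ =~ $pattern } @names;
}

# Download every matching file into $outdir, pausing between files.
sub fetch_all {
    my ($outdir, $pattern, $pause) = @_;
    my $ftp = Net::FTP->new($HOST, Passive => 1)
        or die "Cannot connect to $HOST: $@\n";
    $ftp->login('anonymous', 'anonymous@example.com')
        or die "Login failed: ", $ftp->message;
    $ftp->cwd($DIR) or die "cwd failed: ", $ftp->message;
    $ftp->binary;    # the files may be compressed, so avoid ASCII mode
    my @names = $ftp->ls;
    for my $name (pick_files($pattern, @names)) {
        print "Fetching $name\n";
        $ftp->get($name, "$outdir/$name")
            or warn "get $name failed: ", $ftp->message;
        sleep $pause;    # be polite to the server between retrievals
    }
    $ftp->quit;
}

# Runs only when an output directory is given on the command line.
# qr/chr11/ is a guess at how the chromosome 11 files are named.
fetch_all($ARGV[0], qr/chr11/, 20) if @ARGV;
```

Saved under a name of your choosing (say `fetch_emf.pl`), it would be run as `perl fetch_emf.pl C:/data/emf`, with the output directory already created.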

    (BTW, it's EnsEMBL, without the -e. See EMBL)

Re: i need help to download text data from ensemble for a project
by Marshall (Canon) on Feb 28, 2012 at 13:45 UTC
    EnsEMBL Export says:
    We respectfully request that you do not script against the export pages on the Ensembl website, as this degrades the service for other web visitors. The public MySQL server is provided specifically for this purpose. Thank you. If you wish to extract multiple features or regions, we recommend using the Perl API if possible.
    I looked around a bit on their site and these folks provide lots of Perl software. Looks like you can get what you want if you "play by their rules" for huge downloads.

    Update:
    I've never used this particular site before, but I've used other large sites. When you are going to be retrieving a lot of data, it really pays to read the details - special tools are usually available - and you do risk getting on the "bad-boy" list if you don't play by the rules. One of the slickest sites that I use has a gizmo where I can launch a custom DB query...I give them an e-mail address and they send me an e-mail with a URL that is valid for a few days when my data is ready for download. It might take an hour+ for my query to process, but I can get a lot of customized data this way and it doesn't impact the website performance. Just an implementation thought and observation from something that I've seen work well as a user.

Re: i need help to download text data from ensemble for a project
by Anonymous Monk on Feb 28, 2012 at 13:16 UTC

    However for some reason i can only get 800k bases(characters)

    That is probably explained in the FAQ/Help or Terms of Use :)

      I know they don't let you download alignments chromosome by chromosome, but while writing the main code I was able to download 1 million bases. In theory it said you could download 5 million in one try, but I would still need to get these text files in sequence to reach all 3 billion.

        My point was to check the terms of service
