
NCBI sequence fetching

by capemaster (Initiate)
on May 12, 2009 at 08:18 UTC (#763435)
capemaster has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all the kind monks!
The Perl force is very low in me, but I have tried to write a script (OK, I took something already written and modified it) to retrieve multiple sequences in FASTA format from GenBank.

Here is the code:

#!/usr/bin/perl -w
use Bio::Perl;

$database = "genbank";
@accessions = ( "bunch", "of", "accession", "numbers" );
$count = 1;
$n = 0;
while ( $accessions[$n] ) {
    $id = $accessions[$n];
    $format = "fasta";
    $sequence = get_sequence( $database, $id );
    write_sequence( ">-", $format, $sequence );
    $n++;
    $count++;
    sleep(1);
}

The problem is that it works only intermittently: I need to download about two thousand sequences for further studies, but the script hits an exception caused by a server error.
Here is the error:
------------ EXCEPTION -------------
MSG: WebDBSeqI Request Error:
HTTP/1.1 503 Service Temporarily Unavailable
Connection: close
Date: Tue, 12 May 2009 07:57:00 GMT
Accept-Ranges: bytes
Server: Apache
Vary: accept-language,accept-charset
Content-Language: en
Content-Type: text/html; charset=iso-8859-1
Client-Date: Tue, 12 May 2009 07:57:30 GMT
Client-Peer:
Client-Response-Num: 1
Link: <>; /="/"; rev="made"
Title: Service unavailable!

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html xmlns="" lang="en" xml:lang="en">
<head>
<title>Service unavailable!</title>
<link rev="made" href="" />
<style type="text/css"><!--/*--><![CDATA[/*><!--*/
    body { color: #000000; background-color: #FFFFFF; }
    a:link { color: #0000CC; }
    p, address {margin-left: 3em;}
    span {font-size: smaller;}
/*]]>*/--></style>
</head>
<body>
<h1>Service unavailable!</h1>
<p>
    The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.
</p>
<p>
    If you think this is a server error, please contact the <a href="">webmaster</a>.
</p>
<h2>Error 503</h2>
<address>
    <a href="/"></a><br />
    <span>Tue May 12 03:57:00 2009<br />
    Apache</span>
</address>
</body>
</html>

STACK Bio::DB::WebDBSeqI::_stream_request /sw/lib/perl5/5.8.8/Bio/DB/W
STACK Bio::DB::WebDBSeqI::get_seq_stream /sw/lib/perl5/5.8.8/Bio/DB/We
STACK Bio::DB::NCBIHelper::get_Stream_by_acc /sw/lib/perl5/5.8.8/Bio/DB/
STACK Bio::DB::WebDBSeqI::get_Seq_by_acc /sw/lib/perl5/5.8.8/Bio/DB/We
STACK Bio::Perl::get_sequence /sw/lib/perl5/5.8.8/Bio/
STACK toplevel Desktop/
--------------------------------------

------------- EXCEPTION -------------
MSG: acc AA387173 does not exist
STACK Bio::DB::WebDBSeqI::get_Seq_by_acc /sw/lib/perl5/5.8.8/Bio/DB/We
STACK Bio::Perl::get_sequence /sw/lib/perl5/5.8.8/Bio/
STACK toplevel Desktop/
--------------------------------------
I thought it could be an NCBI issue... they say that for multiple requests one has to wait for night time or the weekend and not overload the system with more than 3 requests per second, but I ran this at night and added a sleep(1) in the while loop.

Can someone help me?

Replies are listed 'Best First'.
Re: NCBI sequence fetching
by citromatik (Curate) on May 12, 2009 at 08:59 UTC

    Instead of BioPerl you can try NCBI's eutils, in particular the EFetch tool, which retrieves FASTA sequences given a list of accession identifiers.


      Thank you... The fact is that with EFetch I don't know how to start :D

      Can you help me on this aspect?

        On the eutils help page you can find a sample Perl program that you can download and run locally. Try to understand what it does, and then modify it to fit your purposes.

        If you have any problems in the process, do not hesitate to post a new question here.

        Hope this helps
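
        For anyone unsure where to start, here is a minimal sketch of the EFetch approach (not the sample program from the help page). It assumes the public EFetch endpoint with its db, id, rettype and retmode parameters, uses HTTP::Tiny (which needs IO::Socket::SSL installed for https), and takes accessions from the command line. The helper names batches and efetch_form and the batch size of 100 are hypothetical choices made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;

# Assumption: the public E-utilities EFetch endpoint.
my $EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';

# Split a list of ids into chunks of at most $size (hypothetical helper).
sub batches {
    my ( $size, @ids ) = @_;
    my @chunks;
    push @chunks, [ splice @ids, 0, $size ] while @ids;
    return @chunks;
}

# Build the form fields for one batch of accessions (hypothetical helper).
sub efetch_form {
    my (@ids) = @_;
    return {
        db      => 'nuccore',            # nucleotide records
        id      => join( ',', @ids ),    # comma-separated accession list
        rettype => 'fasta',
        retmode => 'text',
    };
}

# Accessions come from the command line; with none given, nothing is fetched.
my @accessions = @ARGV;
my $http       = HTTP::Tiny->new( timeout => 60 );

for my $batch ( batches( 100, @accessions ) ) {
    my $res = $http->post_form( $EFETCH, efetch_form(@$batch) );
    if ( $res->{success} ) {
        print $res->{content};
    }
    else {
        # Report the batch so it can be retried later.
        warn "Batch failed ($res->{status} $res->{reason}): @$batch\n";
    }
    sleep 1;    # stay well under the 3 requests per second mentioned above
}
```

        Sending many accessions per request means about two thousand sequences need only a few dozen requests rather than two thousand, which should reduce exposure to 503 responses, though it does not rule them out.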


Re: NCBI sequence fetching
by binf-jw (Monk) on May 12, 2009 at 09:04 UTC
    Try catching the sequences that can't be downloaded. BioPerl reports errors through the $obj->throw() method from Bio::Root::Root, and you can catch those exceptions with a simple eval block.
    Try something like this: (Not tested)
    #!/usr/bin/perl
    use English '-no_match_vars';
    use strict;
    use warnings;
    use Bio::Perl;

    my $database   = 'genbank';
    my $format     = 'fasta';
    my @accessions = ( "bunch", "of", "accession", "numbers" );

    for my $i ( 0 ... 2000 ) {
        my $entry_id = $accessions[$i];
        eval {
            my $sequence = get_sequence( $database, $entry_id );
            write_sequence( '>-', $format, $sequence );
        };
        if ( $EVAL_ERROR ) {
            # Log the error here if one occurred
            print {*STDERR} "Could not download sequence: [$entry_id]\n";
        }
        # If you need to keep track of count
        my $count = $i + 1;
    }

      Thank you.
      I have tried the script but it has a strange behaviour: the sequences are fetched multiple times after an error.
        Hmmmmm, do you mean that after the error it keeps trying to get the same sequence?
        It's in a for loop, so it should only try each of the entries once unless they appear multiple times in the array "@accessions".
        I forgot to change this when I wrote the code, but the for loop should be:
        for my $i ( 0 .. scalar @accessions - 1 ) { }
Re: NCBI sequence fetching
by Anonymous Monk on May 12, 2009 at 08:31 UTC
    There is no way around HTTP/1.1 503 Service Temporarily Unavailable. You have to keep track of your downloads, and re-try those that failed.
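
    A minimal sketch of that bookkeeping, under stated assumptions: fetch_with_retries is a hypothetical helper, $fetch is any code reference that dies on failure, and the flaky stand-in below only simulates a 503 so the example runs without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Call $fetch->($id) for every id, remember the ids whose fetch died,
# and retry only those, for at most $max_rounds passes.
# Returns ( \%results_by_id, \@ids_still_failing ).
sub fetch_with_retries {
    my ( $fetch, $ids, $max_rounds, $pause ) = @_;
    my @pending = @$ids;
    my %result;

    for my $round ( 1 .. $max_rounds ) {
        my @failed;
        for my $id (@pending) {
            my $value;
            my $ok = eval { $value = $fetch->($id); 1 };
            if   ($ok) { $result{$id} = $value }
            else       { push @failed, $id }
        }
        @pending = @failed;
        last unless @pending;

        # Wait a little longer before each further pass.
        sleep $pause * $round if $round < $max_rounds;
    }
    return ( \%result, \@pending );
}

# Stand-in for a real download: fails on the first attempt for 'B'.
my %attempts;
my $flaky = sub {
    my ($id) = @_;
    die "503 Service Temporarily Unavailable\n"
        if $id eq 'B' && ++$attempts{$id} < 2;
    return ">$id\nACGT\n";
};

my ( $done, $gave_up ) = fetch_with_retries( $flaky, [qw(A B C)], 3, 0 );
print "fetched: ", join( ',', sort keys %$done ), "\n";
print "still failing: @$gave_up\n";
```

    With Bio::Perl, $fetch could be something like sub { get_sequence( 'genbank', $_[0] ) }, assuming get_sequence throws on a failed request as the trace in the question suggests. Accessions that genuinely do not exist will keep failing on every pass, so whatever remains in the second list still needs a manual check.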
