PerlMonks  

Re^4: Searching array against hash

by BrowserUk (Pope)
on Aug 22, 2013 at 03:18 UTC (#1050447)


in reply to Re^3: Searching array against hash
in thread Searching array against hash

It will help if you are looking to retrieve a subsequence from the human genome, the FASTA file of which is about 5 Gb;

I guess things have moved on. The version I have is just under 3GB and came in 25 files: chr1-22, chrM, chrX and chrY.

That said, if his 3 posted sequences are representative of his 900,000, his file is a tad under 900MB.

And if he can process that in "a few seconds", he could process your 5GB file in five-and-a-bit times "a few seconds".

But, and here is the point: it will take Bio::DB::Fasta at least that same five-and-a-bit times "a few seconds" to construct an index before he can start processing anything. So for a one-off process, there is a net loss.
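For the one-off case being argued here, the "simple procedure" is a single pass over the FASTA file into a hash keyed by record ID, with no index to build first. A minimal sketch (the sample records and IDs are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse FASTA text into a hash of id => sequence.
# One pass, no index file: the whole cost is a single read.
sub fasta_to_hash {
    my ($text) = @_;
    my ( %seq, $id );
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^>(\S+)/ ) {    # header line: ">id description"
            $id = $1;
            $seq{$id} = '';
        }
        elsif ( defined $id ) {
            $line =~ s/\s+//g;         # sequence lines may be wrapped
            $seq{$id} .= $line;
        }
    }
    return \%seq;
}

# Tiny in-line sample so the sketch is self-contained;
# in practice you would slurp the 900MB file instead.
my $fasta = ">seq1 demo record\nACGTACGT\nGGCC\n>seq2\nTTTTAAAA\n";
my $seq   = fasta_to_hash($fasta);
print "$_ => $seq->{$_}\n" for sort keys %$seq;
```

Once the hash exists, every ID lookup is O(1), which is all the OP's task requires.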

Now the real crux. Given all the additional layers and overheads; how many times does he have to redo the process in order to obtain a net gain? (If ever.)

Then add to that the (potential) problems with installation, and the learning curve of finding your way around the documentation for 897 modules to find the one you want, and then learning how to use it to do what you want; and it becomes clear why so many bioinformaticians are looking for Lite alternatives to the Bio::Behemoth, and for simple procedures to get their work done, rather than becoming technical-debt slaves to the byzantine Bio::Empire.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^5: Searching array against hash
by bioinformatics (Friar) on Aug 22, 2013 at 03:31 UTC

    You're not quite reading my response right. What the OP wants is to retrieve the DNA sequence that corresponds to a specific ID. That's simple/fast enough, especially when using a hash. When I want to retrieve a subsequence, chr5:1234567-1234798 for example, which is only a portion of the sequence associated with a specific record in the FASTA file, then using Bio::DB::Fasta is far faster. The module has its uses, and is why someone implemented a similar thing in Python as Pygr (the indexing approach, not the parser per se). You're not wrong, Bio::DB::Fasta is overkill for this specific purpose; I just don't see bashing a tool that has been helpful to bioinformaticists for something close to a decade. Also, it installs just fine on Linux, where most of the users will be using it ;)

    Bioinformatics
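For sequences that fit in memory, the subsequence retrieval described above is also just a substr over the hash built earlier; the indexed module only starts to pay once sequences exceed memory or the index is reused. A sketch with invented toy data (the Bio::DB::Fasta calls in the comment are the documented new/seq interface, shown for comparison, not run here):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Given sequences already in a hash (id => sequence), a region
# like "chr5:1234567-1234798" is a substr with 1-based coordinates.
sub subseq {
    my ( $seq, $id, $start, $end ) = @_;
    die "unknown id: $id\n" unless exists $seq->{$id};
    return substr( $seq->{$id}, $start - 1, $end - $start + 1 );
}

# Toy data standing in for a genome hash.
my %seq = ( chr5 => 'ACGTACGTACGT' );
print subseq( \%seq, 'chr5', 3, 6 ), "\n";    # prints GTAC

# The indexed alternative the thread is weighing would read roughly:
#   use Bio::DB::Fasta;
#   my $db = Bio::DB::Fasta->new('genome.fa');   # builds/loads an index
#   my $s  = $db->seq( 'chr5', 1234567 => 1234798 );
```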
      You're not wrong, Bio::DB::Fasta is overkill for this specific purpose;

      That was my point.

      Also, it installs just fine on linux, where most of the users will be using it

      Often only because that's their only choice if they want to use BIO::*. Which adds yet another barrier to many of them thinking of using Perl for their work.

      I just don't see bashing a tool that has been helpful to bioinformaticists for something close to a decade.

      Why does the Python toolset exist, given the lead Perl had, and the greater performance Perl (simple, clean, basic, native Perl) has over native Python?

      Perhaps because PyGr consists of 54 files and 3MB, and installs easily wherever Python runs (which includes Windows);

      rather than the 2,215 files and 43MB of BioPerl, which only installs on *nix.

      Perhaps if someone had been more (constructively) critical of the way the BioPerl project was going at an earlier stage, it might have favoured performance, ease-of-use and portability over monolithic architecture, Oh-OO fanaticism, and O'Woe engineering.

      (IMO) BioPerl typifies what has gone wrong with Perl in the last decade. Gone is the original Unix principle of small, fast, dedicated tools that each do one thing very well and can be combined to do many more things very well; in its place, a Java-esque monolith consisting of layers upon layers, each contributing little value and lots of overhead, and nested so deep that it is impossible for the human brain to gain oversight of the whole, or even a clear, end-to-end view of any given end-use.

      I'm in awe of the way BioPerl got started; but exasperated by where it has arrived; and troubled by where those who should be its natural user base are being driven as a result.


