RFC: A call to bioinformationalists for some generic information.

by BrowserUk (Pope)
on May 27, 2015 at 23:59 UTC

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I don't need or want anything proprietary! (But accuracy would help!)

If you have recently run a fuzzy search for short sequences (primers?) (<32 bases) against a (publicly available) long sequence (~1GB or bigger), and have the knowledge/information available to answer the following questions, your help would be greatly appreciated.

  1. How long was the big sequence?

    (And preferably -- though not absolutely necessary -- where I can download a copy.)

  2. How many short sequences, and their length(s).

    Figures like "approx. 200 at around 25 bases" are better than nothing.

    "205 x average length 19, ranging from 14 to 25" is better.

    A list of exact lengths is better yet.

    (Best of all would be a file of the actual sequences used; but I realise that might be verboten.)

  3. How fuzzy?

    I.e. what Hamming distance was acceptable for a match? (A minimal Perl sketch of this test appears just after this list.)

    If your run used more complex rules (e.g. fewer than 3 inserts or deletes and up to 5 transpositions), those details would help.

    Also, if you used one of the BLASTx programs with a minimum "word length"; details of that setting would be important.

  4. How long did the run take?

    Here I really need more than just elapsed (wall clock) time.

    Perfection would be the number of clock cycles or CPU seconds; better still if details of the CPU(s) used were available.

  5. How many match sites were discovered?

    Just the overall number of match sites would suffice.

    Match sites per short sequence would be ideal, assuming that I can have the input sequences as well.

  6. What hardware was the run performed on?

    In some ways this is the most important criterion. CPU type(s), number of cores, and clock speeds would be best.
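
To pin down what I mean by "how fuzzy" in point 3, here's a minimal Perl sketch of the Hamming test I have in mind (the sequences and the cutoff are made up purely for illustration):

    use strict;
    use warnings;

    # Hamming distance between two equal-length strings: XOR the byte
    # strings, then count the bytes that are not NUL.
    sub hamming {
        my ( $x, $y ) = @_;
        die "lengths differ" if length($x) != length($y);
        return ( $x ^ $y ) =~ tr/\0//c;
    }

    my $probe = 'ACGTACGTACGTACGT';
    my $site  = 'ACGTTCGTACGAACGT';

    my $dist = hamming( $probe, $site );
    print "Hamming distance: $dist\n";    # prints 2
    print "match\n" if $dist <= 3;        # e.g. accept <= 3 mismatches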

The reason:

I think I've found a better (more accurate and much faster) way to do such fuzzy searches; but before expending lots of effort on putting together a proper package for CPAN -- this is a pure, for-fun, home project; not work -- I'd really like to make some detailed comparisons with the current state-of-the-art to convince myself that it a) works; b) is sufficiently faster to warrant the effort.

Basically, I want to run my crude prototype code against a few real (or at least realistic) test cases with known results and timings, to see how it stands up before taking it any further.

Thanks for any help you can provide.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

Re: RFC: A call to bioinformationalists for some generic information.
by monkey_boy (Priest) on May 28, 2015 at 15:33 UTC

    My old job was basically being the company BLAST-monkey, so I could probably give you some specific help in a personal conversation. However, I haven't run any BLAST searches in the last couple of years that I could share.


    A couple of general points though:


    1. 99.9% (<-- pure guess based on extensive observations!) of BLAST searches are run with whatever defaults are set by the web-portal or command line.
    2. The speed of BLAST and related programs has been "fast enough" now for many years. So any improvements would need to come with "better" results (e.g. more accurate sequence alignments) to get the field excited.
    3. The Hamming distance is not really applicable in this field (although it may have some use as an intermediate pass to pre-filter a database of sequences), as e-values are the cut-off most frequently used. However, this depends largely on the reason for the search, and on whether you are looking for evolutionarily related protein sequences or short stretches of highly similar DNA. For the latter, take a look at BLAT (https://genome.ucsc.edu/FAQ/FAQblat.html), which last time I checked was orders of magnitude faster than BLAST at these types of searches.

If you want me to cobble together a test database or two with some general & more tricky edge examples, I could do so.

Cheers, Monkeyboy
This is not a Signature...

      1. 99.9% of BLAST searches are run with whatever defaults are set by the web-portal or command line.

        I don't understand the significance of that statement.

        I've looked at the NCBI web BLAST submit screen, and I wouldn't know where to start in order to submit a "typical" request; nor how to interpret whatever results I might receive.

        What I'm working on is not a substitute for everything that BLASTx does; it might be incorporated into BLASTx (or a BLASTx replacement), but that would need to be done by people who understand the field.

        My algorithm is purely concerned with addressing the problem (which has come up here many times over the last few years) of searching a very long string over a limited alphabet for relatively short inputs (15-32 bases typical), and finding all the match sites with a specified number of mismatches.
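
        For reference, the brute-force baseline for that exact problem statement looks like this in Perl (a sketch only; the toy sequences are invented, and this is emphatically not my algorithm -- just the obvious O(N*M) scan it needs to beat):

            use strict;
            use warnings;

            # Report every offset in $seq where $probe matches with at most
            # $maxmis mismatches (Hamming distance). Brute force: slide the
            # probe one position at a time and count mismatches via XOR.
            sub all_sites {
                my ( $seq, $probe, $maxmis ) = @_;
                my $plen = length $probe;
                my @hits;
                for my $pos ( 0 .. length($seq) - $plen ) {
                    my $mis = ( substr( $seq, $pos, $plen ) ^ $probe )
                        =~ tr/\0//c;
                    push @hits, $pos if $mis <= $maxmis;
                }
                return @hits;
            }

            my $seq   = 'TTACGTAGGTACGAACGTTT';    # toy 'genome'
            my $probe = 'ACGTACG';                 # toy primer
            # prints offsets 2, 6 and 10
            print "site at offset $_\n" for all_sites( $seq, $probe, 2 );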

      2. The speed of BLAST and related programs has been "fast enough" now for many years. So any improvements would need to come with "better" results (e.g. more accurate sequence alignments) to get the field excited.

        As I understand it, the way BLAST works is to build (or import a pre-built) index of short, fixed-size exact matches -- typically minimum 7 for web-based searches -- and use that index to limit the number of positions at which exhaustive comparisons are made.

        The down-side of the approach is that for shorter inputs with higher numbers of mismatches, some potential sites are never examined.

        I.e. if looking for a 25-base input with 4 mismatches, potential match sites where the 4 mismatches are evenly distributed through the 25 bases: e.g. ~....?....?....?....?.....~ will never be found, because none of the exact-match stretches is as long as the index word size.
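
        A quick back-of-envelope sketch of why (the word size of 7 is just the web-search minimum mentioned above): k mismatches cut a probe of length L into k+1 exact runs sharing L-k matching bases, so when spread evenly the longest run is ceil((L-k)/(k+1)):

            use strict;
            use warnings;
            use POSIX qw(ceil);

            # Longest exact run remaining when $k mismatches are spread as
            # evenly as possible through a probe of length $len.
            sub longest_run {
                my ( $len, $k ) = @_;
                return ceil( ( $len - $k ) / ( $k + 1 ) );
            }

            my $word = 7;    # minimum seed word size for web searches
            for my $k ( 0 .. 6 ) {
                my $run = longest_run( 25, $k );
                printf "25 bases, %d mismatches: longest run %2d => %s\n",
                    $k, $run,
                    $run >= $word ? 'a seed can fire' : 'site can be missed';
            }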

        My algorithm does not suffer this limitation; it finds all potential match sites regardless of the number of mismatches.

        Moreover, the ratio of mismatches does not affect the performance in any significant fashion.

        It could (for example) find *all* the 9-base sites with 8 mismatches; or 12-base or 25-base sites with 8 mismatches, in the same time; and very quickly.

      3. The Hamming distance is not really applicable in this field ..., as e-values are the cut-off most frequently used.

        As I understand E-values, they are a function of the makeup of the sequence being searched and the subsequence being searched for.

        They are a statistical measure of the likelihood of a "random match", given the makeup of the subsequence being sought and the sequence being searched.
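
        (For reference: the standard Karlin-Altschul expression behind those numbers is, in LaTeX notation,

            E = K \, m \, n \, e^{-\lambda S}

        where S is the alignment score, m and n the effective query and database lengths, and K and \lambda are parameters fitted to the scoring system and sequence composition. Nothing in it depends on which algorithm located the candidate alignments.)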

        As such, E-values are not affected by the search algorithm used; thus whatever filtering heuristics are currently applied would still need to be applied.

      What I'm getting from the similarities between your response to my request and a response to a request for information I emailed directly to the guys at the NCBI, is that the real problem is not finding match sites, but rather filtering the mass of match sites found to eliminate non-useful ones. And that is a process whose criteria I do not understand, and for which I have no insights to offer.

      Indeed, I'm approaching the conclusion that because my search algorithm would find *all* potential match sites, it might actually compound the filtering problem rather than help it.

      So it looks like I may have a solution looking for a problem to solve on my hands.

      Though I can't help but think that the potential for the "best" match site (however that might be assessed) being missed, because of the minimum index size (word-length), means that a lot of searches and pre- & post-filtering are being wasted.

      I was hoping to have some basic performance numbers to post in this reply, but looking at my results a couple of hours ago I see an anomaly in the numbers coming out that I wasn't expecting, which could mean: a) my expectations were off; b) I've a bug in my code; c) the algorithm doesn't work.

      I need to determine which of those is the case before I go posting "exciting numbers" that might be completely bogus.

      Thank you for your reply. You've given me much to think about. If I get (back) to the point where I think I am ready to do comparisons, I'll /msg you.


Re: RFC: A call to bioinformationalists for some generic information.
by marto (Cardinal) on May 28, 2015 at 10:11 UTC

    You could ask biohisham, failing that I'd guess that the #bioperl IRC channel would be a good place to ask such questions, or at least reach an audience who know what you're trying to do. A web based IRC client is available via their site.

Re: RFC: A call to bioinformationalists for some generic information.
by einhverfr (Friar) on May 28, 2015 at 06:31 UTC
    This is such a wide field that I think you would do better to ask on a more dedicated forum.

      Thing is, I need people who'll understand my lingo. If I hit a bio forum and get the terms wrong (and I would), I'd get nowhere.

      If you're active on an appropriate forum and feel like posting a pointer to the root node, that'd be great.

      Update: Plus, I guess that if no one using perl is doing this, there'd be little point in pursuing it.



        fwiw, I concur with einhverfr. Perl is used extensively in the bioinformatics community, but (I dare say) most of those folks don't hang out on perlmonks. Your chances of a "hit" are better there than here, I guess.

        I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
