Performance Tuning: Searching Long-Sequence Permutations

by Itatsumaki (Friar)
on Jun 26, 2003 at 19:55 UTC

Itatsumaki has asked for the wisdom of the Perl Monks concerning the following question:

I'm simulating DNA sequences by permutation, and I've come to something of a computational roadblock. Basically, I want to search very long sequences for all occurrences of a specific substring. In general, I want to find both the number and the locations of these occurrences.

The problem is that I have to do this so many times that it becomes quite time-consuming. I have several hundred "start" sequences, each with lengths in the 10-50k range, so creating and searching 10-100k permutations of these is slow. Below is the core loop that is consuming most of the time. Is there any way to speed this up?

### Typical sequence lengths are 10-50k letters
my $sub = "CGAGCGTTGACGCNNAGCTAGT";

### Count base frequencies
while ($sub =~ /([GCATN])/g) {
    if ($1 eq 'G') { $bases{g}++; }
    if ($1 eq 'C') { $bases{c}++; }
    if ($1 eq 'A') { $bases{a}++; }
    if ($1 eq 'T') { $bases{t}++; }
    if ($1 eq 'N') { $bases{n}++; }
}

### I need to create an array of bases to pass to rand()
my @bases = (
    ('G') x $bases{g},
    ('C') x $bases{c},
    ('A') x $bases{a},
    ('T') x $bases{t},
    ('N') x $bases{n}
);

### These are sample regex's: I actually test 24 different ones
$promoters_regex[0] = '[NT][GCATN][NG][NC][NG][NT][NG]';
$promoters_regex[1] = '[NG][NT][NG][NC][NG][GCATN][NT]';
$promoters_regex[2] = '[NA][GCATN][NC][NG][NC][NA][NC]';
$promoters_regex[3] = '[NC][NA][NC][NG][NC][GCATN][NA]';

### I would like $permutations_count = 100_000
for (my $i = 0; $i < $permutations_count; $i++) {
    my $string;

    # create new string
    for (my $j = 0; $j < length($sub); $j++) {
        $string .= $bases[rand @bases];
    }

    # test string for matches
    for (my $j = 0; $j < scalar(@promoters_regex); $j++) {
        while ($string =~ /$promoters_regex[$j]/g) {
            $permutations_total++;
        }
    }
}

I'd appreciate any ideas/suggestions on how to speed this up or design it better.

-Tats

Replies are listed 'Best First'.
Re: Performance Tuning: Searching Long-Sequence Permutations
by BrowserUk (Patriarch) on Jun 26, 2003 at 20:17 UTC

    One immediate saving that leaps out at me is that you can replace this

    while ($sub =~ /([GCATN])/g) {
        if ($1 eq 'G') { $bases{g}++; }
        if ($1 eq 'C') { $bases{c}++; }
        if ($1 eq 'A') { $bases{a}++; }
        if ($1 eq 'T') { $bases{t}++; }
        if ($1 eq 'N') { $bases{n}++; }
    }

    with

    $bases{ lc($1) }++ while $sub =~ /([GCATN])/g;

    Which should speed that bit up. If you can live with your keys being uppercase rather than lowercase, omit the call to lc. I realise that isn't the major part of the time, though.

    When you're dealing with such long strings, and especially if you are matching against them multiple times, studying the string can make a remarkable difference. I'm not sure if this is true where the string contains so few unique characters, but it would be worth testing.
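
    For concreteness, a minimal sketch of where study would slot into the inner loop (reusing @bases and @promoters_regex from the original post; whether it actually pays off here needs benchmarking):

    # Sketch only: study() each generated string once, before the 24 regex passes.
    # study() builds a lookup table for the scalar; its benefit over a five-letter
    # alphabet is uncertain, so benchmark with and without it.
    my $string = join '', map { $bases[rand @bases] } 1 .. length($sub);
    study $string;

    for my $re (@promoters_regex) {
        $permutations_total++ while $string =~ /$re/g;
    }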


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      At a guess, this should be _much_ faster:
      $bases{g} += $sub =~ y/G//;
      Because there's no pattern matching (as such), looping, or whatever involved.

      edit: A quick benchmark shows that the 5 transliterations are about 100 times faster than the while loop for a $sub 10,000 characters long. That's better than I expected.
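
      For reference, a benchmark along these lines is one way to compare the two (the sequence here is synthetic and the exact ratio will vary by machine):

      use strict;
      use warnings;
      use Benchmark qw(cmpthese);

      # synthetic 10,000-character sequence over the same five-letter alphabet
      my $sub = join '', map { (qw(G C A T N))[rand 5] } 1 .. 10_000;

      cmpthese(-3, {
          regex_loop => sub {
              my %bases;
              $bases{ lc $1 }++ while $sub =~ /([GCATN])/g;
              return \%bases;
          },
          tr_count => sub {
              my %bases;
              $bases{g} = $sub =~ tr/G//;
              $bases{c} = $sub =~ tr/C//;
              $bases{a} = $sub =~ tr/A//;
              $bases{t} = $sub =~ tr/T//;
              $bases{n} = $sub =~ tr/N//;
              return \%bases;
          },
      });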

      Also, although it's really a microoptimisation, I think it's always worthwhile remembering that ++$a is more efficient than $a++.

      Jasper
Re: Performance Tuning: Searching Long-Sequence Permutations
by chip (Curate) on Jun 26, 2003 at 20:42 UTC
    You're recompiling those regexes every time through the loop, which is fairly expensive. Use the "qr" syntax to create compiled regexes in the @promoters_regex array, thusly:

    $promoters_regex[0] = qr/[NT][GCATN][NG][NC][NG][NT][NG]/;
    $promoters_regex[1] = qr/[NG][NT][NG][NC][NG][GCATN][NT]/;
    $promoters_regex[2] = qr/[NA][GCATN][NC][NG][NC][NA][NC]/;
    $promoters_regex[3] = qr/[NC][NA][NC][NG][NC][GCATN][NA]/;
    ...
    $permutations_total++ while $string =~ $promoters_regex[$j];

    You would also find it a bit faster if you could combine all the regexes into one big alternation, but then you couldn't match multiple promoters that overlap each other, so that's probably gonna require lookaheads of some kind. Hm....
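
    A sketch of what that lookahead version might look like, using only the four sample patterns from the question and assuming $string holds one generated sequence. Note that this counts starting positions matched by at least one promoter, which is not quite the same tally as looping over each regex separately:

    # Zero-width lookahead around the alternation: a match consumes no characters,
    # so overlapping promoter sites at later positions are still found.
    my $combined = qr/
        (?= [NT][GCATN][NG][NC][NG][NT][NG]
          | [NG][NT][NG][NC][NG][GCATN][NT]
          | [NA][GCATN][NC][NG][NC][NA][NC]
          | [NC][NA][NC][NG][NC][GCATN][NA]
        )
    /x;

    my $total = 0;
    $total++ while $string =~ /$combined/g;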

    PS: Most of the time when you write a for loop with an index iterating through the elements of an array, you'd be better off with a foreach loop. This is one of those cases:

    for (@promoters_regex) { $total++ while $string =~ $_; }

        -- Chip Salzenberg, Free-Floating Agent of Chaos

      $permutations_total++ while $string =~ $promoters_regex[$j];

      I guess you want:

      $permutations_total++ while $string =~ /$promoters_regex[$j]/g;

      here, otherwise you'll never terminate once there's a match. But now, qr will actually be slower, because you are interpolating the compiled regex and recompiling it. And if there are many matches, some gain could be made by writing it as:

      {
          no warnings 'numeric';
          $permutations_total += $string =~ /$promoters_regex[$j]/g;
      }

      Abigail

        WRT interpolation in /$pat/g: Actually it is fast. When the entire pattern consists of an interpolated qr reference, no recompilation takes place at runtime, so the speed gain of precompilation is not lost. You can observe this with a combination of perl -Dts and reading the source code of pp_regcomp.

        WRT the missing //g: It's a fair cop, but society's to blame. :-)

            -- Chip Salzenberg, Free-Floating Agent of Chaos

        As I stated in /o is dead, long live qr//!, the form /$promoters_regex[$j]/g does not trigger recompilation: the qr object is the entire pattern, so it is not stringified and recompiled. Your expected overhead does not materialize in this case. Writing it as / $promoters_regex[$j] /gx would trigger the recompilation overhead, though.

        Never thought I'd be correcting Abigail, but this will only add 1 to $permutations_total. A //g match in scalar context only ever returns 1 (or 0), no matter how many times it matches. Unfortunately, to make it work, you have to run it through an array first:
        $permutations_total += @array = $string =~ /$promoters_regex[$j]/g;
        Which isn't very nice. Don't know how to get around that.
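
        One way around the temporary array is the empty-list assignment idiom: assigning the //g match to an empty list forces list context, and the assignment itself, read in scalar context, yields the number of matches. A sketch:

        $permutations_total += () = $string =~ /$promoters_regex[$j]/g;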

        Jasper
Re: Performance Tuning: Searching Long-Sequence Permutations
by dragonchild (Archbishop) on Jun 26, 2003 at 21:07 UTC
    There are a number of local improvements you can make.
    ### Typical sequence lengths are 10-50k letters
    my $sub = "CGAGCGTTGACGCNNAGCTAGT";

    ### Count base frequencies
    while ($sub =~ /([GCATN])/g) {
        if ($1 eq 'G') { $bases{g}++; }
        if ($1 eq 'C') { $bases{c}++; }
        if ($1 eq 'A') { $bases{a}++; }
        if ($1 eq 'T') { $bases{t}++; }
        if ($1 eq 'N') { $bases{n}++; }
    }

    ### I need to create an array of bases to pass to rand()
    my @bases = (
        ('G') x $bases{g},
        ('C') x $bases{c},
        ('A') x $bases{a},
        ('T') x $bases{t},
        ('N') x $bases{n},
    );
    can become
    my $sub   = "CGAGCGTTGACGCNNAGCTAGT";
    my @bases = split //, $sub;
    Then, you build your regexes. Use qr. That'll make a huge difference. (In my benchmarking below, I more than doubled the number of runs per second, from 81/s to 174/s, over a 3-CPU-second run.)

    The for-loops can definitely also use a lot of improvement. This is how I would write your script:

    my $sub   = 'CGAGCGTTGACGCNNAGCTAGT';
    my @bases = split //, $sub;
    my $count = 100;
    my $len   = length $sub;

    @promoters_regex = (
        qr'[NT][GCATN][NG][NC][NG][NT][NG]',
        qr'[NG][NT][NG][NC][NG][GCATN][NT]',
        qr'[NA][GCATN][NC][NG][NC][NA][NC]',
        qr'[NC][NA][NC][NG][NC][GCATN][NA]',
    );

    my $permutations_total = 0;
    while ($count--) {
        my $ilen = $len;
        my $string;
        $string .= $bases[rand @bases] while $ilen--;

        foreach my $regex (@promoters_regex) {
            $permutations_total += scalar($string =~ m/$regex/g);
        }
    }
    The obligatory benchmark information:

      You keep computing the length of $sub every iteration, even though it's static.
      length() is a property of a string and does not require computation. Remember, this is not C.

        Does the length builtin get called as a function or is it translated (or optimised) to a "direct reference" to the target string length property at compile time?


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Performance Tuning: Searching Long-Sequence Permutations (Use math not code?)
by BrowserUk (Patriarch) on Jun 27, 2003 at 09:22 UTC

    Warning: This post contains a lot of speculation and a little code. Basically, reading between the lines of your Perl, I conclude that there is no need to performance-tune your code, because the outcome, your $permutations_total, can be approximated for any given combination of input 'start sequence' and set of regexes by performing a probability calculation.

    I conclude this on the basis that you are not matching against an actual sequence, but against (a large number of) random permutations of that actual sequence. Therefore, you should be able to predict the outcome using statistics, without actually performing the task, which would be hugely more efficient. If that's true. Ignore the rest of this post if I missed the mark again.

    After giving this some thought, I think that you (and we:) are approaching this from the wrong direction. The thing I noticed was that the sample regexes you supplied all match exactly 7 characters. I'm not sure if it is always 7 characters, or if all your regexes are the same length, but the important thing is that you are probably always looking for a sequence of adjacent characters. If this is so, then you can probably speed the process up.

    What you're asking the regex engine to do when you match a long string against a fixed-length regex is:

    1. Do the first n-chars of the string match the regex?
    2. Drop the first char. Add the n+1 char to the end.
    3. Does it match now?
    4. Go to step 2 for each of the 49,994 7-char substrings of your 50k-char string.

    And then you repeat this process for each of the other 23 regexes. And then repeat that for each of your random permutations.

    What first struck me was that rather than building 100_000 10-50k random strings and asking the regex engine to search each of these 24 times, you should turn the thing around. Build a random 7-char string and match that against each of the 24 regexes. Then drop the first char and tack a new random char on the end. Rinse and repeat until your requirements are satisfied.

    Benchmark to test the efficiency of this approach

    That seems to run pretty quickly, taking around 30 seconds to check the 49,994 7-char substrings of each randomly generated 50,000-char string against each of 24 regexes. I'm not sure how this compares with letting the regex engine do all the work, but it seems like it might be quicker.
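
    For illustration, a rough sketch of that sliding-window idea (this is my own sketch, not the benchmark code referred to above; it reuses @bases and the qr-compiled @promoters_regex from earlier in the thread):

    my $window_len = 7;
    my $seq_len    = 50_000;
    my $windows    = $seq_len - $window_len + 1;    # 49,994 windows for a 50k string
    my $hits       = 0;

    # seed the window with 7 weighted-random bases
    my $window = join '', map { $bases[rand @bases] } 1 .. $window_len;

    for my $i (1 .. $windows) {
        # the patterns are 7 chars wide, so matching the 7-char window is
        # effectively an anchored match
        $hits += grep { $window =~ $_ } @promoters_regex;
        last if $i == $windows;
        substr($window, 0, 1, '');                  # drop the first char
        $window .= $bases[rand @bases];             # append a fresh random base
    }

    One difference to keep in mind: this counts a hit at every window position a pattern matches, whereas the //g scan in the original code skips matches that overlap a previous match of the same pattern.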

    The next logical step...

    ... would be to remove the regexes from the picture completely, by permuting each of them to produce every possible sequence it can match and then using index on the unique ones. E.g.

    The regex /[NT][GCATN][NG][NC][NG][NT][NG]/ can match (2 * 5 * 2 * 2 * 2 * 2 * 2) = 320 different 7-char sequences.

    And /[NG][NT][NG][NC][NG][GCATN][NT]/ can match (2 * 2 * 2 * 2 * 2 * 5 * 2) = 320 7-char sequences.

    However, whilst the 320 sequences are unique within each group, if you combine the two groups you get 608 instead of 640. And if you go through the process for the 4 example regexes that you provided, you end up with only 1205 unique 7-char sequences instead of 1280. Not a big saving, but worth having when you're iterating 100,000 times:).

    The analysis of the regexes, and a count of the number of unique 7-char sequences they would match, can be quickly performed using this program
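
    That program isn't reproduced above, but a rough sketch of one way to do the expansion, for fixed-width, class-only patterns given as strings (here in a hypothetical @promoter_patterns list), might be:

    # Expand a class-only pattern such as '[NT][GCATN][NG][NC][NG][NT][NG]'
    # into every concrete sequence it can match, then pool the expansions
    # and keep the unique ones.
    sub expand_pattern {
        my ($pat) = @_;
        my @slots = $pat =~ /\[([A-Z]+)\]/g;        # one character class per position
        my @seqs  = ('');
        for my $slot (@slots) {
            @seqs = map { my $prefix = $_; map { $prefix . $_ } split //, $slot } @seqs;
        }
        return @seqs;
    }

    my %unique;
    $unique{$_}++ for map { expand_pattern($_) } @promoter_patterns;
    printf "%d unique sequences\n", scalar keys %unique;

    # Counting pass: one hash lookup per window instead of 24 regex matches.
    my $hits = 0;
    for my $i (0 .. length($string) - 7) {
        $hits++ if $unique{ substr($string, $i, 7) };
    }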

    Then I got to thinking about what you are actually measuring.

    What are you actually measuring?

    You're counting how many times any of the 7-byte sequences turns up in a large number of random strings that you generate, with the frequency of each character weighted according to the proportions in your original 'start' sequence.

    What you appear to be doing is a statistical analysis of the chance that any of a given set of 7-byte sequences (as represented by your 24 regex) is likely to appear in any particular start_sequence of 50k chars.

    No need to search

    It was then that it struck me that you don't need to do any searching at all!

    Assuming (as you are assuming:) that the random number generator is "truly random", then what you are doing can be reduced to a simple(*) probability calculation.

    (*) Now, I say simple, but I will admit that my math is not sufficient to work out the formula required to calculate the probability in question, but I do know that there are others here that could. The formula should be something along the lines of:

    Crass at math, (especially stats:()

    Given a string of 50k chars that contains the chars ('A', 'C', 'G', 'T', 'N') in the proportions (a, c, g, t, n) respectively, there are P = 7P5 possible 7-char combinations. Given these 24 regexes, which can match 24 * 320 = 7680 sequences (minus a few for non-unique overlap), the probability that the 50k string will contain each of those 7680 (minus a few) is 1 - (P/M)???

    Formula wanted

    Now that's a long way from being the correct formula, and I get lost in the math when you throw in the weighting factors, but I'm sure that this could be reduced to such a formula.

    All you would need is the weights of the characters plus an analysis of the regexes and your average permutations count should be directly derivable.
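
    For what it's worth, one piece of the math is straightforward if you assume each character of the random string is drawn independently with the observed base frequencies: the chance that a single 7-char window matches a pattern is the product, over its 7 positions, of the summed frequencies of the characters allowed at that position, and by linearity of expectation the expected number of matching windows is that probability times the number of windows. A sketch (mine, not from the thread; %bases and $sub are from the original post, @promoter_patterns is the list of class-only pattern strings):

    # Expected matches per simulated sequence, under an independence assumption.
    my $seq_len = 50_000;                           # length of each simulated string
    my %p = map { $_ => ($bases{lc $_} || 0) / length($sub) } qw(G C A T N);

    my $expected = 0;
    for my $pat (@promoter_patterns) {
        my @slots = $pat =~ /\[([A-Z]+)\]/g;        # allowed characters per position
        my $p_window = 1;
        for my $slot (@slots) {
            my $p_slot = 0;
            $p_slot += $p{$_} for split //, $slot;
            $p_window *= $p_slot;
        }
        $expected += ($seq_len - @slots + 1) * $p_window;
    }
    printf "expected matches per sequence: %.3f\n", $expected;

    This counts every window that matches each pattern, so it will sit a little above the //g tally in the original code (which skips overlapping matches of the same pattern), but averaged over 100,000 simulated strings the two should be close.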

    Over to someone else for the real math, always assuming any of what I've written is applicable to your desired goal.

    Sorry for the spelling, grammar and typos, but now I've got a headache, so I'm going to lie down:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: Performance Tuning: Searching Long-Sequence Permutations
by sgifford (Prior) on Jun 26, 2003 at 20:49 UTC
    You may find that studying $string after creating it will speed things up. You may find it won't, too, but it's worth a try.
Re: Performance Tuning: Searching Long-Sequence Permutations
by thraxil (Prior) on Jun 26, 2003 at 20:49 UTC

    is there any way you can first use something like BLAST to cut down the search space? i don't really know enough about your application, but trying to find a way to let BLAST do some of the work would be my first instinct. maybe you can get BLAST to give you a list of the substrings with very rough alignments (which it will do very quickly. 10-50k base-pairs is nothing to BLAST) and then just search through those with your regexps.

    anders pearson

      An excellent idea. I would add to it by suggesting that you implement a parser for the resulting BLAST report. The BLAST algorithm extends a maximal-scoring segment and will pick up substrings that perhaps do not meet minimum length requirements, etc. This is where the parser comes in. Set up bioperl (www.bioperl.org) on your system and use Bio::SearchIO for parsing your report for relevant hits.
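
      A minimal Bio::SearchIO skeleton for that (the report file name and e-value cutoff below are placeholders):

      use Bio::SearchIO;

      my $searchio = Bio::SearchIO->new(
          -format => 'blast',
          -file   => 'report.blast',               # placeholder file name
      );

      while (my $result = $searchio->next_result) {
          while (my $hit = $result->next_hit) {
              while (my $hsp = $hit->next_hsp) {
                  next if $hsp->evalue > 1e-3;     # placeholder significance cutoff
                  printf "%s: %d-%d (e=%g)\n",
                      $hit->name, $hsp->start('hit'), $hsp->end('hit'), $hsp->evalue;
              }
          }
      }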

        I definitely agree, Bioperl is the best choice for writing parsers for most sequence or alignment formats, such as FASTA, BLAST, ClustalX, etc.

        -Tats

      Sadly I can't use BLAST for this case because BLAST does not account for "fuzzy matching". I'll explain that in a second, but in case anyone isn't familiar with it, let me describe BLAST a bit.

      BLAST = Basic Local Alignment Search Tool. It is a way of solving the general problem of aligning two sequences. There is an optimal solution to the alignment of two sequences (a variation of dynamic programming, termed Smith-Waterman alignment). Unfortunately that solution is slow (O(n²), I believe). BLAST is much faster at aligning a single sequence against a database of sequences. What it does is search each member of the database for short, exact matches from the query sequence. If it does not find these matches, it does not attempt full dynamic programming on that sequence. In other words, BLAST makes the assumption that two sequences with high similarity will have short subsequences of extremely high similarity. One of the big advantages of BLAST is that the statistics of BLAST searching can be estimated (it approximates an extreme-value distribution for the general case). This allows the assignment of a probability value to a similarity, and biologists like p-values.

      The scoring system of BLAST is based on two things: a penalty for introducing a gap, and a scoring matrix that gives the +ve score for any given match and the -ve score for any given mismatch. This system is not flexible enough to do the kind of searching I am looking at for three reasons:

      1. Most BLAST software cannot handle matching of N's (an N is an ambiguous base -- it could be any of the other four).
      2. No BLAST software can handle matching of the other ambiguity codes (e.g. R for purines, Y for pyrimidines, etc.).
      3. BLAST can't handle "variable" gaps -- for instance, one couldn't duplicate the search /GC.{10,13}AT/ in a BLAST search.

      So, unfortunately this is a problem out of the scope of sequence alignments. Also, sequence alignment tools generally work well with large sequences, not with 10-20 letter ones. Some types of BLAST will actually be unable to perform alignments with sequences below 7 or 8 letters.

      Sorry for the long explanation, but it's a really good idea and I wanted to explain why it wouldn't work for my situation.

      -Tats
Re: Performance Tuning: Searching Long-Sequence Permutations
by Itatsumaki (Friar) on Jun 27, 2003 at 17:07 UTC

    Thanks everybody for the suggestions. I'm testing them out, and looking at BrowserUK's idea of trying to get around the permutation with statistics: the general problem is mathematically unsolvable (it breaks down to a large set of recurrence relations, I've been told), but some of his ideas might be dramatically more efficient. Thanks again -- I'm overwhelmed by the responses.

    -Tats
