Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation

by BhariD (Sexton)
on Apr 19, 2010 at 18:37 UTC ( #835568=perlquestion: print w/ replies, xml ) Need Help??
BhariD has asked for the wisdom of the Perl Monks concerning the following question:

Here is a part of the code that I have pasted followed by my question. This program uses, perl and one function below uses bioperl to obtain all the possible sequences from a given degenerate iupac string

For example, I will use the iupac string here as:

'NNYDBAVDVHVHNGGNR'

Here, A,C,G, or T are single characters, everything else either carries 2, 3 or 4 characters as follows:

N => ACGT

Y => CT

D => AGT

B => CGT

V => ACG

H => ACT

R => AG

So we can see how many possible combinations can come out of this iupac string probably >10000000 seqs with only A,C,G or T characters

#!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; use Bio::SeqFeature::Primer; use Bio::Tools::IUPAC; use Bio::Tools::SeqPattern; #some parameters for the equation to compute Tm my $conc_primer = 0.25; my $conc_salt = 50; my $mgconc = 0; my $iupac = 'NNYDBAVDVHVHNGGNR'; #this calls the function to generate all possible seqs from iupac stri +ng my @PotSeqs = find_degenerates($iupac); my @Tm_Thermodynamic=(); foreach my $PotSeqs(@PotSeqs){ #this call the function to compute tm for each seq my $Tm_Thermodynamic = tm_Base_Stacking($PotSeqs,$conc_primer,$conc_sa +lt,$mgconc); push(@Tm_Thermodynamic, $Tm_Thermodynamic); #I have to add a function here to get the min and max tm values +from the Tm_thermodynamic array filled above # printf "%.4f", $Tm_Thermodynamic_min; # print "\t"; # printf "%.4f", $Tm_Thermodynamic_max; # print "\n"; } #function to get all possible combination of an iupac string sub find_degenerates { my $sequence = $_[0]; my $seq_obj = Bio::Seq->new(-seq => $sequence, -alphabet => 'dna') +; my $stream = Bio::Tools::IUPAC->new(-seq => $seq_obj); my @oligomers; while (my $uniqueseq = $stream->next_seq()) { push @oligomers, $uniqueseq->seq; } return @oligomers; } #Function to compute the Temperature for a given sequence (sequence va +lid only if contains A,C,G or T) sub tm_Base_Stacking { my ($c,$conc_primer,$conc_salt,$conc_mg) = @_; my $c_len = length($c); my $h=0; my $s=0; # enthalpy values my %array_h = ( 'AA'=> '-7.9', 'AC'=> '-8.4', 'AG'=> '-7.8', 'AT'=> '-7.2', 'CA'=> '-8.5', 'CC'=> '-8.0', 'CG'=> '-10.6', 'CT'=> '-7.8', 'GA'=> '-8.2', 'GC'=> '-10.6', 'GG'=> '-8.0', 'GT'=> '-8.4', 'TA'=> '-7.2', 'TC'=> '-8.2', 'TG'=> '-8.5', 'TT'=> '-7.9' ); # entropy values my %array_s = ( 'AA'=> '-22.2', 'AC'=> '-22.4', 'AG'=> '-21.0', 'AT'=> '-20.4', 'CA'=> '-22.7', 'CC'=> '-19.9', 'CG'=> '-27.2', 'CT'=> '-21.0', 'GA'=> '-22.2', 'GC'=> '-27.2', 'GG'=> '-19.9', 'GT'=> '-22.4', 'TA'=> '-21.3', 'TC'=> '-22.2', 'TG'=> '-22.7', 'TT'=> '-22.2' ); #correction for salt concentration my $salt_effect= ($conc_salt/1000)+(($conc_mg/1000) * 140); # effect on entropy $s+=0.368 * ($c_len-1)* log($salt_effect); # compute new H and s based on sequence. for(my $i=0; $i<$c_len-1; $i++){ my $subc=substr($c,$i,2); $h+=$array_h{$subc}; $s+=$array_s{$subc}; } my $rlnk = ($conc_primer/2000000000); my $r = log10($rlnk); #equation to compute the Tm my $tm=((1000*$h)/($s+(1.987*$r)))-273.15; return $tm; } #function to take the log sub log10 { my $n = $_[0]; return log($n)/log(10); }

my question is how can I make this efficient. I only am looking for the range that is Min and Max Tm or temperature values for a given iupac string. Here, what I am doing is I am computing Tm for every possible seqs of a given iupac to get the accurate range, but this is too slow and inefficient.. I am running totally out of memory and it becomes worse when iupac degeneracy increases. Is there a way I can make it efficient and still be able to get more or less an accurate range (min and max Tm values)?

I would appreciate any help/suggestions. Please let me know if I need to further explain anything.

Comment on how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
Download Code
Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by leocharre (Priest) on Apr 19, 2010 at 19:04 UTC
Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by kennethk (Monsignor) on Apr 19, 2010 at 19:11 UTC
    I note that at no point do you call tm_NNH, which I assume to be your temperature calculation, and you call the undefined subroutine tm_Base_Stacking which has the same argument list. You are also missing a sigil ($) in front of iupac. This tells me you did not publish the code that is actually giving you difficulty. Please post the code you are using in the future, or at least make sure Perl can compile what you post - see How do I post a question effectively?.

    A look at your tm_NNH subroutine as posted shows no obvious reason why you would have memory leaks. I also note that your find_degenerates subroutine creates a large number of objects - in my mind, these would be the most likely source of your memory woes. I would suggest using Devel::Monitor to make sure the objects are actually being garbage collected.

Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by almut (Canon) on Apr 19, 2010 at 19:19 UTC
    while (my $uniqueseq = $stream->next_seq()) { push @oligomers, $uniqueseq->seq; }

    Instead of expanding all possible combinations into an array that you then iterate over, why not directly call tm_Base_Stacking(...) (or tm_NNH() for that matter(?)) in that loop?

    That should at least get rid of the memory problem (unless the Bio::Tools::IUPAC iterator internally pre-expands the set itself), although the runtime problem would likely still persist...

Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by BrowserUk (Pope) on Apr 19, 2010 at 19:22 UTC
    • How big is the array returned by find_degenerates()?
    • How long does it take now?
    • Can you post the sequence and calculated temperature for the 10 lowest and 10 highest.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by kyle (Abbot) on Apr 19, 2010 at 21:47 UTC

    I suggest you run your program under Devel::NYTProf to find out where it's spending all its time. Then you can focus on where the problem really is.

Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by MadraghRua (Vicar) on Apr 20, 2010 at 18:25 UTC
    Treat it less as a brute force challenge and think about your data. You have a sting
    NNYDBAVDVHVHNGGNR

    What parameters will increase the Tm? - C and G residues
    What parameters will decrease the Tm? - A and T residues

    For each residue, find if IUPAC can give you a C or a G for the max temperature at that residue, then a T or an A for the min temperature.

    If you really need to calculate over large numbers of primers then divide the primers up into percentiles, eg 90% GC, 80% GC and so on. Its an approximation but good enough for what I think you are doing.

    If you need more accuracy yet, look at your hash values - this shows you the residue pairs that give you most and least values, so see waht pairs of residues are likely for the extremes of these values and bias the calculations to those primers

    Just out of curiosity where are you getting the enthalpy and entrophy values from? St. Lucia?

    One other thought - if you want accurate temps, you're going to want to calculate on the monovalent and divalent ion concentrations. I suggest looking at Primer 3 for ways to calculate these values. Or simply call Primer3 rather than reinventing this in your app - it gives this information anyway.

    Good luck

    MadraghRua
    yet another biologist hacking perl....

Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by NiJo (Friar) on Apr 21, 2010 at 21:22 UTC
    Your program currently uses only 1 CPU. A cheap way to parallel programming is splitting your task into a 'string expander' that sends expanded strings over one by one and a seperate min/max searcher called on the shell as e. g.
    string_expander <start string> | minmax
    This does not need any large intermediate storage in RAM.

    The classical way to store large data structures is in a file created by string_expander and read by minmax line by line. The pipe saves that step and replaces it with standard output on string_expander and standard output of minmax.

    A premature micro optimisation is to move the calculation of salt effect out of the loop per string, as it is constant.

    This is still the brute force method, but done better. If you put more brain to the problem, probabilities of large numbers may play to your favor. I did not grasp your temperature calculation and final application completely, but somewhere in the large number of expanded strings the maximum and minimum 2 letter combinations of %array_h and %array_s are contained. The most probable answer for min/max temperature over all expanded strings depends only on string length and best/worst case substitions. As there is only one potenial substitution for each meta letter (N, Y...) and maximum temperature and another substitution for minimum temperature, you could do the whole calculation by hand.

    I have no idea what you are scientifically doing with the results of your calculations. But assuming a more or less linear relationship with your 30% spread in input variables (%array_s and %array_h), skipping the calculation might just have a +-15 % spread on output (assuming 3 standard deviations between min and max). Is 5% RSD really larger than the error margin of your theoretical calculation?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://835568]
Approved by Corion
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (10)
As of 2014-07-29 09:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (213 votes), past polls