BhariD has asked for the wisdom of the Perl Monks concerning the following question:

Here is a part of the code that I have pasted followed by my question. This program uses, perl and one function below uses bioperl to obtain all the possible sequences from a given degenerate iupac string

For example, I will use the iupac string here as:

'NNYDBAVDVHVHNGGNR'

Here, A,C,G, or T are single characters, everything else either carries 2, 3 or 4 characters as follows:

N => ACGT

Y => CT

D => AGT

B => CGT

V => ACG

H => ACT

R => AG

So we can see how many possible combinations can come out of this iupac string probably >10000000 seqs with only A,C,G or T characters

```
#!/usr/bin/perl
use strict;
use warnings;

use Bio::SeqIO;
use Bio::SeqFeature::Primer;
use Bio::Tools::IUPAC;
use Bio::Tools::SeqPattern;

#some parameters for the equation to compute Tm
my \$conc_primer = 0.25;
my \$conc_salt = 50;
my \$mgconc = 0;

my \$iupac = 'NNYDBAVDVHVHNGGNR';

#this calls the function to generate all possible seqs from iupac stri
+ng
my @PotSeqs = find_degenerates(\$iupac);
my @Tm_Thermodynamic=();

foreach my \$PotSeqs(@PotSeqs){
#this call the function to compute tm for each seq
my \$Tm_Thermodynamic = tm_Base_Stacking(\$PotSeqs,\$conc_primer,\$conc_sa
+lt,\$mgconc);
push(@Tm_Thermodynamic, \$Tm_Thermodynamic);

#I have to add a function here to get the min and max tm values
+from the Tm_thermodynamic array filled above
# printf "%.4f", \$Tm_Thermodynamic_min;
# print "\t";
# printf "%.4f", \$Tm_Thermodynamic_max;
# print "\n";
}

#function to get all possible combination of an iupac string
sub find_degenerates
{
my \$sequence = \$_[0];
my \$seq_obj = Bio::Seq->new(-seq => \$sequence, -alphabet => 'dna')
+;
my \$stream  = Bio::Tools::IUPAC->new(-seq => \$seq_obj);

my @oligomers;
while (my \$uniqueseq = \$stream->next_seq()) {
push @oligomers, \$uniqueseq->seq; }
return @oligomers;
}

#Function to compute the Temperature for a given sequence (sequence va
+lid only if contains A,C,G or T)

sub tm_Base_Stacking
{
my (\$c,\$conc_primer,\$conc_salt,\$conc_mg) = @_;
my \$c_len = length(\$c);
my \$h=0;
my \$s=0;

# enthalpy values
my %array_h = (  'AA'=> '-7.9',
'AC'=> '-8.4',
'AG'=> '-7.8',
'AT'=> '-7.2',
'CA'=> '-8.5',
'CC'=> '-8.0',
'CG'=> '-10.6',
'CT'=> '-7.8',
'GA'=> '-8.2',
'GC'=> '-10.6',
'GG'=> '-8.0',
'GT'=> '-8.4',
'TA'=> '-7.2',
'TC'=> '-8.2',
'TG'=> '-8.5',
'TT'=> '-7.9'
);

# entropy values
my %array_s = (  'AA'=> '-22.2',
'AC'=> '-22.4',
'AG'=> '-21.0',
'AT'=> '-20.4',
'CA'=> '-22.7',
'CC'=> '-19.9',
'CG'=> '-27.2',
'CT'=> '-21.0',
'GA'=> '-22.2',
'GC'=> '-27.2',
'GG'=> '-19.9',
'GT'=> '-22.4',
'TA'=> '-21.3',
'TC'=> '-22.2',
'TG'=> '-22.7',
'TT'=> '-22.2'
);

#correction for salt concentration
my \$salt_effect= (\$conc_salt/1000)+((\$conc_mg/1000) * 140);
# effect on entropy
\$s+=0.368 * (\$c_len-1)* log(\$salt_effect);

# compute new H and s based on sequence.
for(my \$i=0; \$i<\$c_len-1; \$i++){
my \$subc=substr(\$c,\$i,2);
\$h+=\$array_h{\$subc};
\$s+=\$array_s{\$subc};
}

my \$rlnk = (\$conc_primer/2000000000);
my \$r = log10(\$rlnk);

#equation to compute the Tm
my \$tm=((1000*\$h)/(\$s+(1.987*\$r)))-273.15;
return \$tm;
}

#function to take the log
sub log10 {
my \$n = \$_[0];
return log(\$n)/log(10);
}

my question is how can I make this efficient. I only am looking for the range that is Min and Max Tm or temperature values for a given iupac string. Here, what I am doing is I am computing Tm for every possible seqs of a given iupac to get the accurate range, but this is too slow and inefficient.. I am running totally out of memory and it becomes worse when iupac degeneracy increases. Is there a way I can make it efficient and still be able to get more or less an accurate range (min and max Tm values)?

I would appreciate any help/suggestions. Please let me know if I need to further explain anything.

• Comment on how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation

Replies are listed 'Best First'.
Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by almut (Canon) on Apr 19, 2010 at 19:19 UTC
```while (my \$uniqueseq = \$stream->next_seq()) {
push @oligomers, \$uniqueseq->seq; }

Instead of expanding all possible combinations into an array that you then iterate over, why not directly call tm_Base_Stacking(...) (or tm_NNH() for that matter(?)) in that loop?

That should at least get rid of the memory problem (unless the Bio::Tools::IUPAC iterator internally pre-expands the set itself), although the runtime problem would likely still persist...

Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by kennethk (Abbot) on Apr 19, 2010 at 19:11 UTC
I note that at no point do you call tm_NNH, which I assume to be your temperature calculation, and you call the undefined subroutine tm_Base_Stacking which has the same argument list. You are also missing a sigil (\$) in front of iupac. This tells me you did not publish the code that is actually giving you difficulty. Please post the code you are using in the future, or at least make sure Perl can compile what you post - see How do I post a question effectively?.

A look at your tm_NNH subroutine as posted shows no obvious reason why you would have memory leaks. I also note that your find_degenerates subroutine creates a large number of objects - in my mind, these would be the most likely source of your memory woes. I would suggest using Devel::Monitor to make sure the objects are actually being garbage collected.

Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by BrowserUk (Pope) on Apr 19, 2010 at 19:22 UTC
• How big is the array returned by find_degenerates()?
• How long does it take now?
• Can you post the sequence and calculated temperature for the 10 lowest and 10 highest.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by kyle (Abbot) on Apr 19, 2010 at 21:47 UTC

I suggest you run your program under Devel::NYTProf to find out where it's spending all its time. Then you can focus on where the problem really is.

Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by MadraghRua (Vicar) on Apr 20, 2010 at 18:25 UTC
Treat it less as a brute force challenge and think about your data. You have a sting
NNYDBAVDVHVHNGGNR

What parameters will increase the Tm? - C and G residues
What parameters will decrease the Tm? - A and T residues

For each residue, find if IUPAC can give you a C or a G for the max temperature at that residue, then a T or an A for the min temperature.

If you really need to calculate over large numbers of primers then divide the primers up into percentiles, eg 90% GC, 80% GC and so on. Its an approximation but good enough for what I think you are doing.

If you need more accuracy yet, look at your hash values - this shows you the residue pairs that give you most and least values, so see waht pairs of residues are likely for the extremes of these values and bias the calculations to those primers

Just out of curiosity where are you getting the enthalpy and entrophy values from? St. Lucia?

One other thought - if you want accurate temps, you're going to want to calculate on the monovalent and divalent ion concentrations. I suggest looking at Primer 3 for ways to calculate these values. Or simply call Primer3 rather than reinventing this in your app - it gives this information anyway.

Good luck

yet another biologist hacking perl....

Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by leocharre (Priest) on Apr 19, 2010 at 19:04 UTC
Re: how to make this code more efficient? Need to Compute the Min and Max of a Temperature using an equation
by NiJo (Friar) on Apr 21, 2010 at 21:22 UTC
Your program currently uses only 1 CPU. A cheap way to parallel programming is splitting your task into a 'string expander' that sends expanded strings over one by one and a seperate min/max searcher called on the shell as e. g.
```string_expander <start string> | minmax