in reply to Finding Neighbours of a String
I have used Math::Combinatorics to work out the combinations and permutations needed. The aim is to work out all of the ways in which a string at a given Hamming distance could be created, rather than creating every possible permutation of the string and testing each one.
use strict;
use warnings;
use Data::Dumper;
use Math::Combinatorics;

my $str = "TTTCGG";
my @str = split //, $str;
my $hammingDistance = 2;

# Calculate the total number of ways of getting the bases
my %bases;
my @baseopts = split //, 'ACGT' x $hammingDistance;
my $basecomb = Math::Combinatorics->new(count => $hammingDistance, data => [@baseopts], );
while (my @combo = $basecomb->next_combination) {
    my $permu = Math::Combinatorics->new(data => [@combo], );
    while (my @permu = $permu->next_permutation) {
        $bases{join '', @permu}++;
    }
}

# Make the list unique
my @baseperms;
push @baseperms, [split //, $_] foreach (keys %bases);

my %results;
my @n = (0 .. $#str);

# Calculate all the combinations of positions that could give a change
# and work through the base combinations
my $poscomb = Math::Combinatorics->new(count => $hammingDistance, data => [@n], );
while (my @combo = $poscomb->next_combination) {
    foreach my $bases (@baseperms) {
        my @newstr = @str;
        @newstr[@combo] = @$bases;
        $results{ join('', @newstr) }++;
    }
}

print "$_ ", hd($_, $str), "\n" foreach (sort keys %results);

# Hamming distance: XOR the strings and count the non-null bytes
sub hd {
    return ( $_[0] ^ $_[1] ) =~ tr/\001-\255//;
}
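The hd() routine relies on Perl's string XOR: bytes that match XOR to a null byte, bytes that differ XOR to something non-null, and tr/// counts the latter. A minimal standalone sketch of that idea follows; the name hamming() and the equal-length check are additions for illustration and are not part of the posted code.

```perl
use strict;
use warnings;

# Hamming distance of two equal-length strings via string XOR.
# hamming() is a hypothetical name; the length check is an added assumption.
sub hamming {
    my ($s, $t) = @_;
    die "strings must have equal length\n" unless length $s == length $t;
    my $diff = $s ^ $t;                 # "\0" wherever the bytes agree
    return $diff =~ tr/\001-\377//;     # count the bytes that are not "\0"
}

print hamming("TTTCGG", "TTTCGG"), "\n";   # identical strings: 0
print hamming("TTTCGG", "ATTCGA"), "\n";   # first and last base differ: 2
```

Copying the XOR result into $diff before counting keeps tr/// working on an ordinary variable.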
Re^2: Finding Neighbours of a String by Aristotle (Chancellor) on Mar 01, 2006 at 11:44 UTC 
Your approach generates oodles of duplicates which it then filters back out by eating up memory for hashes. The approach I outlined above generates no dupes to begin with.
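One way to avoid duplicates from the start is to pick each set of positions once and, at every chosen position, offer only the bases that differ from the original. The sketch below illustrates that idea with core Perl only; it is not the code referred to above, and combinations() and neighbours() are hypothetical names.

```perl
use strict;
use warnings;

# All k-element subsets of 0 .. $n-1, as array refs in increasing order.
sub combinations {
    my ($n, $k) = @_;
    return ([]) if $k == 0;
    my @out;
    for my $first (0 .. $n - $k) {
        # Recurse on the elements after $first, shifting indices back up.
        push @out, map { [ $first, map { $_ + $first + 1 } @$_ ] }
            combinations($n - $first - 1, $k - 1);
    }
    return @out;
}

# Every string at exactly Hamming distance $d from $str, each produced once.
sub neighbours {
    my ($str, $d, @alphabet) = @_;
    @alphabet = qw(A C G T) unless @alphabet;
    my @chars = split //, $str;
    my @found;
    for my $positions (combinations(scalar @chars, $d)) {
        my @partial = ([@chars]);
        for my $pos (@$positions) {
            my @next;
            for my $base (@partial) {
                # Only letters that differ from the original at this position.
                for my $letter (grep { $_ ne $chars[$pos] } @alphabet) {
                    my @copy = @$base;
                    $copy[$pos] = $letter;
                    push @next, \@copy;
                }
            }
            @partial = @next;
        }
        push @found, map { join '', @$_ } @partial;
    }
    return @found;
}

my @near = neighbours("TTTCGG", 2);
my %unique;
$unique{$_}++ for @near;
printf "%d generated, %d unique\n", scalar @near, scalar keys %unique;
```

Because every chosen position is forced to change, no result can arise from two different position sets, so no hash is needed to filter the output. Unlike the posted code, this yields only strings at exactly the requested distance, not those at smaller distances.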
I didn’t know about Math::Combinatorics though; nice module. It did annoy me that I had to use two different modules with confusing differences in their APIs. I’ll update my code to use M::C instead. Nope, doesn’t help, still need S::CP.
Makeshifts last the longest.

Is this such a bad thing? The OP suggests that he is going to do this for a large number of strings. My code builds up the combinations first in a preparation step and then calculates the different strings that have the desired HD. The prep work can then be reused for all strings of the same length.
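A rough sketch of that split between preparation and per-string work, using core Perl only: the position sets and base tuples depend only on the string length and the distance, so they are built once and then applied to each string. make_plan() and within_distance() are hypothetical names, and the second test string is made up for illustration.

```perl
use strict;
use warnings;

# Preparation step: depends only on length, distance and alphabet.
sub make_plan {
    my ($len, $d, @alphabet) = @_;
    my @positions = ([]);   # grows into all increasing position sets of size $d
    my @tuples    = ([]);   # grows into all base tuples of length $d
    for my $round (1 .. $d) {
        my (@next_pos, @next_tup);
        for my $set (@positions) {
            my $from = @$set ? $set->[-1] + 1 : 0;
            push @next_pos, [ @$set, $_ ] for $from .. $len - 1;
        }
        for my $tuple (@tuples) {
            push @next_tup, [ @$tuple, $_ ] for @alphabet;
        }
        @positions = @next_pos;
        @tuples    = @next_tup;
    }
    return { positions => \@positions, tuples => \@tuples };
}

# Per-string step: apply the plan, using a hash to drop duplicates.
sub within_distance {
    my ($plan, $str) = @_;
    my @chars = split //, $str;
    my %seen;
    for my $set (@{ $plan->{positions} }) {
        for my $tuple (@{ $plan->{tuples} }) {
            my @new = @chars;
            @new[@$set] = @$tuple;
            $seen{ join '', @new }++;
        }
    }
    return sort keys %seen;
}

my $plan = make_plan(6, 2, qw(A C G T));
for my $str (qw(TTTCGG ACGTAC)) {
    my @near = within_distance($plan, $str);
    printf "%s: %d strings within distance 2\n", $str, scalar @near;
}
```

The plan is reusable, but the per-string step still writes every candidate into a hash, duplicates included.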

Depends on just how many dupes you produce to get there. If you have to throw away more dupes than you generated valid neighbours in the first place, it seems much better to invest a fraction of the effort in redoing the combinatorics over and over. You get to save all the memory too.
The first versions of the approach I went with were not directly designed to avoid duplicates, and produced nearly 4× as many results as there were unique results, for a Hamming distance of 2 on a string of length 6. I assume that as numbers go up, any approach that does not avoid dupes to begin with will waste humongous amounts of time on them. Of course this is relatively off-the-cuff; I haven’t reasoned about it deeply, so it might not be as bad as I think.
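For a sense of scale, the counts for the code posted at the top of the thread can be worked out directly. This is a back-of-the-envelope sketch with hypothetical helper names, and it says nothing about the early versions mentioned above. With length 6, distance 2 and a four-letter alphabet, that code writes C(6,2) × 4² = 240 candidates into its results hash; 154 of them are distinct, and 135 of those lie at exactly distance 2.

```perl
use strict;
use warnings;

# Binomial coefficient C(n, k); every intermediate value is an integer.
sub choose {
    my ($n, $k) = @_;
    my $r = 1;
    $r = $r * ($n - $k + $_) / $_ for 1 .. $k;
    return $r;
}

# Candidates produced: every position set times every base tuple.
sub generated {
    my ($len, $d, $alpha) = @_;
    return choose($len, $d) * $alpha ** $d;
}

# Distinct strings at distance 0 .. $d from the original.
sub unique_within {
    my ($len, $d, $alpha) = @_;
    my $total = 0;
    $total += choose($len, $_) * ($alpha - 1) ** $_ for 0 .. $d;
    return $total;
}

# Strings at exactly distance $d.
sub exactly {
    my ($len, $d, $alpha) = @_;
    return choose($len, $d) * ($alpha - 1) ** $d;
}

for my $d (1 .. 3) {
    printf "d=%d: %d generated, %d unique, %d at exactly d\n",
        $d, generated(6, $d, 4), unique_within(6, $d, 4), exactly(6, $d, 4);
}
```

The gap between generated and unique candidates is the work spent on duplicates for a single string.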
Makeshifts last the longest.
