Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation

Re: Finding Neighbours of a String

by inman (Curate)
on Mar 01, 2006 at 11:35 UTC ( #533623=note: print w/replies, xml ) Need Help??

in reply to Finding Neighbours of a String

I have used the Math::Combinatorics to work out the combinations and permutations needed. The aim is to work out all of the ways in which a string of a certain hamming distance could be created rather than create all the possible permutations of the string and test them.

use strict; use warnings; use Data::Dumper; use Math::Combinatorics; my $str = "TTTCGG"; my @str = split //, $str; my $hammingDistance = 2; # Calculate the total number of ways of getting the bases my %bases; my @baseopts = split //, 'ACGT' x $hammingDistance; my $basecomb = Math::Combinatorics->new(count => $hammingDistance, dat +a => [@baseopts], ); while(my @combo = $basecomb->next_combination){ my $permu = Math::Combinatorics->new(data => [@combo], ); while(my @permu = $permu->next_permutation){ $bases{join '', @permu}++; } } # Make the list unique my @baseperms; push @baseperms, [split //, $_] foreach (keys %bases); my %results; my @n = (0 .. $#str); # Calculate all the permutations of position that could give a change # and work through the base combinations my $poscomb = Math::Combinatorics->new(count => $hammingDistance, data + => [@n], ); while(my @combo = $poscomb->next_combination){ foreach my $bases (@baseperms){ my @newstr = @str; @newstr[@combo] = @$bases; $results{ join('', @newstr) }++; } } print "$_ ", hd($_, $str), "\n" foreach (sort keys %results); sub hd { return ( $_[0] ^ $_[1] ) =~ tr/\001-\255//; }

Replies are listed 'Best First'.
Re^2: Finding Neighbours of a String
by Aristotle (Chancellor) on Mar 01, 2006 at 11:44 UTC

    Your approach generates oodles of duplicates which it then filters back out by eating up memory for hashes. The approach I outlined above generates no dupes to begin with.

    I didnít know about Math::Combinatorics though; nice module. It did annoy me that I had to use two different modules with confusing differences in their APIs. Iíll update my code to use M::C instead. Nope, doesnít help, still need S::CP.

    Makeshifts last the longest.

      Is this such a bad thing? The OP suggests that he is going to do this for a large number of strings. My code builds up the combinations first in a preparation step and then calculates the different strings that have the desired HD. The prep work can then be re-used for all strings of the same length.

        Depends on just how many dupes you produce to get there. If you have to throw away more dupes than you generated valid neighbours in the first place, it seems much better to invest a fraction of the effort in redoing the combinatorics over and over. You get to save all the memory too.

        The first versions of the approach I went with were not directly designed to avoid duplicates, and produced nearly 4◊ as many results as there were unique results, for a Hamming distance of 2 on a string of length 6. I assume that as numbers go up, any approach that does not avoid dupes to begin with will waste humongous amounts of time on them. Of course this is relatively off-the-cuff; I havenít reasoned it deeply, so it might not be as bad as I think.

        Makeshifts last the longest.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://533623]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (3)
As of 2019-05-20 23:00 GMT
Find Nodes?
    Voting Booth?
    Do you enjoy 3D movies?

    Results (129 votes). Check out past polls.