Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic

Re: Finding Neighbours of a String

by inman (Curate)
on Mar 01, 2006 at 11:35 UTC ( #533623=note: print w/replies, xml ) Need Help??

in reply to Finding Neighbours of a String

I have used the Math::Combinatorics to work out the combinations and permutations needed. The aim is to work out all of the ways in which a string of a certain hamming distance could be created rather than create all the possible permutations of the string and test them.

use strict; use warnings; use Data::Dumper; use Math::Combinatorics; my $str = "TTTCGG"; my @str = split //, $str; my $hammingDistance = 2; # Calculate the total number of ways of getting the bases my %bases; my @baseopts = split //, 'ACGT' x $hammingDistance; my $basecomb = Math::Combinatorics->new(count => $hammingDistance, dat +a => [@baseopts], ); while(my @combo = $basecomb->next_combination){ my $permu = Math::Combinatorics->new(data => [@combo], ); while(my @permu = $permu->next_permutation){ $bases{join '', @permu}++; } } # Make the list unique my @baseperms; push @baseperms, [split //, $_] foreach (keys %bases); my %results; my @n = (0 .. $#str); # Calculate all the permutations of position that could give a change # and work through the base combinations my $poscomb = Math::Combinatorics->new(count => $hammingDistance, data + => [@n], ); while(my @combo = $poscomb->next_combination){ foreach my $bases (@baseperms){ my @newstr = @str; @newstr[@combo] = @$bases; $results{ join('', @newstr) }++; } } print "$_ ", hd($_, $str), "\n" foreach (sort keys %results); sub hd { return ( $_[0] ^ $_[1] ) =~ tr/\001-\255//; }

Replies are listed 'Best First'.
Re^2: Finding Neighbours of a String
by Aristotle (Chancellor) on Mar 01, 2006 at 11:44 UTC

    Your approach generates oodles of duplicates which it then filters back out by eating up memory for hashes. The approach I outlined above generates no dupes to begin with.

    I didnít know about Math::Combinatorics though; nice module. It did annoy me that I had to use two different modules with confusing differences in their APIs. Iíll update my code to use M::C instead. Nope, doesnít help, still need S::CP.

    Makeshifts last the longest.

      Is this such a bad thing? The OP suggests that he is going to do this for a large number of strings. My code builds up the combinations first in a preparation step and then calculates the different strings that have the desired HD. The prep work can then be re-used for all strings of the same length.

        Depends on just how many dupes you produce to get there. If you have to throw away more dupes than you generated valid neighbours in the first place, it seems much better to invest a fraction of the effort in redoing the combinatorics over and over. You get to save all the memory too.

        The first versions of the approach I went with were not directly designed to avoid duplicates, and produced nearly 4◊ as many results as there were unique results, for a Hamming distance of 2 on a string of length 6. I assume that as numbers go up, any approach that does not avoid dupes to begin with will waste humongous amounts of time on them. Of course this is relatively off-the-cuff; I havenít reasoned it deeply, so it might not be as bad as I think.

        Makeshifts last the longest.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://533623]
[Corion]: marto: How's things? I hope the kids are fine and you too!
[Corion]: Oh yay. I wonder why a very simple change in a program doesn't elicit a diff, and now I see that my diff program seems to have a bug ;)
[1nickt]: marto thanks for asking, so far so good. A pretty modern stack and decent procedures, although rather too much home-built stuff (e.g. a logging role that should tries to duplicate Log::Any).
[Corion]: No. It's just that I'm comparing the same output file twice, instead of comparing the output files of the two runs %-)
[Corion]: Lo and behold, running a program with the correct input files yields the correct (and expected) output. Yay me.
[1nickt]: Got a MacBook and am expected to develop directly on it, ironic given the recent thread about that.
[marto]: Corion, some not too serious issues with the kids, hopefully, other than that just dealing with commuting by car again in the winter, not much fun so far, and there's no real 'bad' weather yet :)

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (5)
As of 2017-12-11 11:39 GMT
Find Nodes?
    Voting Booth?
    What programming language do you hate the most?

    Results (289 votes). Check out past polls.