Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris

Finding Neighbours of a String

by monkfan (Curate)
on Mar 01, 2006 at 09:33 UTC ( #533604=perlquestion: print w/replies, xml ) Need Help??
monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Dear Fellow Monks,
Suppose I have a string ($str) and also the number of maximum mismatch position is given ($d).
$str = 'TTTCGG'; # (length 6) $d = 2;
What I intend to do is to find all the string of of length 6 that has maximum Hamming Distance (number of mismatches) is less or equal to $d. These strings are constructed with bases [ATCG].

I already have a brute-force way to do it. That is to pre-generate as many as 4^l, all the strings of length $l, and then find the neighbours from there.

But this way is way too time consuming to do it. Since there are many many strings to test. And also the length of the string is around 12-20 characters. Can anybody advice what's the best way to go about it?

The script that does brute-force way is this:
use strict; use warnings; use Data::Dumper; my $l = 6; #Motif Length my $d = 2; my $str = 'TTTCGG'; my @nucs = qw/A T C G/; my @enum = enum( $l, \@nucs ); foreach my $oligo ( @enum ) { if ( hd ($oligo,$str) <= $d ) { print "$oligo\n"; } } sub hd { return ( $_[0] ^ $_[1] ) =~ tr/\001-\255//; } sub enum { return @{ $_[1] } unless --$_[0]; map { my $nuc = $_; map { $nuc . $_ } @{ $_[1] } } enum( $_[0], $_[ 1 ] ); }


Replies are listed 'Best First'.
Re: Finding Neighbours of a String
by Aristotle (Chancellor) on Mar 01, 2006 at 10:04 UTC
    use Algorithm::Combinatorics qw( combinations ); use Set::CrossProduct; my @base = split //, $str; for my $exact_distance ( 1 .. $d ) { my $change_idx_iter = combinations( [ 0 .. $#base ], $exact_distan +ce ); while( my $change_idx = $change_idx_iter->next ) { my @base_combo = map { my $i = $_; [ grep { $base[ $i ] ne $_ } qw( A T C G ) ]; } @$change_idx; # HACK: Set::CrossProduct doesnít work with a 1-dimensional ma +trix push @base_combo, [ 0 ] if $exact_distance == 1; my $bases_iter = Set::CrossProduct->new( \@base_combo ); my @neighbour = @base; while( my $new_bases = $bases_iter->get ) { @neighbour[ @$change_idx ] = @$new_bases; print @neighbour, "\n"; } } }

    Updates: changed to use combinations vs variations and to grep out non-changes, so that it will produce no duplicates.

    To see whatís going on, add the following line before the print:

    $_ = "[$_]" for @neighbour[ @$change_idx ];

    Makeshifts last the longest.

      Dear Aristotle,

      Sorry for coming back to you again. How can I extend/modify your invaluable code above so that it can handle ambiguous string like this:
      $str = '[TCG]TTCG[AT]';
      The idea is exactly the same as my OP, only this time those characters under brackets [] is also considered as mismatch possibilities.

      Please kindly keep, your original answer.


        Ah, thanks for your clarification. Thatís easy: change the split line to

        my @base = $str =~ /\G ( \[ [^][]+ \] | [^][] ) /xg;

        which will parse the string into units of either a single letter or a bracketed sequence, and change the grep line to

        [ grep { $base[ $i ] !~ $_ } qw( A T C G ) ];

        so that the letter in question will be thrown out if it matches anywhere in a bracketed sequence.

        Thatís all, youíre done.

        Makeshifts last the longest.

        What exactly do you mean by ďalso considered as mismatch possibilities?Ē

        Makeshifts last the longest.

Re: Finding Neighbours of a String
by inman (Curate) on Mar 01, 2006 at 11:35 UTC
    I have used the Math::Combinatorics to work out the combinations and permutations needed. The aim is to work out all of the ways in which a string of a certain hamming distance could be created rather than create all the possible permutations of the string and test them.

      Your approach generates oodles of duplicates which it then filters back out by eating up memory for hashes. The approach I outlined above generates no dupes to begin with.

      I didnít know about Math::Combinatorics though; nice module. It did annoy me that I had to use two different modules with confusing differences in their APIs. Iíll update my code to use M::C instead. Nope, doesnít help, still need S::CP.

      Makeshifts last the longest.

        Is this such a bad thing? The OP suggests that he is going to do this for a large number of strings. My code builds up the combinations first in a preparation step and then calculates the different strings that have the desired HD. The prep work can then be re-used for all strings of the same length.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://533604]
Approved by McDarren
and all is calm...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (3)
As of 2017-06-29 06:36 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (654 votes). Check out past polls.