http://www.perlmonks.org?node_id=1029679

supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

I have a 10-letter string, $string="ATATGCGCAT", made up of the four letters A, T, G and C. I want to generate all possible combinations of the full 10-letter string, keeping every letter in its original position and allowing 2 levels (or a varying number of levels) for each of A, T, G and C. My script try.pl uses a sliding window of size 4, and I want to keep the window size as an option in the script. The reason is that when the string is longer than 40 letters and the letters have varying numbers of levels, the number of possible combinations becomes very large and cmd does not give the results. Using the window size, I first want to divide the string into fragments. Each smaller fragment will then be used to produce its own set of combinations. The first combination of the first fragment will be concatenated with the first combination of the second fragment, the result will be concatenated with the first combination of the third fragment, and so on until the entire length of the original string is covered. The other combinations will be produced in the same way.
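The fragmenting and index-wise joining described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: levels are assigned per position (as in the output shown in this post), every position has the same number of levels, and a fragment with fewer combinations is cycled to fill the remaining rows. The helper names fragments, fragment_combos and join_by_index are hypothetical and not part of try.pl.

```perl
use strict;
use warnings;

# Split $string into consecutive fragments of at most $window letters.
# The quantifier {1,N} keeps a shorter trailing fragment (here "AT").
sub fragments {
    my ($string, $window) = @_;
    return $string =~ /(.{1,$window})/g;
}

# All labelled forms of one fragment: each position takes a level 1..$levels.
# Assumption: levels are assigned per position, not per letter type.
sub fragment_combos {
    my ($fragment, $levels) = @_;
    my @combos = ('');
    for my $letter (split //, $fragment) {
        my @next;
        for my $prefix (@combos) {
            push @next, "$prefix$letter$_" for 1 .. $levels;
        }
        @combos = @next;
    }
    return @combos;
}

# Join the i-th combination of every fragment (hypothetical helper).
# Assumption: a shorter list is cycled with the modulo operator.
sub join_by_index {
    my @sets = @_;                      # array refs, one per fragment
    my $rows = 0;
    for my $set (@sets) {
        $rows = scalar @$set if @$set > $rows;
    }
    my @joined;
    for my $i (0 .. $rows - 1) {
        push @joined, join '', map { $_->[ $i % scalar @$_ ] } @sets;
    }
    return @joined;
}

my @parts = fragments("ATATGCGCAT", 4);            # ATAT GCGC AT
my @sets  = map { [ fragment_combos($_, 2) ] } @parts;
my @rows  = join_by_index(@sets);
print "~$_\n" for @rows;
print "~\n";
```

Under these assumptions the sketch prints 16 full-length strings, not the 1024 that exist for 10 positions with 2 levels each, because joining by index pairs the fragment sets row by row instead of taking their full cross product.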

The script try.pl that I have written produces combinations of varying sizes (from 1 to 8 letters only). In the output file I need only the combinations that have the actual length of the original string (i.e. 10 letters in this case), with each combination starting with the symbol "~" and ending with "~". I am at my wits' end trying to solve this problem.
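One way to get only the full-length combinations might look like the sketch below. It assumes that each position independently takes a level from 1 up to the number of levels configured for its letter, and that the string contains only letters listed in %levels. The names %levels and each_combination are hypothetical. The output goes to STDOUT here; a file handle could be used in the callback instead.

```perl
use strict;
use warnings;

# Assumption: how many levels each letter may take (2 each, as in the post).
my %levels = (A => 2, T => 2, G => 2, C => 2);

# Call $emit once for every full-length combination of $string.
# An odometer over the positions avoids holding all combinations in memory.
sub each_combination {
    my ($string, $levels, $emit) = @_;
    my @letters = split //, $string;
    for (@letters) {
        die "No level count for '$_'\n" unless $levels->{$_};
    }
    my @digit = (1) x @letters;            # current level at each position
    while (1) {
        $emit->(join '', map { $letters[$_] . $digit[$_] } 0 .. $#letters);
        my $pos = $#letters;               # advance the rightmost position first
        while ($pos >= 0 && $digit[$pos] == $levels->{ $letters[$pos] }) {
            $digit[$pos--] = 1;            # roll over and carry to the left
        }
        return if $pos < 0;                # every position rolled over: done
        $digit[$pos]++;
    }
}

my $count = 0;
each_combination("ATATGCGCAT", \%levels, sub {
    print "~$_[0]\n";                      # each combination starts with "~"
    $count++;
});
print "~\n";                               # closing "~" after the last one
```

For the 10-letter string with 2 levels per letter this emits 1024 lines of 10 labelled letters each. Because nothing is stored, memory use stays small, but the running time still grows with the total number of combinations.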

Here goes the script try.pl:

#!/usr/bin/perl
use warnings;
$string="ATATGCGCAT";
###########################################
# Output to a TEXT File:
###########################################
$output="Results .txt";
open (my $fh,">",$output) or die"Cannot open file '$output'.\n";
#####################################
# To break into 4-letter fragments:
#####################################
while ($string=~ /(.{4}?)/ig) {
    $four=$&;
    @sw=$four=~ /[ATGC]{1}/igs;
    foreach my $single (@sw) {
        ####################################################
        # To extract single letter & append perd to single:
        ####################################################
        $perd="%d";
        $mod_four=$single.$perd; # concatenation
        push @new_four,$mod_four;
        $new_four = join ('',@new_four);
        # To produce all possible combinations without changing positions:
        for $a (1 .. 2) { # a has 2 levels:
            for $t (1 .. 2) { # t has 2 levels:
                for $g (1 .. 2) { # g has 2 levels:
                    for $c (1 .. 2) { # c has 2 levels:
                        $combi=sprintf($new_four,$a,$t,$g,$c,3-$a,3-$t,3-$g,3-$c);
                        print"~$combi\n";
                        print $fh "~$combi\n";
                    }
                }
            }
        }
    } # 2nd foreach closes:
} # 1st while closes:
print"~"; print"\n";
print $fh "~"; print $fh "\n";
close $output;
exit;

I have got the following results in the output text file Results .txt. This is not what I want:

~A1
~A1
~A1
~A1
~A1
~A1
~A1
~A1
~A2
~A2
~A2
~A2
~A2
~A2
~A2
~A2
~A1T1
~A1T1
~A1T1
~A1T1
~A1T2
~A1T2
~A1T2
~A1T2
~A2T1
~A2T1
~A2T1
~A2T1
~A2T2
~A2T2
~A2T2
~A2T2
~A1T1A1
~A1T1A1
~A1T1A2
~A1T1A2
~A1T2A1
~A1T2A1
~A1T2A2
~A1T2A2
~A2T1A1
~A2T1A1
~A2T1A2
~A2T1A2
~A2T2A1
~A2T2A1
~A2T2A2
~A2T2A2
~A1T1A1T1
~A1T1A1T2
~A1T1A2T1
~A1T1A2T2
~A1T2A1T1
~A1T2A1T2
~A1T2A2T1
~A1T2A2T2
~A2T1A1T1
~A2T1A1T2
~A2T1A2T1
~A2T1A2T2
~A2T2A1T1
~A2T2A1T2
~A2T2A2T1
~A2T2A2T2
~A1T1A1T1G2
~A1T1A1T2G2
~A1T1A2T1G2
~A1T1A2T2G2
~A1T2A1T1G2
~A1T2A1T2G2
~A1T2A2T1G2
~A1T2A2T2G2
~A2T1A1T1G1
~A2T1A1T2G1
~A2T1A2T1G1
~A2T1A2T2G1
~A2T2A1T1G1
~A2T2A1T2G1
~A2T2A2T1G1
~A2T2A2T2G1
~A1T1A1T1G2C2
~A1T1A1T2G2C2
~A1T1A2T1G2C2
~A1T1A2T2G2C2
~A1T2A1T1G2C1
~A1T2A1T2G2C1
~A1T2A2T1G2C1
~A1T2A2T2G2C1
~A2T1A1T1G1C2
~A2T1A1T2G1C2
~A2T1A2T1G1C2
~A2T1A2T2G1C2
~A2T2A1T1G1C1
~A2T2A1T2G1C1
~A2T2A2T1G1C1
~A2T2A2T2G1C1
~A1T1A1T1G2C2G2
~A1T1A1T2G2C2G2
~A1T1A2T1G2C2G1
~A1T1A2T2G2C2G1
~A1T2A1T1G2C1G2
~A1T2A1T2G2C1G2
~A1T2A2T1G2C1G1
~A1T2A2T2G2C1G1
~A2T1A1T1G1C2G2
~A2T1A1T2G1C2G2
~A2T1A2T1G1C2G1
~A2T1A2T2G1C2G1
~A2T2A1T1G1C1G2
~A2T2A1T2G1C1G2
~A2T2A2T1G1C1G1
~A2T2A2T2G1C1G1
~A1T1A1T1G2C2G2C2
~A1T1A1T2G2C2G2C1
~A1T1A2T1G2C2G1C2
~A1T1A2T2G2C2G1C1
~A1T2A1T1G2C1G2C2
~A1T2A1T2G2C1G2C1
~A1T2A2T1G2C1G1C2
~A1T2A2T2G2C1G1C1
~A2T1A1T1G1C2G2C2
~A2T1A1T2G1C2G2C1
~A2T1A2T1G1C2G1C2
~A2T1A2T2G1C2G1C1
~A2T2A1T1G1C1G2C2
~A2T2A1T2G1C1G2C1
~A2T2A2T1G1C1G1C2
~A2T2A2T2G1C1G1C1
~

The correct results in the output file Results .txt should look like:

~A1T1A1T1G2C2G2C2A?T?
~A1T1A1T2G2C2G2C1A?T?
.....................
.....................
.....................
~

In the 9th and 10th positions of the desired results I have used a ? sign to indicate an unknown number.