http://www.perlmonks.org?node_id=1085446

TJCooper has asked for the wisdom of the Perl Monks concerning the following question:

I have access to the following script:

#!/usr/bin/perl
$filename = "sample";
open (TEXT, "sample.txt")||die"Cannot";
$line = " ";
$count = 0;
for $n (5..50) {
    $re = qr /[CAGT]{$n}/;
    $regexes[$n-5] = $re;
}
NEXTLINE: while ($count < 1000) {
    $line = <TEXT> ;
    $count++;
    foreach my $value (@regexes) {
        $start = 0;
        while ($line =~ /$value/g) {
            $endline = $';
            $match = $&;
            $revmatch = reverse($match);
            $revmatch =~ tr/CAGT/GTCA/;
            if ($endline =~ /^([CAGT]{0,15})($revmatch)/) {
                $start = 1;
                $palindrome = $match . "*" . $1 . "*" . $2;
                $palhash{$palindrome}++;
            }
        }
        if ($start == 0) {
            goto NEXTLINE;
        }
    }
}
open my $out, ">/DIR/results.txt";
close TEXT;
while(($key, $value) = each (%palhash)) {
    print $out "$key => $value\n";
}
exit;

This script identifies DNA palindromes and writes them to an output file. A biological palindrome in DNA is a sequence that reads identically on both strands when each is read in the same direction (5' to 3'); in other words, the sequence is equal to its own reverse complement:

http://imageshack.com/a/img835/2787/de98.png (as shown by the blue/red regions)
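To make the definition concrete, here is a minimal sketch of the check as I understand it. The names revcomp and is_dna_palindrome are made up for illustration, and the input is assumed to be uppercase and to contain only A, C, G and T:

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (uppercase A, C, G, T assumed).
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;    # scalar assignment, so the string itself is reversed
    $rc =~ tr/ACGT/TGCA/;     # swap each base for its complement
    return $rc;
}

# True when the sequence equals its own reverse complement,
# i.e. both strands read the same 5' to 3'.
sub is_dna_palindrome {
    my ($seq) = @_;
    return $seq eq revcomp($seq);
}

print is_dna_palindrome('GAATTC') ? "palindrome\n" : "not a palindrome\n";
print is_dna_palindrome('GATTC')  ? "palindrome\n" : "not a palindrome\n";
```

Note that under this definition a palindrome always has an even length, since no base is its own complement.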

I find the above script rather messy and its output unclear. I was wondering if anybody could offer some help, tips, guidance or code to accomplish the following:

(1) Identify palindromic DNA sequences

(2) Be able to specify a minimum and maximum length of match

(3) Accept an optional parameter so that only palindromes containing a given sequence, for example 'GATC', are printed to the output file; if the parameter is left blank, the program prints every palindrome it finds

(4) The input will be the DNA sequence of one strand only (not both). The output needs to be the full palindromic sequence identified on that single strand. For example, for the image above the input would be:

AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG

The program would then calculate the other strand's sequence, check whether it contains any palindromes and, if so, output:

GTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGAC
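To show the behaviour I am after, here is a rough sketch rather than a finished solution. It grows a candidate outwards from every gap between two bases for as long as the outer bases are complementary, then applies the length and motif filters. The function name find_palindromes and its parameters are hypothetical, and I have assumed that only the longest palindrome at each centre should be reported, which may not be the best reading of requirement (2):

```perl
use strict;
use warnings;

my %COMPLEMENT = (A => 'T', C => 'G', G => 'C', T => 'A');

# Hypothetical helper: returns the longest reverse-complement palindrome at
# each centre whose length lies within [$min_len, $max_len]. If $motif is
# given and non-empty, only palindromes containing it are returned.
sub find_palindromes {
    my ($seq, $min_len, $max_len, $motif) = @_;
    $seq = uc $seq;
    my @bases = split //, $seq;
    my @found;

    # These palindromes have even length, so every centre is a gap
    # between two neighbouring bases.
    for my $centre (1 .. $#bases) {
        my ($left, $right) = ($centre - 1, $centre);

        # Expand while the outer pair is complementary; any character
        # other than A, C, G or T stops the expansion.
        while ($left >= 0 && $right <= $#bases
               && ($COMPLEMENT{ $bases[$left] } // '') eq $bases[$right]) {
            $left--;
            $right++;
        }

        my $len = $right - $left - 1;
        next unless $len;                                # nothing paired here
        next if $len < $min_len || $len > $max_len;      # requirement (2)

        my $pal = substr $seq, $left + 1, $len;
        next if defined $motif && length $motif
             && index($pal, uc $motif) < 0;              # requirement (3)
        push @found, $pal;
    }
    return @found;
}

my $input = 'AGAGGTCAGTCTGCATCGTATCGATCGTCGACGATCGATACGATGCAGACTGACGAGAG';
print "$_\n" for find_palindromes($input, 20, 60, 'GATC');
```

With these settings the results should include the 50-base sequence shown above; a lower minimum length would also report the shorter palindromes nested elsewhere in the input.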

Many thanks!