Finding the longest match from an initial match between two files

Allie_grater has asked for the wisdom of the Perl Monks concerning the following question:

print "Input the K value \n";
$k = <>;
chomp $k;

print "Input T\n";
$t = <>;
chomp $t;

%Qkmer = ();                      
$i = 1;

$query=' ';
while ($line=<IN>) {
chomp($line);
 if ($line=~ m/^>/ ) {
 next;
}
$query=$query.$line;
$line=~ s/(^|\n)[\n\s]*/$1/g;

 while (length($line) >= $k) {
   $line =~ m/(.{$k})/;
   if (! defined $Qkmer{$1}) { # every key not defined yet gets the position of its first match
     $Qkmer{$1} = $i;
   }
   $i++;
   $line = substr($line, 1, length($line) -1);
 }
}

open(MYDATA, '<', "data.txt");

while ($line=<MYDATA>) { \
  chomp($line);
  %Skmer = ();           # This initializes the hash called Skmer.
  $j = 1;

  if ($line=~ m/^>/ ) { #if the line starts with >
    next; #start on next line #separated characters
  }
  $line=~ s/^\s+|\s+$//g ; #remove all spaces from file
  while (length($line) >= $k) {
    $line =~ m/(.{$k})/; #match any k characters and only k characters in dna
    $Skmer{$1} = $j; #set the key position to $j and increase for each new key
    $j++;
    $line = substr($line, 1, length($line) -1); #this removes the first character in the current string
  }

  ###(56)###for($Skmerkey(keys %Skmer)){
    $i=$Skmer{$Skmerkey};
    if(defined $Qkmer($Skmerkey)){
      $j=$Qkmer($Skmerkey);
      }
      $S1=$line;
      $S2=$query;
      @arrayS1= split(//, $S1);
      @array2= split(//, $S2);

      $l=0;
      while($arrayS1[$i-$l] eq $arrayS2[$j-$l]){
        $l++;
      }
      $start=$i-$l;
      $m=0;
      while ($arrayS1[$i+$k+$m] eq $arrayS2[$j+$k+$m]) {
        $m++;
      }
      $length=$l+$k+$m;
      $match= substr($S1, $start, $length);

      if($length>$t){
        $longest=length($match);
        print "Longest: $match of length $longest \n";
      }
  }

}###(83)###

The input files contain only strings of letters. Starting from a word of length $k from file 1 that also occurs in file 2, I check to the left and to the right of that word for further matching characters. The final output is the longest match between file 1 and file 2 based on $k. With this code I get a syntax error, and I am not sure why, because it looks correct to me:

```
syntax error at testk.pl line 56, near "$Skmerkey("
syntax error at testk.pl line 83, near "}"
```

Thank you.


Replies are listed 'Best First'.
Re: Finding the longest match from an initial match between two files by GrandFather (Saint) on Nov 08, 2016 at 21:59 UTC
Your for loop syntax is broken. It should be:

`for my $Skmerkey (keys %Skmer) {`

In several places you use () instead of {} to access hash values. The code should look like:

`my $i = $Skmer{$Skmerkey};`

I strongly recommend you use strictures (use strict; use warnings; - see The strictures, according to Seuss). You initialize the array `@array2` but access `$arrayS2[...]` in several places. Strictures would have found that for you immediately.

Your first while loop reads from `<IN>`, but nothing opens IN as a file handle.

`\chomp($line)` is incorrect. Remove the \.

Premature optimization is the root of all job security
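A minimal sketch of the two syntax fixes above, run under strictures. The hash contents and the helper name `shared_kmers` are made up for illustration and are not from the original post; only the `for my ... (keys ...)` form and the `{}` lookups come from the advice above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: k-mer => position (not from the original post).
my %Qkmer = (ACT => 1, CTA => 2, TAC => 3);
my %Skmer = (TAC => 1, ACG => 2);

# Collect the k-mers present in both hashes, together with their positions.
sub shared_kmers {
    my ($s, $q) = @_;
    my @hits;
    for my $key (sort keys %$s) {         # loop variable declared with "my"
        next unless exists $q->{$key};    # hash elements are accessed with {}
        push @hits, [$key, $s->{$key}, $q->{$key}];
    }
    return @hits;
}

for my $hit (shared_kmers(\%Skmer, \%Qkmer)) {
    my ($kmer, $i, $j) = @$hit;
    print "$kmer: position $i in subject, position $j in query\n";
}
```

With `use strict` in place, a misspelled array such as `@arrayS2` is rejected at compile time instead of silently comparing against undef.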
Re^2: Finding the longest match from an initial match between two files by Allie_grater (Initiate) on Nov 08, 2016 at 22:21 UTC
After fixing all the issues with ()/{}, the error is now "uninitialized value in string eq" at line 73 (`$m++;`), but I am thinking that perhaps I need to expand the if block that contains the arrays to include $l and $m. As for the <IN>, doesn't reading into $query allow me to access the string outside of the first while loop?
Re^3: Finding the longest match from an initial match between two files by GrandFather (Saint) on Nov 08, 2016 at 23:25 UTC
Your uninitialized value is most likely due to trying to access an element beyond the end of one of the two arrays. Consider what happens when the content of the two arrays is identical.

I haven't looked at the logic of your code in any detail, but my impression is that you are not taking any advantage of the string processing power that Perl provides. In particular, splitting strings up into arrays of characters smells really bad.
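One way to avoid running off either end is to check the offsets before each comparison and to compare characters with `substr` instead of splitting into arrays. This is only a sketch under assumptions: the sub name `extend_seed`, the 0-based offsets, and the sample strings are illustrative and do not appear in the original code.

```perl
use strict;
use warnings;

# Extend a seed of length $k that starts at offset $i in $s1 and at offset $j
# in $s2 (both 0-based) to the left and right for as long as the characters agree.
# Returns (start offset in $s1, length of the extended match).
sub extend_seed {
    my ($s1, $i, $s2, $j, $k) = @_;

    # Walk left, stopping before either offset would drop below 0.
    my $l = 0;
    $l++ while $i - $l > 0
        && $j - $l > 0
        && substr($s1, $i - $l - 1, 1) eq substr($s2, $j - $l - 1, 1);

    # Walk right, stopping before either offset would pass the end of its string.
    my $m = 0;
    $m++ while $i + $k + $m < length($s1)
        && $j + $k + $m < length($s2)
        && substr($s1, $i + $k + $m, 1) eq substr($s2, $j + $k + $m, 1);

    return ($i - $l, $l + $k + $m);
}

# Made-up strings sharing "ACTACT"; the seed "CTA" starts at offset 3 in both.
my $s1 = 'GGACTACTTT';
my $s2 = 'CCACTACTAA';
my ($start, $len) = extend_seed($s1, 3, $s2, 3, 3);
print "extended match: ", substr($s1, $start, $len), "\n";
```

When the two strings are identical, both loops stop at the string boundaries rather than comparing undefined values.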
Re: Finding the longest match from an initial match between two files by tybalt89 (Monsignor) on Nov 09, 2016 at 00:41 UTC
Is this the kind of thing you are doing?

```perl
#!/usr/bin/perl -l
use strict;
use warnings;

my $k = 5;
my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

$_ = "$file1contents\n$file2contents";
print "at position $-[0] is match $1" while /(.{$k,}) (?= .* \n .* \1 )/gx;
```
Re^2: Finding the longest match from an initial match between two files by Cristoforo (Curate) on Nov 09, 2016 at 17:46 UTC
Seq1: TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT
Seq2: ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA

Seq1: TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT
Seq2: ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA

The above are line 1 and line 2 from your sequence sample. The first set shows, in red and blue, 2 matches from the regex. In the second, identical set you can see (in red) a match which is 1 character longer than the longest match (in red, above). My question is why the regex made 2 captures here instead of the optimal match in the second set (10 chars instead of 9). The code which accidentally found this was:

```perl
my $xor = $file1contents ^ $file2contents;
my $max = 0;
my $max_str;
my $pos;
while ($xor =~ /(\0+)/g) {
    my $len = length $1;
    if ($len > $max) {
        $max = $len;
        $max_str = substr $file1contents, $-[0], $len;
        $pos = $-[0];
    }
    #print "matched $-[0] ", substr $file1contents, $-[0], $+[0] - $-[0];
}
print "at pos $pos max string is $max_str";
```
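A note on why the XOR approach can only find matches by accident: the string `^` operator compares the two strings byte by byte at the same offsets, so a NUL run marks characters that are equal and also aligned. A common substring that sits at different offsets in the two strings produces no NUL run. The sketch below illustrates this with made-up strings; the sub name `aligned_runs` and its `$min` parameter are assumptions for the example.

```perl
use strict;
use warnings;

# Return [offset, text] for every run of at least $min characters that is
# identical at the same offsets in both strings.
sub aligned_runs {
    my ($s1, $s2, $min) = @_;
    my $xor = $s1 ^ $s2;    # string XOR: equal characters at an offset give "\0"
    my @runs;
    while ($xor =~ /(\0+)/g) {
        next if length($1) < $min;
        push @runs, [$-[0], substr($s1, $-[0], length($1))];
    }
    return @runs;
}

# "ACTACT" sits at offset 2 in both strings, so XOR finds it.
print "aligned: $_->[1] at $_->[0]\n"
    for aligned_runs('GGACTACTTT', 'CCACTACTAA', 3);

# "ACTAC" occurs in both strings but shifted by one, so XOR finds nothing.
print "shifted runs found: ", scalar(my @none = aligned_runs('ACTACT', 'GACTAC', 3)), "\n";
```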
Re^3: Finding the longest match from an initial match between two files by tybalt89 (Monsignor) on Nov 09, 2016 at 18:51 UTC
It's a question of whether overlapping matches are wanted or not. The code I posted in Re: Finding the longest match from an initial match between two files deliberately did not look for overlapping matches. If overlapping matches are wanted, the regex could be changed to the following:

```perl
#!/usr/bin/perl -l
use strict;
use warnings;

my $k = 5;
my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

$_ = "$file1contents\n$file2contents";
print "at position $-[0] is match $1" while /(?= (.{$k,}) .* \n .* \1 )/gx;
```

And the output from this change is:

```
at position 8 is match AAAAC
at position 27 is match ACTACTACT
at position 28 is match CTACTACT
at position 29 is match TACTACTACT
at position 30 is match ACTACTACT
at position 31 is match CTACTACT
at position 32 is match TACTACTACT
at position 33 is match ACTACTACT
at position 34 is match CTACTACT
at position 35 is match TACTACT
at position 36 is match ACTACT
at position 37 is match CTACT
at position 39 is match ACTTCAA
at position 40 is match CTTCAA
at position 41 is match TTCAA
at position 44 is match AAAAC
```

which shows the longer match you found (in fact, two of them, partially overlapping). It all depends on what the output is going to be used for, I suppose. One of the reasons I posted the code was to prompt discussion about the problem.
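If only the single longest match is wanted, the overlapping regex can be wrapped in a loop that keeps the longest capture seen so far. This is a rough sketch rather than a tuned solution: the sub name `longest_common` is made up, it assumes neither input contains a newline, and the backtracking regex may be slow on long sequences.

```perl
use strict;
use warnings;

# Longest substring of $s1 (at least $k characters) that also occurs in $s2.
# Returns (match, 0-based position in $s1), or ('', -1) if there is none.
# Assumes neither string contains "\n", which is used as the separator.
sub longest_common {
    my ($s1, $s2, $k) = @_;
    my ($best, $best_pos) = ('', -1);
    my $joined = "$s1\n$s2";
    while ($joined =~ /(?= (.{$k,}) .* \n .* \1 )/gx) {
        # Keep the first match of the greatest length.
        ($best, $best_pos) = ($1, $-[0]) if length($1) > length($best);
    }
    return ($best, $best_pos);
}

my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

my ($match, $pos) = longest_common($file1contents, $file2contents, 5);
print "longest match $match of length ", length($match), " at position $pos\n";
```

The strict `>` comparison means ties are resolved in favour of the earliest position; use `>=` to prefer the latest one instead.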
Re: Finding the longest match from an initial match between two files by tybalt89 (Monsignor) on Nov 08, 2016 at 21:51 UTC
`for($Skmerkey(keys %Skmer)){`

What is $Skmerkey? A hash? Use {}. An array? Use []. A code ref? Use ->(...).

