PerlMonks  

FInding the longest match from an initial match between two files

by Allie_grater (Initiate)
on Nov 08, 2016 at 21:38 UTC ( [id://1175558]=perlquestion )

Allie_grater has asked for the wisdom of the Perl Monks concerning the following question:

print "Input the K value \n";
$k = <>;
chomp $k;
print "Input T\n";
$t = <>;
chomp $t;
%Qkmer = ();
$i = 1;
$query=' ';
while ($line=<IN>) {
    chomp($line);
    if ($line=~ m/^>/ ) {
        next;
    }
    $query=$query.$line;
    $line=~ s/(^|\n)[\n\s]*/$1/g;
    while (length($line) >= $k) {
        $line =~ m/(.{$k})/;
        if (! defined $Qkmer{$1}) { #every key not defined as the first match
            $Qkmer{$1} = $i;
        }
        $i++;
        $line = substr($line, 1, length($line) -1);
    }
}
open(MYDATA, '<', "data.txt");
while ($line=<MYDATA>) {
    \ chomp($line);
    %Skmer = (); # This initializes the hash called Skmer.
    $j = 1;
    if ($line=~ m/^>/ ) { #if the line starts with >
        next; #start on next line #separated characters
    }
    $line=~ s/^\s+|\s+$//g ; #remove all spaces from file
    while (length($line) >= $k) {
        $line =~ m/(.{$k})/; #match any k characters and only k characters in dna
        $Skmer{$1} = $j; #set the key position to $j and increase for each new key
        $j++;
        $line = substr($line, 1, length($line) -1); #this removes the first character in the current string
    }
    ###(56)###
    for($Skmerkey(keys %Skmer)){
        $i=$Skmer{$Skmerkey};
        if(defined $Qkmer($Skmerkey)){
            $j=$Qkmer($Skmerkey);
        }
        $S1=$line;
        $S2=$query;
        @arrayS1= split(//, $S1);
        @array2= split(//, $S2);
        $l=0;
        while($arrayS1[$i-$l] eq $arrayS2[$j-$l]){
            $l++;
        }
        $start=$i-$l;
        $m=0;
        while ($arrayS1[$i+$k+$m] eq $arrayS2[$j+$k+$m]) {
            $m++;
        }
        $length=$l+$k+$m;
        $match= substr($S1, $start, $length);
        if($length>$t){
            $longest=length($match);
            print "Longest: $match of length $longest \n";
        }
    }
}###(83)###
The input files contain only strings of letters. Starting from a match of a word of length $k from file 1 in file 2, I check to the left and to the right of that word in file 2 for further matching characters. The final output should be the longest match between file 1 and file 2 based on $k. With this code I get a syntax error, and I am not sure why, because it looks correct to me:

```
syntax error at testk.pl line 56, near "$Skmerkey("
syntax error at testk.pl line 83, near "}"
```

Thank you.

Replies are listed 'Best First'.
Re: FInding the longest match from an initial match between two files
by GrandFather (Saint) on Nov 08, 2016 at 21:59 UTC

    Your for loop syntax is broken. It should be:

    for my $Skmerkey (keys %Skmer) {

    In several places you use () instead of {} for accessing hash values. The code should look like:

    my $i = $Skmer{$Skmerkey};

    I strongly recommend you use strictures (use strict; use warnings; - see The strictures, according to Seuss). You initialize the array @array2 but access $arrayS2[...] in several places. Strictures would have found that for you immediately.

    Your first while loop references <IN>, but nothing opens IN as a file handle.

    \chomp($line) is incorrect. Remove the \.
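
A minimal sketch of how these fixes might look together under strictures. The sample data, the in-memory filehandle and the names `$fasta`, `$in` and `@report` are assumptions made for illustration, not the original files or variables, and positions are counted from 0 here rather than from 1:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the query file, read through an in-memory handle.
my $fasta = ">query\nACGTAC\n";
open(my $in, '<', \$fasta) or die "Cannot open in-memory file: $!";

my $k = 3;
my %Qkmer;         # k-mer => offset of its first occurrence (0-based)
my $query = '';
while (my $line = <$in>) {
    chomp $line;               # plain chomp, no leading backslash
    next if $line =~ /^>/;     # skip header lines that start with >
    $query .= $line;
}
close $in;

# Record the first offset of every k-mer in the query.
for my $pos (0 .. length($query) - $k) {
    my $kmer = substr $query, $pos, $k;
    $Qkmer{$kmer} = $pos unless defined $Qkmer{$kmer};
}

# Loop variable declared with "my"; hash elements accessed with {}.
my @report;
for my $kmer (sort keys %Qkmer) {
    push @report, "$kmer=$Qkmer{$kmer}";
}
print join(' ', @report), "\n";
```

With strict enabled, a misspelled array name such as @array2 versus @arrayS2 is reported at compile time instead of silently creating a second, empty array.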

    Premature optimization is the root of all job security
      After fixing all the issues with ()/{}, the error is now an uninitialized value in string eq at line 73 ("$m++;"). I am thinking that perhaps I need to expand the if block that contains the arrays to include $l and $m. As for the <IN>, doesn't reading into $query allow me to access the string outside of the first while loop?

        Your uninitialized value is most likely due to trying to access an element beyond the end of one of the two arrays. Consider what happens when the content of the two arrays is identical.

        I haven't looked at the logic of your code in any detail, but my impression is that you are not taking any advantage of the string processing power that Perl provides. In particular splitting strings up into arrays of characters smells really bad.
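
One possible way to extend a seed match without running off either end, sketched with substr instead of character arrays. The function name `extend_seed`, its argument order and the use of 0-based offsets are assumptions for illustration only, not the original poster's code:

```perl
use strict;
use warnings;

# Extend a seed of length $k that starts at offset $i in $s1 and offset $j
# in $s2 (both 0-based) for as long as the neighbouring characters agree.
sub extend_seed {
    my ($s1, $i, $s2, $j, $k) = @_;

    # Walk left, stopping at the start of either string.
    my $left = 0;
    while ($i - $left > 0 && $j - $left > 0
        && substr($s1, $i - $left - 1, 1) eq substr($s2, $j - $left - 1, 1)) {
        $left++;
    }

    # Walk right, stopping at the end of either string.
    my $right = 0;
    while ($i + $k + $right < length($s1) && $j + $k + $right < length($s2)
        && substr($s1, $i + $k + $right, 1) eq substr($s2, $j + $k + $right, 1)) {
        $right++;
    }

    return substr $s1, $i - $left, $left + $k + $right;
}

# Seed "CTA" sits at offset 4 in the first string and offset 3 in the second.
print extend_seed('GGTACTACTT', 4, 'CTACTACAA', 3, 3), "\n";
```

Because both loops test the offsets before comparing, identical inputs simply extend to the full string rather than reading past the end and triggering an uninitialized-value warning.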

        Premature optimization is the root of all job security
Re: FInding the longest match from an initial match between two files
by tybalt89 (Monsignor) on Nov 09, 2016 at 00:41 UTC

    Is this the kind of thing you are doing?

    #!/usr/bin/perl -l
    use strict;
    use warnings;

    my $k = 5;
    my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
    my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

    $_ = "$file1contents\n$file2contents";
    print "at position $-[0] is match $1" while /(.{$k,}) (?= .* \n .* \1 )/gx;
      Seq1: TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT
      Seq2: ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA

      Seq1: TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT
      Seq2: ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA

      The above are line 1 and line 2 from your sequence sample. The first set shows, in red and blue, 2 matches from the regex.

      In the second, identical set you can see (in red) a match which is 1 character longer than the longest match (in red, above).

      My question is why the regex made 2 captures here instead of the optimal match shown in the second set (10 chars instead of 9).

      The code which accidentally found this was:

      my $xor = $file1contents ^ $file2contents;
      my $max = 0;
      my $max_str;
      my $pos;
      while ($xor =~ /(\0+)/g) {
          my $len = length $1;
          if ($len > $max) {
              $max = $len;
              $max_str = substr $file1contents, $-[0], $len;
              $pos = $-[0];
          }
          #print "matched $-[0] ", substr $file1contents, $-[0], $+[0] - $-[0];
      }
      print "at pos $pos max string is $max_str";
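
A small sketch of why the XOR approach only finds matches by accident: string XOR compares the two strings byte by byte at the same offset, so a shared substring shows up as a run of "\0" only when it starts at the same position in both strings. The sample strings and variable names below are made up for illustration:

```perl
use strict;
use warnings;

# Bytewise string XOR: equal characters at the same offset become "\0".
my $same_offset  = 'GGTACTGG' ^ 'CCTACTCC';   # TACT at offset 2 in both strings
my $other_offset = 'TACTGGGG' ^ 'GGGGTACT';   # TACT at offset 0 versus offset 4

my @runs_same  = $same_offset  =~ /(\0+)/g;   # runs of matching characters
my @runs_other = $other_offset =~ /(\0+)/g;

print "aligned runs: ", scalar(@runs_same),
      ", shifted runs: ", scalar(@runs_other), "\n";
```

In the sample sequences the 10-character match happens to begin at the same offset in both lines, which is why the XOR scan picks it up.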

        It's a question of whether overlapping matches are wanted or not. The code I posted in Re: FInding the longest match from an initial match between two files deliberately did not look for overlapping matches.

        If overlapping matches are wanted, the regex could be changed to the following:

        #!/usr/bin/perl -l
        use strict;
        use warnings;

        my $k = 5;
        my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
        my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

        $_ = "$file1contents\n$file2contents";
        print "at position $-[0] is match $1" while /(?= (.{$k,}) .* \n .* \1 )/gx;

        And the output from this change is:

        at position 8 is match AAAAC
        at position 27 is match ACTACTACT
        at position 28 is match CTACTACT
        at position 29 is match TACTACTACT
        at position 30 is match ACTACTACT
        at position 31 is match CTACTACT
        at position 32 is match TACTACTACT
        at position 33 is match ACTACTACT
        at position 34 is match CTACTACT
        at position 35 is match TACTACT
        at position 36 is match ACTACT
        at position 37 is match CTACT
        at position 39 is match ACTTCAA
        at position 40 is match CTTCAA
        at position 41 is match TTCAA
        at position 44 is match AAAAC

        which shows the longer match you found (in fact, two of them, partially overlapping).

        It all depends on what the output is going to be used for, I suppose. One of the reasons I posted the code was to prompt discussion about the problem.
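
If only the single longest match is wanted, the overlapping version could be reduced to something like the sketch below. It reuses the lookahead regex from this thread; the variables `$best` and `$best_pos` and the choice to keep the first of several equally long matches are assumptions for illustration:

```perl
use strict;
use warnings;

my $k = 5;
my $file1contents = 'TACATCTCAAAACACTTTCATCTCACGACTACTACTACTACTTCAAAACACCATCAT';
my $file2contents = 'ACTTCAACATAACTACTATATACTACTCATACTACTACTCTTAAAACTACTATACTA';

$_ = "$file1contents\n$file2contents";

# Try every start position in line 1 (zero-width lookahead, so matches may
# overlap) and remember the longest capture that also occurs in line 2.
my ($best, $best_pos) = ('', -1);
while (/(?= (.{$k,}) .* \n .* \1 )/gx) {
    ($best, $best_pos) = ($1, $-[0]) if length($1) > length($best);
}

print "longest match $best (length ", length($best), ") at position $best_pos\n";
```

For the sample sequences this keeps the 10-character TACTACTACT at position 29, the first of the two longest entries in the listing above.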

Re: FInding the longest match from an initial match between two files
by tybalt89 (Monsignor) on Nov 08, 2016 at 21:51 UTC
    for($Skmerkey(keys %Skmer)){

    What is $Skmerkey ? A hash ? Use {}. An array ? Use []. A code ref ? Use ->(...).
