Your skill will accomplishwhat the force of many cannot PerlMonks

### Pattern Searching

by aseee (Novice)
 on Nov 10, 2012 at 05:38 UTC Need Help??
aseee has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all Monks,

I am trying to find the common patterns in different strings. I used the KMP algorithm. It works best for a single string but show almost nothing for the same string in next iterator of loop, Here is my code

```use CGI qw(:standard);
use strict;

BEGIN{ push @INC,'C:/src/String-1.5','C:/src/bioperl-live','C:/src/ens
+embl/modules','C:/src/ensembl-compara/modules','C:/src/ensembl-variat
+ion/modules','C:/src/ensembl-functgenomics/modules','C:/src/DBD-mysql
+-4.021';};

use strict;
use warnings;
use string;
my \$T_one='GAATTCACGGATAGCCWGGTACGATGATAGATGAATTCGACTAGAATGCCWGGAAGAAT
+gGAATTC';
my \$Ta='GAATTCACGGATAGCCWGGTACGATGATAGATGAATTCGACTAGAATGCCWGGAAGAATgGA
+ATTC';
my \$rr='GAATTCACGGATAGCCWGGTACGATGATAGATGAATTCGACTAGAATGCCWGGAAGAATgGA
+ATTC';
my @P=('GAATTC','CCWGG');
my @next;
my @loc;
my \$m;
my \$l=0;
my \$lt;my \$T;
my @text=(0);
my @tt=(\$T_one,\$Ta,\$rr);

for(my \$i=0;\$i<3;\$i++){  \$T=\$tt[\$i];

@loc=();
print "<\br>";

@text=();
my \$str = new String(\$T);   # USE  OF STRING MODULE TO FETCH
+THE SEQUENCE
print  "length of gene sequence array     ", \$lt= \$str->length
+, "\n";    # LENGTH OF THE SEQUENCE
\$lt= \$str->length;
my \$z=0;                         # FOR ARRAY INCREMENT
for(my \$k=0; \$k<\$lt;\$k++){
\$text[\$z]=\$str->charAt(\$k);   #ASSIGNING THE ARRAY
+OBJECT VALUES TO AN ARRAY
\$z++;
} #END OF FOR LOOP

foreach my \$P(@P){
print knuth_morris_pratt(\$T,\$P);
print"\n";
}

@loc = sort {\$a <=> \$b} @loc;
print"</br>";print "@loc";print "</br>";

my \$i=0;
for(my \$k=0;\$k<\$lt;\$k++){
print \$i;
if (\$k == \$loc[\$i]){
print "<span style=background-color:re
+d;>" .  \$text[\$k] . "</span>";
\$i++;

}
else
{ print \$text[\$k];

}
}  # end of \$k for loop

#print"\n";print "@loc";print "<\n>";
# FOR ARRAY INCREMENT
print "</br>";
}   #end of \$T for each loop

sub knuth_morris_pratt_next {
my ( \$P ) = @_; # The pattern.
use integer;
\$m=length \$P;
my (\$i, \$j ) = (  0, -1 );

for (\$next[0] = -1; \$i < \$m; ) {
# Note that this while() is skipped during the first for() pas
+s.
while ( \$j > -1 &&
substr( \$P, \$i, 1 ) ne substr( \$P, \$j, 1 ) ) {
\$j = \$next[ \$j ];
}
\$i++;
\$j++;
\$next[ \$i ] =
substr( \$P, \$j, 1 ) eq substr( \$P, \$i, 1 ) ?
\$next[ \$j ] : \$j;
}
return ( \$m, @next ); # Length of pattern and prefix function.
}

#######################################
sub knuth_morris_pratt {
my ( \$T, \$P ) = @_; # Text and pattern.
use integer;
my( \$m, @next) = knuth_morris_pratt_next( \$P );
my ( \$n, \$i, \$j ) = ( length(\$T), 0, 0 );

while ( \$i < \$n ) {
while ( \$j > -1 && substr( \$P, \$j, 1 ) ne substr( \$T, \$i, 1 )
+) {
\$j = \$next[ \$j ];
}
\$i++;
\$j++;

if(\$j >= \$m){  my \$a=\$i-\$j;

#    \$j=\$j-1;print "\n";

\$j=\$next[\$j];

for(my \$z=0;\$z<\$m;\$z++){
\$loc[\$l]=\$a;
\$l++;
\$a++;
}

}

}
return ; # Mismatch.

}
#################################

I am new here so apology for any mistake. I am a biologist doing a project in perl.Please Monks help me

Replies are listed 'Best First'.
Re: Pattern Searching
by marquezc329 (Scribe) on Nov 10, 2012 at 07:31 UTC
Hello and welcome, aseee.

You posted a rather large amount of code for a broad question. If you can show expected output and tell where exactly you are having trouble it will be much easier to provide a helpful answer. Also, I'd suggest having a look through perlintro and perlstyle. Cleaning up your code a bit will go a long way towards helping us help you. Not to mention, a strong understanding of the basics will better equip you to answer your own questions in the future.

**UPDATE**

Since you didn't exactly point out where your problem lies, I'm going to guess it is most likely in the maze of single char variable names and unformatted loops. I would suggest cleaning up your algorithm subs. The added clarity in conjunction with a slow review may yield the solution to your problem. Also, the Knuth-Morris-Pratt algorithm is covered on page 370 in Mastering Algorithms with Perl by Jarkko Hietaniemi, John Macdonald, and Jon Orwant. There is a link to view this material online in this node Knuth-Morris-Pratt Vs. Perl. I suggest you contemplate rewriting your code using this material as a guide, and implore you to research more thoroughly before asking for help. Finding your own answers is infinitely more rewarding ( I myself learned what the KMP algorithm is and how it works by researching Your question tonight ;).

I took a stab at cleaning up some of your code. Some notes to keep in mind:

Incremented for loops i.e. for (\$i = 0; \$i <10; \$i++) { ... } are easily changed to for (0..9) { ... }

White space can be key in clarity. Try to keep a consistent indentation theme. i.e:

```for (0..9) {
print "line1\n";
print "line2\n";
if (\$cond =~ /pattern/i) {
do something;
}
}

Here is a revised (UNTESTED) sample of your opening for loop.

```
my @tt=(\$T_one,\$Ta,\$rr);

foreach my \$string (@tt) {
@loc=();
@text=();

my \$strLength = length(\$string);
my @stringArr = split //, \$string;

print "<\br>";
print "Length of Gene Sequence Array:    \$strLength\n";

foreach my \$pattern (@patterns){
print knuth_morris_pratt(\$string, \$pattern), "\n";
}

@loc = sort {\$a <=> \$b} @loc;

print "</br>";
print "@loc";
print "</br>";

my \$i=0;
foreach my \$k (0 .. (\$strLength-1)) {
print \$i;
if (\$k == \$loc[\$i]){
print "<span style=background-color:red;>\$text[\$k]</span>"
+;
\$i++;
} else {
print \$text[\$k];
}
}

print "</br>";

}

well thanks for your reply. In this code the KMP algorithm is implemented. A pattern and the string are passed to the KMP function named as knuth_morris_pratt.It returns the location of patterns, store in @loc array. Then the string is displayed with a colored background where the patterns occur. Code for this is

```
my \$ii=1;                        for(my \$k=1;\$k<\$lt;\$k++){
if (\$k == \$loc[\$ii]){
print "<span style=background-color:red;>" .\$text[\$k] . "</span>";
\$ii++; }
}

The process is repeated for all strings stored in the array. This code works well in first iteration of for loop but it does not show the same result in the 2nd iteration for the same string, stored at the next index of array. The output of this code is

```<r>length of gene sequence array 66
0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65
GAATTCCCWGGGAATTCCCWGGGAATTC
<r>length of gene sequence array 66
0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65
G
<r>length of gene sequence array 66

0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65
G

I want

< r>length of gene sequence array 66
0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65
GAATTCCCWGGGAATTCCCWGGGAATTC
< r>length of gene sequence array 66
0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65
GAATTCCCWGGGAATTCCCWGGGAATTC
< r>length of gene sequence array 66
0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65
GAATTCCCWGGGAATTCCCWGGGAATTC

The numbers shows the position of patterns that is same for the same three strings but the patterns are not displayed for the last two strings. The problem is in displaying the patterns in 2nd and 3rd iteration of main for loop.

Just a note to congratulate you for putting the research into your answer. I upvoted your answer but thought just doing that wasn't enough given how you put the effort into giving a comprehensive reply

A Monk aims to give answers to those who have none, and to learn from those who know more.

I appreciate your efforts in solving the problem. your code have the same problem i.e. works well for the 1st iteration, foreach loop of \$string then it shows only first letter of patterns for rest of strings.

I figured as much. The snippet wasn't intended to debug, merely to give an example rewrite implementing some of the stylistic points I mentioned.

I'd suggest taking a look at perldebtut. Using the Perl debugger to step through your code may help you visualize the flow of your program through each control structure, and figure out where it may be exiting prematurely on the 2nd and 3rd iteration.

Re: Pattern Searching
by grondilu (Friar) on Nov 10, 2012 at 12:18 UTC

Common pattern in different strings? Using KMP?

I think I solved the corresponding problem on rosalind.info. Here is my solution:

```#!/usr/bin/perl
use strict;
use warnings;

sub kmp {
my @T = (-1, 0);
my (\$pos, \$cnd) = (2, 0);
while ( \$pos <= @_ ) {
if (\$_[\$pos-1] eq \$_[\$cnd]) { \$T[\$pos++] = ++\$cnd }
elsif (\$cnd > 0) { \$cnd = \$T[\$cnd] // 0 }
else { \$T[\$pos++] = 0 }
}
shift @T;
return @T;
}

use List::Util qw(max min);
sub lcs {
my @first = split //, my \$first = shift;
my \$n = @first;
my @string = map [ split //, \$_ ], @_;
my (\$pos, \$length) = (0, 0);
SUFFIX: for my \$p ( 0 .. \$n - 1 ) {
shift @first if \$p;
my @max;
for (@string) {
my @kmp = kmp @first, '\$', @\$_;
my \$max = max @kmp;
next SUFFIX if \$max < \$length;
push @max, \$max;
}
if ((my \$min = min @max) > \$length) {
(\$pos, \$length) = (\$p, \$min);
}
}
return substr \$first, \$pos, \$length;
}

open my \$f, '< rosalind_lcs.txt' or die "could not open: \$!";
say lcs <\$f>;

Hope this will help

hey Monkeys I need more support. I am trying this from 10 hours. I have to show results to my supervisor. I try my level best and still trying. The problem is in printing the @loc array in 2nd iteration. I try my level best to debug this. What I understand is that in 2nd iteration the condition is true for only one time where as the @loc array values shows that the condition should be true for all values.I tried to fix it but all results into nothing. Please help me.

sorry the word is PerlMonks not monkeys. Extremely sorry.
Re: Pattern Searching
by CountZero (Bishop) on Nov 10, 2012 at 21:19 UTC
I must be missing something obvious, but I can reproduce the basics of your output by a few nested loops and a regex.
```use Modern::Perl;

my @patterns = qw/GAATTC CCWGG/;

while ( my \$sequence = <DATA> ) {
my @matches;
while ( \$sequence =~ m/(\$patterns[0]|\$patterns[1])/g ) {
for my \$expr ( 1 .. \$#- ) {
push @matches, \$-[\$expr] .. \$+[\$expr] - 1;
}
}
say "@matches";
}

__DATA__
GAATTCACGGATAGCCWGGTACGATGATAGATGAATTCGACTAGAATGCCWGGAAGAATgGAATTC
GAATTCACGGATAGCCWGGTACGATGATAGATGAATTCGACTAGAATGCCWGGAAGAATgGAATTC
GAATTCACGGATAGCCWGGTACGATGATAGATGAATTCGACTAGAATGCCWGGAAGAATgGAATTC
Output:
```0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65
0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65
0 1 2 3 4 5 14 15 16 17 18 32 33 34 35 36 37 48 49 50 51 52 60 61 62 6
+3 64 65

CountZero

A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics
Thanks to all of you. I have solved my problem.I learn many things from you people. Now I have to call java code by using the Perl. Can you provide me some example or a link which thoroughly cover this.
لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Create A New User
Node Status?
node history
Node Type: perlquestion [id://1003232]
Approved by davido
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (4)
As of 2017-08-17 04:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
Who is your favorite scientist and why?

Results (280 votes). Check out past polls.

Notices?