Hi there,
I am learning to use suffix arrays for pattern matching, by reading Gusfield's 'Algorithms on Strings, Trees, and Sequences' - p149 "7.14 APL13: Suffix arrays - more space reduction" and p151 "7.14.2 How to search for a pattern using a suffix array". Basically, once a suffix array is populated in lexical order, a binary search is used to find the smallest suffix index such that the pattern matches the first n characters of the suffix. I am starting with the simplest approach, without any advanced accelerants. Based on my understanding, I wrote the below script, and ask for your insight.
1. Am I understanding this correctly?
2. Is this the correct implementation, ie are the results accidentally correct?
I know there are plenty tools that will do what I want, but I would like to gain better understanding of what is going on. This is not homework, or an interview question, but a growing appreciation of bioinformatics.
many thanks!
#!/usr/bin/perl
use Modern::Perl '2011';
use autodie;
my $str = 'mississippi$';
my @suff; # hold suffixes
while ( $str ) {
push @suff, $str;
substr ( $str, 0, 1, '' );
}
# lexically ordered suffix array
my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff;
shift @indices; # remove $
for ( 0 .. $#indices ) {
say $indices[$_]+1, ": ", $suff[ $indices[$_] ];
}
my $pattern = 'issi';
# find smallest index where pattern matches first n characters of suff
+ix
my $start = bsearch( \@indices, $pattern );
# get consecutive positions where pattern matches first n chars of suf
+fix
my @positions;
for my $index ( $start .. $#indices ) {
push @positions, $indices[ $index ] + 1 if $suff[ $indices[ $index
+ ] ] =~ /^$pattern/;
}
# print results
say "\nPattern \"$pattern\" found at (position: suffix):";
for my $pos ( sort{ $a <=> $b } @positions ) {
chop( my $s = $suff[ $pos - 1] );
say "$pos: $s";
}
sub bsearch {
my $mid;
my ( $low, $high ) = ( 0, $#indices );
while ( 1 ) {
$mid = int( ( $low + $high ) / 2 );
if( $suff[ $indices[ $mid ] ] =~ /^$pattern/ ) {
return $mid;
}
if( $pattern lt $suff[ $indices[ $mid ] ] ) {
$high = $mid - 1;
}
elsif( $pattern gt $suff[ $indices[ $mid ] ] ) {
$low = $mid + 1;
}
return unless $low <= $high;
}
return "pattern not found in string!";
}
#output
abualiga:~$ ./suffixArray.pl
11: i$
8: ippi$
5: issippi$
2: ississippi$
1: mississippi$
10: pi$
9: ppi$
7: sippi$
4: sissippi$
6: ssippi$
3: ssissippi$
Pattern "issi" found at (position: suffix):
2: ississippi
5: issippi