abualiga has asked for the wisdom of the Perl Monks concerning the following question:
Hi there,
I am learning to use suffix arrays for pattern matching, by reading Gusfield's 'Algorithms on Strings, Trees, and Sequences' - p149 "7.14 APL13: Suffix arrays - more space reduction" and p151 "7.14.2 How to search for a pattern using a suffix array". Basically, once a suffix array is populated in lexical order, a binary search is used to find the smallest suffix index such that the pattern matches the first n characters of the suffix. I am starting with the simplest approach, without any advanced accelerants. Based on my understanding, I wrote the below script, and ask for your insight.
1. Am I understanding this correctly?
2. Is this the correct implementation, ie are the results accidentally correct?
I know there are plenty tools that will do what I want, but I would like to gain better understanding of what is going on. This is not homework, or an interview question, but a growing appreciation of bioinformatics.
many thanks!
#!/usr/bin/perl use Modern::Perl '2011'; use autodie; my $str = 'mississippi$'; my @suff; # hold suffixes while ( $str ) { push @suff, $str; substr ( $str, 0, 1, '' ); } # lexically ordered suffix array my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff; shift @indices; # remove $ for ( 0 .. $#indices ) { say $indices[$_]+1, ": ", $suff[ $indices[$_] ]; } my $pattern = 'issi'; # find smallest index where pattern matches first n characters of suff +ix my $start = bsearch( \@indices, $pattern ); # get consecutive positions where pattern matches first n chars of suf +fix my @positions; for my $index ( $start .. $#indices ) { push @positions, $indices[ $index ] + 1 if $suff[ $indices[ $index + ] ] =~ /^$pattern/; } # print results say "\nPattern \"$pattern\" found at (position: suffix):"; for my $pos ( sort{ $a <=> $b } @positions ) { chop( my $s = $suff[ $pos - 1] ); say "$pos: $s"; } sub bsearch { my $mid; my ( $low, $high ) = ( 0, $#indices ); while ( 1 ) { $mid = int( ( $low + $high ) / 2 ); if( $suff[ $indices[ $mid ] ] =~ /^$pattern/ ) { return $mid; } if( $pattern lt $suff[ $indices[ $mid ] ] ) { $high = $mid - 1; } elsif( $pattern gt $suff[ $indices[ $mid ] ] ) { $low = $mid + 1; } return unless $low <= $high; } return "pattern not found in string!"; } #output abualiga:~$ ./suffixArray.pl 11: i$ 8: ippi$ 5: issippi$ 2: ississippi$ 1: mississippi$ 10: pi$ 9: ppi$ 7: sippi$ 4: sissippi$ 6: ssippi$ 3: ssissippi$ Pattern "issi" found at (position: suffix): 2: ississippi 5: issippi
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: searching for a pattern using suffix arrays (close)
by tye (Sage) on Oct 08, 2013 at 19:39 UTC | |
by abualiga (Scribe) on Oct 09, 2013 at 02:19 UTC | |
by tye (Sage) on Oct 09, 2013 at 02:46 UTC | |
by abualiga (Scribe) on Oct 09, 2013 at 19:00 UTC |