Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

searching for a pattern using suffix arrays

by abualiga (Scribe)
on Oct 08, 2013 at 18:44 UTC ( #1057450=perlquestion: print w/ replies, xml ) Need Help??
abualiga has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I am learning to use suffix arrays for pattern matching, by reading Gusfield's 'Algorithms on Strings, Trees, and Sequences' - p149 "7.14 APL13: Suffix arrays - more space reduction" and p151 "7.14.2 How to search for a pattern using a suffix array". Basically, once a suffix array is populated in lexical order, a binary search is used to find the smallest suffix index such that the pattern matches the first n characters of the suffix. I am starting with the simplest approach, without any advanced accelerants. Based on my understanding, I wrote the below script, and ask for your insight.

1. Am I understanding this correctly?
2. Is this the correct implementation, ie are the results accidentally correct?

I know there are plenty tools that will do what I want, but I would like to gain better understanding of what is going on. This is not homework, or an interview question, but a growing appreciation of bioinformatics.

many thanks!

#!/usr/bin/perl use Modern::Perl '2011'; use autodie; my $str = 'mississippi$'; my @suff; # hold suffixes while ( $str ) { push @suff, $str; substr ( $str, 0, 1, '' ); } # lexically ordered suffix array my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff; shift @indices; # remove $ for ( 0 .. $#indices ) { say $indices[$_]+1, ": ", $suff[ $indices[$_] ]; } my $pattern = 'issi'; # find smallest index where pattern matches first n characters of suff +ix my $start = bsearch( \@indices, $pattern ); # get consecutive positions where pattern matches first n chars of suf +fix my @positions; for my $index ( $start .. $#indices ) { push @positions, $indices[ $index ] + 1 if $suff[ $indices[ $index + ] ] =~ /^$pattern/; } # print results say "\nPattern \"$pattern\" found at (position: suffix):"; for my $pos ( sort{ $a <=> $b } @positions ) { chop( my $s = $suff[ $pos - 1] ); say "$pos: $s"; } sub bsearch { my $mid; my ( $low, $high ) = ( 0, $#indices ); while ( 1 ) { $mid = int( ( $low + $high ) / 2 ); if( $suff[ $indices[ $mid ] ] =~ /^$pattern/ ) { return $mid; } if( $pattern lt $suff[ $indices[ $mid ] ] ) { $high = $mid - 1; } elsif( $pattern gt $suff[ $indices[ $mid ] ] ) { $low = $mid + 1; } return unless $low <= $high; } return "pattern not found in string!"; } #output abualiga:~$ ./suffixArray.pl 11: i$ 8: ippi$ 5: issippi$ 2: ississippi$ 1: mississippi$ 10: pi$ 9: ppi$ 7: sippi$ 4: sissippi$ 6: ssippi$ 3: ssissippi$ Pattern "issi" found at (position: suffix): 2: ississippi 5: issippi

Comment on searching for a pattern using suffix arrays
Download Code
Re: searching for a pattern using suffix arrays (close)
by tye (Cardinal) on Oct 08, 2013 at 19:39 UTC

    Your binary search finds an index that matches, not necessarily the first index that matches. So, yes, that was an accidental success. Your loop to search from that point on should stop early, not progress all the way to the end of the list. And your sub isn't properly encapsulated.

    Last point first, your code:

    my $start = bsearch( \@indices, $pattern );

    indicates that your sub should actually start with something more like:

    sub bsearch { my( $indices_hv, $pattern ) = @_

    It would be better design if you didn't access global variables (or file-scoped lexicals) from this sub. To do so, you'd need to pass in at last one more variable.

    You can catch this type of error by minimizing the executable code you place at the top level in your script. This is one of the advantages to the style that I still use, (tye)Re: Stupid question.

    I don't see a clever way to make a binary search end up stopping on the first match. So I'd just leave your binary search as-is and next march backward until you find the first non-match and then march forward until you find the first non-match in that direction.

    I'm guessing that sorting a list of indexes is a recommended approach for making the sorting more efficient (not moving around potentially long strings). But I bet it results in slightly slower Perl code (and more complex code as well).

    I'd also remove the '$' entry immediately after populating that list, not after sorting. Or just change the while condition to, for example:

    while( 1 < length $str ) {

    - tye        

      Greatly appreciate your input Tye!

      I accepted your corrections, and will adopt some of your style practices. I messed with the binary search sub a bit and may have stumbled upon a way to get the first index. I tried several random DNA strings, and it works. Please see below and let me know your thoughts, at your convenience.

      thanks!

      #!/usr/bin/perl use Modern::Perl '2011'; use autodie; my $str = 'atgagatacgattacagattacagatcatacgataccatacgatc$'; my @suff; # hold suffixes while ( 1 < length $str ) { # condition excludes '$' from @suff push @suff, $str; substr ( $str, 0, 1, '' ); } # lexically ordered suffix array my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff; for ( 0 .. $#indices ) { say $indices[$_]+1, ": ", $suff[ $indices[$_] ]; } my $pattern = 'tac'; # find smallest index where pattern matches first n characters of suff +ix my $start = bsearch( \@indices, $pattern ); # get consecutive positions where pattern matches first n chars of suf +fix my @positions; for my $index ( $start .. $#indices ) { if ( $suff[ $indices[ $index ] ] =~ /^$pattern/ ) { push @positions, $indices[ $index ] + 1; } else { last; } } # print results say "\nPattern \"$pattern\" found at (position: suffix):"; for my $pos ( sort{ $a <=> $b } @positions ) { chop( my $s = $suff[ $pos - 1] ); say "$pos: $s"; } sub bsearch { my ( $indref, $pat ) = @_; my $mid; my ( $lo, $hi ) = ( 0, $#$indref ); while ( 1 ) { $mid = int( ( $lo + $hi ) / 2 ); if ( $hi == $lo ) { return $mid } if( ( $pattern cmp $suff[ $indices[ $mid ] ] ) < 0 ) { $hi = $mid; } else { $lo = $mid + 1; } } return "pattern not found in string!"; } #output abualiga:~$ ./suffixArray.pl 21: acagatcatacgataccatacgatc$ 14: acagattacagatcatacgataccatacgatc$ 35: accatacgatc$ 30: acgataccatacgatc$ 40: acgatc$ 8: acgattacagattacagatcatacgataccatacgatc$ 4: agatacgattacagattacagatcatacgataccatacgatc$ 23: agatcatacgataccatacgatc$ 16: agattacagatcatacgataccatacgatc$ 33: ataccatacgatc$ 28: atacgataccatacgatc$ 38: atacgatc$ 6: atacgattacagattacagatcatacgataccatacgatc$ 43: atc$ 25: atcatacgataccatacgatc$ 1: atgagatacgattacagattacagatcatacgataccatacgatc$ 18: attacagatcatacgataccatacgatc$ 11: attacagattacagatcatacgataccatacgatc$ 45: c$ 22: cagatcatacgataccatacgatc$ 15: cagattacagatcatacgataccatacgatc$ 27: catacgataccatacgatc$ 37: catacgatc$ 36: ccatacgatc$ 31: cgataccatacgatc$ 41: cgatc$ 9: cgattacagattacagatcatacgataccatacgatc$ 3: gagatacgattacagattacagatcatacgataccatacgatc$ 32: gataccatacgatc$ 5: gatacgattacagattacagatcatacgataccatacgatc$ 42: gatc$ 24: gatcatacgataccatacgatc$ 17: gattacagatcatacgataccatacgatc$ 10: gattacagattacagatcatacgataccatacgatc$ 20: tacagatcatacgataccatacgatc$ 13: tacagattacagatcatacgataccatacgatc$ 34: taccatacgatc$ 29: tacgataccatacgatc$ 39: tacgatc$ 7: tacgattacagattacagatcatacgataccatacgatc$ 44: tc$ 26: tcatacgataccatacgatc$ 2: tgagatacgattacagattacagatcatacgataccatacgatc$ 19: ttacagatcatacgataccatacgatc$ 12: ttacagattacagatcatacgataccatacgatc$ Pattern "tac" found at (position: suffix): 7: tacgattacagattacagatcatacgataccatacgatc 13: tacagattacagatcatacgataccatacgatc 20: tacagatcatacgataccatacgatc 29: tacgataccatacgatc 34: taccatacgatc 39: tacgatc

        Yeah, I can see value in getting the binary search to find the "first" potential match; especially how you've avoided using the regex within it.

        Many decent improvements.

        bsearch() is still accessing @suff like a global variable. You may have decided that that is fine.

        I'd rewrite:

        $mid = int( ( $lo + $hi ) / 2 ); if ( $hi == $lo ) { return $mid }

        as

        return $hi if $hi == $lo; $mid = int( ( $lo + $hi ) / 2 );

        though not for any compelling reasons.

        Your bsearch() now returns the first potential match (the first suffix after or equal to your search), so you might want to adjust your output for the case of no matches.

        The return value from bsearch() might be an error message but you unconditionally use it as a number. If Modern::Perl doesn't turn on warnings, then you should probably do that.

        if ( $suff[ $indices[ $index ] ] =~ /^$pattern/ ) { push @positions, $indices[ $index ] + 1; } else { last; }

        can be shortened as:

        last if $suff[ $indices[ $index ] ] !~ /^$pattern/; push @positions, $indices[ $index ] + 1;

        - tye        

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1057450]
Approved by Old_Gray_Bear
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (9)
As of 2014-09-01 20:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (17 votes), past polls