Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re: searching for a pattern using suffix arrays (close)

by tye (Cardinal)
on Oct 08, 2013 at 19:39 UTC ( #1057453=note: print w/ replies, xml ) Need Help??


in reply to searching for a pattern using suffix arrays

Your binary search finds an index that matches, not necessarily the first index that matches. So, yes, that was an accidental success. Your loop to search from that point on should stop early, not progress all the way to the end of the list. And your sub isn't properly encapsulated.

Last point first, your code:

my $start = bsearch( \@indices, $pattern );

indicates that your sub should actually start with something more like:

sub bsearch { my( $indices_hv, $pattern ) = @_

It would be better design if you didn't access global variables (or file-scoped lexicals) from this sub. To do so, you'd need to pass in at last one more variable.

You can catch this type of error by minimizing the executable code you place at the top level in your script. This is one of the advantages to the style that I still use, (tye)Re: Stupid question.

I don't see a clever way to make a binary search end up stopping on the first match. So I'd just leave your binary search as-is and next march backward until you find the first non-match and then march forward until you find the first non-match in that direction.

I'm guessing that sorting a list of indexes is a recommended approach for making the sorting more efficient (not moving around potentially long strings). But I bet it results in slightly slower Perl code (and more complex code as well).

I'd also remove the '$' entry immediately after populating that list, not after sorting. Or just change the while condition to, for example:

while( 1 < length $str ) {

- tye        


Comment on Re: searching for a pattern using suffix arrays (close)
Select or Download Code
Re^2: searching for a pattern using suffix arrays (close)
by abualiga (Scribe) on Oct 09, 2013 at 02:19 UTC

    Greatly appreciate your input Tye!

    I accepted your corrections, and will adopt some of your style practices. I messed with the binary search sub a bit and may have stumbled upon a way to get the first index. I tried several random DNA strings, and it works. Please see below and let me know your thoughts, at your convenience.

    thanks!

    #!/usr/bin/perl use Modern::Perl '2011'; use autodie; my $str = 'atgagatacgattacagattacagatcatacgataccatacgatc$'; my @suff; # hold suffixes while ( 1 < length $str ) { # condition excludes '$' from @suff push @suff, $str; substr ( $str, 0, 1, '' ); } # lexically ordered suffix array my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff; for ( 0 .. $#indices ) { say $indices[$_]+1, ": ", $suff[ $indices[$_] ]; } my $pattern = 'tac'; # find smallest index where pattern matches first n characters of suff +ix my $start = bsearch( \@indices, $pattern ); # get consecutive positions where pattern matches first n chars of suf +fix my @positions; for my $index ( $start .. $#indices ) { if ( $suff[ $indices[ $index ] ] =~ /^$pattern/ ) { push @positions, $indices[ $index ] + 1; } else { last; } } # print results say "\nPattern \"$pattern\" found at (position: suffix):"; for my $pos ( sort{ $a <=> $b } @positions ) { chop( my $s = $suff[ $pos - 1] ); say "$pos: $s"; } sub bsearch { my ( $indref, $pat ) = @_; my $mid; my ( $lo, $hi ) = ( 0, $#$indref ); while ( 1 ) { $mid = int( ( $lo + $hi ) / 2 ); if ( $hi == $lo ) { return $mid } if( ( $pattern cmp $suff[ $indices[ $mid ] ] ) < 0 ) { $hi = $mid; } else { $lo = $mid + 1; } } return "pattern not found in string!"; } #output abualiga:~$ ./suffixArray.pl 21: acagatcatacgataccatacgatc$ 14: acagattacagatcatacgataccatacgatc$ 35: accatacgatc$ 30: acgataccatacgatc$ 40: acgatc$ 8: acgattacagattacagatcatacgataccatacgatc$ 4: agatacgattacagattacagatcatacgataccatacgatc$ 23: agatcatacgataccatacgatc$ 16: agattacagatcatacgataccatacgatc$ 33: ataccatacgatc$ 28: atacgataccatacgatc$ 38: atacgatc$ 6: atacgattacagattacagatcatacgataccatacgatc$ 43: atc$ 25: atcatacgataccatacgatc$ 1: atgagatacgattacagattacagatcatacgataccatacgatc$ 18: attacagatcatacgataccatacgatc$ 11: attacagattacagatcatacgataccatacgatc$ 45: c$ 22: cagatcatacgataccatacgatc$ 15: cagattacagatcatacgataccatacgatc$ 27: catacgataccatacgatc$ 37: catacgatc$ 36: ccatacgatc$ 31: cgataccatacgatc$ 41: cgatc$ 9: cgattacagattacagatcatacgataccatacgatc$ 3: gagatacgattacagattacagatcatacgataccatacgatc$ 32: gataccatacgatc$ 5: gatacgattacagattacagatcatacgataccatacgatc$ 42: gatc$ 24: gatcatacgataccatacgatc$ 17: gattacagatcatacgataccatacgatc$ 10: gattacagattacagatcatacgataccatacgatc$ 20: tacagatcatacgataccatacgatc$ 13: tacagattacagatcatacgataccatacgatc$ 34: taccatacgatc$ 29: tacgataccatacgatc$ 39: tacgatc$ 7: tacgattacagattacagatcatacgataccatacgatc$ 44: tc$ 26: tcatacgataccatacgatc$ 2: tgagatacgattacagattacagatcatacgataccatacgatc$ 19: ttacagatcatacgataccatacgatc$ 12: ttacagattacagatcatacgataccatacgatc$ Pattern "tac" found at (position: suffix): 7: tacgattacagattacagatcatacgataccatacgatc 13: tacagattacagatcatacgataccatacgatc 20: tacagatcatacgataccatacgatc 29: tacgataccatacgatc 34: taccatacgatc 39: tacgatc

      Yeah, I can see value in getting the binary search to find the "first" potential match; especially how you've avoided using the regex within it.

      Many decent improvements.

      bsearch() is still accessing @suff like a global variable. You may have decided that that is fine.

      I'd rewrite:

      $mid = int( ( $lo + $hi ) / 2 ); if ( $hi == $lo ) { return $mid }

      as

      return $hi if $hi == $lo; $mid = int( ( $lo + $hi ) / 2 );

      though not for any compelling reasons.

      Your bsearch() now returns the first potential match (the first suffix after or equal to your search), so you might want to adjust your output for the case of no matches.

      The return value from bsearch() might be an error message but you unconditionally use it as a number. If Modern::Perl doesn't turn on warnings, then you should probably do that.

      if ( $suff[ $indices[ $index ] ] =~ /^$pattern/ ) { push @positions, $indices[ $index ] + 1; } else { last; }

      can be shortened as:

      last if $suff[ $indices[ $index ] ] !~ /^$pattern/; push @positions, $indices[ $index ] + 1;

      - tye        

        Thanks again Tye!

        I incorporated your changes: bsearch() is no longer accessing global variables, output is adjusted for a 'no match' case, and Modern::Perl does turn on warnings by default.

        #!/usr/bin/perl use Modern::Perl '2011'; use autodie; # pattern searching using suffix arrays and binary search open my $fh, '<', shift; chomp( my( $pattern, $str ) = <$fh> ); $str .= '$'; my @suff; # hold suffixes while ( 1 < length $str ) { # condition excludes '$' from @suff push @suff, $str; substr ( $str, 0, 1, '' ); } # lexically ordered suffix array my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff; #for ( 0 .. $#indices ) { # say $indices[$_]+1, ": ", $suff[ $indices[$_] ]; #} my $start = bsearch( \ @indices, $pattern, \ @suff ); # get consecutive positions, if any, # where pattern matches first n chars of suffix. my @positions; for my $index ( $start .. $#indices ) { last if $suff[ $indices[ $index ] ] !~ /^$pattern/; push @positions, $indices[ $index ] + 1; # omit +1 if 0-based inde +xing } # print results if ( @positions ) { # check if a match exists at all say "\nPattern \"$pattern\" found at position:"; for my $pos ( sort{ $a <=> $b } @positions ) { #chop( my $s = $suff[ $pos - 1] ); #suffix print "$pos "; } } else { say "\nPattern \"$pattern\" not found in string"; } print "\n"; # binary search. # find first potential match, ie first suffix after or equal to patter +n, # such that pattern potentially matches first n characters of suffix. sub bsearch { my ( $indref, $pat, $sufref ) = @_; my $mid; my ( $lo, $hi ) = ( 0, $#$indref ); while ( 1 ) { $mid = int( ( $lo + $hi ) / 2 ); return $mid if $hi == $lo; if( ( $pattern cmp $$sufref[ $$indref[ $mid ] ] ) < 0 ) { $hi = $mid; } else { $lo = $mid + 1; } } }

        I did a simple benchmark relative to what you would call a 'naive' pattern search, after checking that both give the same results, with a 9-char pattern and 8569-char string. I like the speed up.

        #!/usr/bin/perl use Modern::Perl '2011'; use autodie; use Benchmark qw/ timethese /; # benchmark pattern searching using naive approach vs suffix arrays open my $fh, '<', shift; chomp( my( $pattern, $str ) = <$fh> ); my $naive = sub { my @positions; while ( $str =~ /(?=($pattern))/g ) { push @positions, $-[0];; } #say "@positions"; }; my $sufbin = sub { $str .= '$'; my @suff; # hold suffixes while ( 1 < length $str ) { # condition excludes '$' from @suff push @suff, $str; substr ( $str, 0, 1, '' ); } # lexically ordered suffix array my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff; #for ( 0 .. $#indices ) { # say $indices[$_]+1, ": ", $suff[ $indices[$_] ]; #} my $start = bsearch( \ @indices, $pattern, \ @suff ); # get consecutive positions, if any, # where pattern matches first n chars of suffix. my @positions; for my $index ( $start .. $#indices ) { last if $suff[ $indices[ $index ] ] !~ /^$pattern/; push @positions, $indices[ $index ];# + 1; # omit +1 if 0-base +d indexing } # binary search. # find first potential match, ie first suffix after or equal to pa +ttern, # such that pattern potentially matches first n characters of suff +ix. sub bsearch { my ( $indref, $pat, $sufref ) = @_; my $mid; my ( $lo, $hi ) = ( 0, $#$indref ); while ( 1 ) { $mid = int( ( $lo + $hi ) / 2 ); return $mid if $hi == $lo; if( ( $pattern cmp $$sufref[ $$indref[ $mid ] ] ) < 0 ) { $hi = $mid; } else { $lo = $mid + 1; } } } }; timethese( -5, { Suffixbinary => $sufbin, Naive => $naive, } ); #output abualiga:~$ ./benchmarkPatternSearch.pl patternSearchData.txt Benchmark: running Naive, Suffixbinary for at least 5 CPU seconds... Naive: 6 wallclock secs ( 5.32 usr + 0.00 sys = 5.32 CPU) @ 92 +5.75/s (n=4925) Suffixbinary: 6 wallclock secs ( 5.33 usr + 0.00 sys = 5.33 CPU) @ +232758.91/s (n=1240605)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1057453]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (9)
As of 2014-07-28 08:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (193 votes), past polls