Greatly appreciate your input Tye!
I accepted your corrections, and will adopt some of your style practices. I messed with the binary search sub a bit and may have stumbled upon a way to get the first index. I tried several random DNA strings, and it works. Please see below and let me know your thoughts, at your convenience.
thanks!
#!/usr/bin/perl
use Modern::Perl '2011';
use autodie;
my $str = 'atgagatacgattacagattacagatcatacgataccatacgatc$';
my @suff; # hold suffixes
while ( 1 < length $str ) { # condition excludes '$' from @suff
push @suff, $str;
substr ( $str, 0, 1, '' );
}
# lexically ordered suffix array
my @indices = sort { $suff[$a] cmp $suff[$b] } 0 .. $#suff;
for ( 0 .. $#indices ) {
say $indices[$_]+1, ": ", $suff[ $indices[$_] ];
}
my $pattern = 'tac';
# find smallest index where pattern matches first n characters of suff
+ix
my $start = bsearch( \@indices, $pattern );
# get consecutive positions where pattern matches first n chars of suf
+fix
my @positions;
for my $index ( $start .. $#indices ) {
if ( $suff[ $indices[ $index ] ] =~ /^$pattern/ ) {
push @positions, $indices[ $index ] + 1;
}
else {
last;
}
}
# print results
say "\nPattern \"$pattern\" found at (position: suffix):";
for my $pos ( sort{ $a <=> $b } @positions ) {
chop( my $s = $suff[ $pos - 1] );
say "$pos: $s";
}
sub bsearch {
my ( $indref, $pat ) = @_;
my $mid;
my ( $lo, $hi ) = ( 0, $#$indref );
while ( 1 ) {
$mid = int( ( $lo + $hi ) / 2 );
if ( $hi == $lo ) { return $mid }
if( ( $pattern cmp $suff[ $indices[ $mid ] ] ) < 0 ) {
$hi = $mid;
}
else {
$lo = $mid + 1;
}
}
return "pattern not found in string!";
}
#output
abualiga:~$ ./suffixArray.pl
21: acagatcatacgataccatacgatc$
14: acagattacagatcatacgataccatacgatc$
35: accatacgatc$
30: acgataccatacgatc$
40: acgatc$
8: acgattacagattacagatcatacgataccatacgatc$
4: agatacgattacagattacagatcatacgataccatacgatc$
23: agatcatacgataccatacgatc$
16: agattacagatcatacgataccatacgatc$
33: ataccatacgatc$
28: atacgataccatacgatc$
38: atacgatc$
6: atacgattacagattacagatcatacgataccatacgatc$
43: atc$
25: atcatacgataccatacgatc$
1: atgagatacgattacagattacagatcatacgataccatacgatc$
18: attacagatcatacgataccatacgatc$
11: attacagattacagatcatacgataccatacgatc$
45: c$
22: cagatcatacgataccatacgatc$
15: cagattacagatcatacgataccatacgatc$
27: catacgataccatacgatc$
37: catacgatc$
36: ccatacgatc$
31: cgataccatacgatc$
41: cgatc$
9: cgattacagattacagatcatacgataccatacgatc$
3: gagatacgattacagattacagatcatacgataccatacgatc$
32: gataccatacgatc$
5: gatacgattacagattacagatcatacgataccatacgatc$
42: gatc$
24: gatcatacgataccatacgatc$
17: gattacagatcatacgataccatacgatc$
10: gattacagattacagatcatacgataccatacgatc$
20: tacagatcatacgataccatacgatc$
13: tacagattacagatcatacgataccatacgatc$
34: taccatacgatc$
29: tacgataccatacgatc$
39: tacgatc$
7: tacgattacagattacagatcatacgataccatacgatc$
44: tc$
26: tcatacgataccatacgatc$
2: tgagatacgattacagattacagatcatacgataccatacgatc$
19: ttacagatcatacgataccatacgatc$
12: ttacagattacagatcatacgataccatacgatc$
Pattern "tac" found at (position: suffix):
7: tacgattacagattacagatcatacgataccatacgatc
13: tacagattacagatcatacgataccatacgatc
20: tacagatcatacgataccatacgatc
29: tacgataccatacgatc
34: taccatacgatc
39: tacgatc
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.
|