Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris

Re^4: suffix array efficiency

by oiskuu (Hermit)
on Jan 20, 2014 at 19:50 UTC ( #1071374=note: print w/replies, xml ) Need Help??

in reply to Re^3: suffix array efficiency
in thread suffix array efficiency

The degenerate (1-letter) case:

20000 1 -5 kennethk 0.154/s RobertCraven 7.55/s xxx 23.0/s hdb 28.5/s
The 4-letter case packs only 8 bits of data into 32-bit values and therefore performs poorly. With genomic data one would of course pack differently.

Much more importantly: is there no XS module for constructing suffix arrays? Seems like a very worthy problem.

Replies are listed 'Best First'.
Re^5: suffix array efficiency
by kennethk (Abbot) on Jan 20, 2014 at 20:16 UTC
    I'm not a bioinformatics guy, but if massive runs are actually a significant concern in this system, it's pretty easy to fix the worst-case behavior by adaptively boosting the window, so that worst case regresses to hdb's solution:
    sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; $step *= 2; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
    which gives a degenerate-case comparison of
    RobertCraven 2.90/s kennethk 12.8/s xxx 13.1/s hdb 14.2/s
    and comparable behavior in the rest of the benchmarks.

    I also note your computer is twice as fast as mine. Wanna trade?

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1071374]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (7)
As of 2018-06-20 06:00 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (116 votes). Check out past polls.