Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re^4: suffix array efficiency

by oiskuu (Friar)
on Jan 20, 2014 at 19:50 UTC ( #1071374=note: print w/ replies, xml ) Need Help??


in reply to Re^3: suffix array efficiency
in thread suffix array efficiency

The degenerate (1-letter) case:

20000 1 -5 kennethk 0.154/s RobertCraven 7.55/s xxx 23.0/s hdb 28.5/s
The 4-letter case packs only 8 bits of data into 32-bit values and therefore performs poorly. With genomic data one would of course pack differently.

Much more importantly: is there no XS module for constructing suffix arrays? Seems like a very worthy problem.


Comment on Re^4: suffix array efficiency
Download Code
Re^5: suffix array efficiency
by kennethk (Abbot) on Jan 20, 2014 at 20:16 UTC
    I'm not a bioinformatics guy, but if massive runs are actually a significant concern in this system, it's pretty easy to fix the worst-case behavior by adaptively boosting the window, so that worst case regresses to hdb's solution:
    sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; $step *= 2; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
    which gives a degenerate-case comparison of
    RobertCraven 2.90/s kennethk 12.8/s xxx 13.1/s hdb 14.2/s
    and comparable behavior in the rest of the benchmarks.

    I also note your computer is twice as fast as mine. Wanna trade?


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1071374]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (12)
As of 2015-07-06 19:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (81 votes), past polls