Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re^2: suffix array efficiency

by oiskuu (Friar)
on Jan 20, 2014 at 18:44 UTC ( #1071365=note: print w/ replies, xml ) Need Help??


in reply to Re: suffix array efficiency
in thread suffix array efficiency

Please include comparison(20000,1,10); in your bench.
Your caching strategy blows up in the degenerate case where $string eq "aaaa"...

I'd suggest something simpler:

sub xxx { my $string = shift; my @cache = unpack q(x*X3 .@0/(NX3)), $string."\0\0\0"; my @indices = sort {$cache[$a] <=> $cache[$b] || substr($string, $a) cmp substr($string, $b) } 0 .. $#cache; $_++ for @indices; return @indices; }
Probably only works in C locale. Correctness unverified.

Update: Oops, copy-pasta fix.


Comment on Re^2: suffix array efficiency
Select or Download Code
Replies are listed 'Best First'.
Re^3: suffix array efficiency
by kennethk (Abbot) on Jan 20, 2014 at 19:09 UTC
    As mentioned in the parent, I found a bug in the caching code. I'm not worried about the degenerate case, though obviously my approach incurs a penalty in worst case scenario. I was going to benchmark your code, but couldn't get it to pass my updated validation routine:
    sub verify { for my $letters (1, 4, 26) { my $input = join '', map { ('a'..'z')[rand($letters)] } 0..200 +00-1; my @RobertCraven = RobertCraven($input); my @hdb = hdb($input); my @kennethk = kennethk($input); my @xxx = xxx($input); die "Mismatch 1 ($letters)" if @RobertCraven != @hdb; die "Mismatch 2 ($letters)" if @RobertCraven != @kennethk; die "Mismatch 3 ($letters)" if @RobertCraven != @xxx; for my $i (0 .. $#RobertCraven) { die "Char mismatch 1($letters, $i)" if $RobertCraven[$i] n +e $hdb[$i]; die "Char mismatch 2($letters, $i)" if $RobertCraven[$i] n +e $kennethk[$i]; die "Char mismatch 3($letters, $i)" if $RobertCraven[$i] n +e $xxx[$i]; } } }
    Does it pass on your machine?

    Update: Parental code now passes tests.


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^3: suffix array efficiency
by kennethk (Abbot) on Jan 20, 2014 at 19:23 UTC
    Benchmark results:
    20000 26 20 Rate hdb RobertCraven kennethk xx +x hdb 2.57/s -- -8% -55% -84 +% RobertCraven 2.81/s 9% -- -51% -82 +% kennethk 5.77/s 124% 106% -- -64 +% xxx 15.8/s 515% 464% 174% - +- 20000 4 20 Rate hdb RobertCraven xxx kenneth +k hdb 2.82/s -- -11% -46% -54 +% RobertCraven 3.18/s 13% -- -40% -48 +% xxx 5.28/s 87% 66% -- -13 +% kennethk 6.08/s 115% 91% 15% - +- 200000 26 5 s/iter kennethk xxx kennethk 2.05 -- -32% xxx 1.40 47% -- 200000 4 5 s/iter xxx kennethk xxx 24.6 -- -92% kennethk 2.07 1090% --
    Never been particularly proficient with pack, but it looks like your approach falls apart with the restricted character set. OTOH, that's an impressive result for [20000,26,20]. I imagine optimizing your unpack could yield some impressive optimization.

    And, FYI, the windowing approach is about 30x slower than other options with only 1 letter.


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      The degenerate (1-letter) case:

      20000 1 -5 kennethk 0.154/s RobertCraven 7.55/s xxx 23.0/s hdb 28.5/s
      The 4-letter case packs only 8 bits of data into 32-bit values and therefore performs poorly. With genomic data one would of course pack differently.

      Much more importantly: is there no XS module for constructing suffix arrays? Seems like a very worthy problem.

        I'm not a bioinformatics guy, but if massive runs are actually a significant concern in this system, it's pretty easy to fix the worst-case behavior by adaptively boosting the window, so that worst case regresses to hdb's solution:
        sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; $step *= 2; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
        which gives a degenerate-case comparison of
        RobertCraven 2.90/s kennethk 12.8/s xxx 13.1/s hdb 14.2/s
        and comparable behavior in the rest of the benchmarks.

        I also note your computer is twice as fast as mine. Wanna trade?


        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1071365]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (9)
As of 2015-07-08 05:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (94 votes), past polls