Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Re^2: suffix array efficiency

by oiskuu (Hermit)
on Jan 20, 2014 at 18:44 UTC ( #1071365=note: print w/replies, xml ) Need Help??

in reply to Re: suffix array efficiency
in thread suffix array efficiency

Please include comparison(20000,1,10); in your bench.
Your caching strategy blows up in the degenerate case where $string eq "aaaa"...

I'd suggest something simpler:

sub xxx { my $string = shift; my @cache = unpack q(x*X3 .@0/(NX3)), $string."\0\0\0"; my @indices = sort {$cache[$a] <=> $cache[$b] || substr($string, $a) cmp substr($string, $b) } 0 .. $#cache; $_++ for @indices; return @indices; }
Probably only works in C locale. Correctness unverified.

Update: Oops, copy-pasta fix.

Replies are listed 'Best First'.
Re^3: suffix array efficiency
by kennethk (Abbot) on Jan 20, 2014 at 19:09 UTC
    As mentioned in the parent, I found a bug in the caching code. I'm not worried about the degenerate case, though obviously my approach incurs a penalty in worst case scenario. I was going to benchmark your code, but couldn't get it to pass my updated validation routine:
    sub verify { for my $letters (1, 4, 26) { my $input = join '', map { ('a'..'z')[rand($letters)] } 0..200 +00-1; my @RobertCraven = RobertCraven($input); my @hdb = hdb($input); my @kennethk = kennethk($input); my @xxx = xxx($input); die "Mismatch 1 ($letters)" if @RobertCraven != @hdb; die "Mismatch 2 ($letters)" if @RobertCraven != @kennethk; die "Mismatch 3 ($letters)" if @RobertCraven != @xxx; for my $i (0 .. $#RobertCraven) { die "Char mismatch 1($letters, $i)" if $RobertCraven[$i] n +e $hdb[$i]; die "Char mismatch 2($letters, $i)" if $RobertCraven[$i] n +e $kennethk[$i]; die "Char mismatch 3($letters, $i)" if $RobertCraven[$i] n +e $xxx[$i]; } } }
    Does it pass on your machine?

    Update: Parental code now passes tests.

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^3: suffix array efficiency
by kennethk (Abbot) on Jan 20, 2014 at 19:23 UTC
    Benchmark results:
    20000 26 20 Rate hdb RobertCraven kennethk xx +x hdb 2.57/s -- -8% -55% -84 +% RobertCraven 2.81/s 9% -- -51% -82 +% kennethk 5.77/s 124% 106% -- -64 +% xxx 15.8/s 515% 464% 174% - +- 20000 4 20 Rate hdb RobertCraven xxx kenneth +k hdb 2.82/s -- -11% -46% -54 +% RobertCraven 3.18/s 13% -- -40% -48 +% xxx 5.28/s 87% 66% -- -13 +% kennethk 6.08/s 115% 91% 15% - +- 200000 26 5 s/iter kennethk xxx kennethk 2.05 -- -32% xxx 1.40 47% -- 200000 4 5 s/iter xxx kennethk xxx 24.6 -- -92% kennethk 2.07 1090% --
    Never been particularly proficient with pack, but it looks like your approach falls apart with the restricted character set. OTOH, that's an impressive result for [20000,26,20]. I imagine optimizing your unpack could yield some impressive optimization.

    And, FYI, the windowing approach is about 30x slower than other options with only 1 letter.

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      The degenerate (1-letter) case:

      20000 1 -5 kennethk 0.154/s RobertCraven 7.55/s xxx 23.0/s hdb 28.5/s
      The 4-letter case packs only 8 bits of data into 32-bit values and therefore performs poorly. With genomic data one would of course pack differently.

      Much more importantly: is there no XS module for constructing suffix arrays? Seems like a very worthy problem.

        I'm not a bioinformatics guy, but if massive runs are actually a significant concern in this system, it's pretty easy to fix the worst-case behavior by adaptively boosting the window, so that worst case regresses to hdb's solution:
        sub kennethk { my $string = shift; my $off = 0; my $step = 10; my @indices = sort {$off = 0; until (substr($string, $a+$off, $step) cmp sub +str($string, $b+$off, $step)){ $off += $step; $step *= 2; } + } 0 .. length($string) - 1; $_++ for @indices; return @indices; }
        which gives a degenerate-case comparison of
        RobertCraven 2.90/s kennethk 12.8/s xxx 13.1/s hdb 14.2/s
        and comparable behavior in the rest of the benchmarks.

        I also note your computer is twice as fast as mine. Wanna trade?

        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1071365]
[Eily]: I still don't understand how the Turkish AA fit into the German+Czech joke though :P
[LanX]: new Firefox + cb sidebar do random auto expand on submit
[LanX]: probably need to start pm discussion
[LanX]: they have a constitutional referendum in turkey, kind of "do you want a dictator" and everybody opting no gets problems ...
[Corion]: LanX: Random Auto Expand?
[Corion]: LanX: Well, everybody opting "yes" will also get problems. The question is more like "Do you want problems now or problems later?"
[LanX]: and the AA had posters with a big say NO with a small "to alcohol"
[ambrus]: LanX: is it the kind of free and secret vote where there's only one box you can check?

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (9)
As of 2017-03-27 12:14 GMT
Find Nodes?
    Voting Booth?
    Should Pluto Get Its Planethood Back?

    Results (320 votes). Check out past polls.