Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

sorting array of arrays reference

by kimlid2810 (Acolyte)
on Oct 25, 2013 at 17:27 UTC ( [id://1059713]=perlquestion: print w/replies, xml ) Need Help??

kimlid2810 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks. Your wisdom is needed! :) i have an array of many many way too many arrays. each element of every minion array is of various types, but the third element of every minion array in the master array is always a positive integer. this master array, i pass it to a function as a reference. i desperately need you to recommend the fastest way, with which, i can keep these three minion arrays of the master array, that contain the elements with the highest integers and discard the rest of the minion arrays... i.e.:
my @masterArray = ( ["this", "that", 12563, "something", "else"], ["this", "that", 10, "something", "else"], ["this", "that", 1, "something", "else"], ["this", "that", 125638, "something", "else"], ["this", "that", 300000, "something", "else"], ); subDiscarder(\@masterArray); # # AND THEN @masterArray should become: # @masterArray = ( ["this", "that", 300000, "something", "else"], ["this", "that", 125638, "something", "else"], ["this", "that", 12563, "something", "else"], );
being sorted is not of great importance. What is crucial, is the speed. I could figure out ways with foreaches and constant assignments of every element, but if there is a trully fast way to do it, or a module that can do this, i would appreciate it if someone could give me a hint towards there. thank you for your time. :)

Replies are listed 'Best First'.
Re: sorting array of arrays reference
by salva (Canon) on Oct 25, 2013 at 17:38 UTC
    Use Sort::Key::Top which implements the Selection Algorithm:
    use Sort::Key::Top qw(rikeytopsort); my @sorted = rikeytopsort { $_->[2] } 3 => @masterArray; # rikeytopsort means... # r => reverse (descending) order # i => the key is an integer # key => use a key extraction block # top => get the top n elements # sort => sorted
Re: sorting array of arrays reference
by choroba (Cardinal) on Oct 26, 2013 at 06:38 UTC
    Just keep the best three results so far:
    use List::Util qw{ first }; sub subDiscarder { my @top3 = map [0, 0, -inf], 0 .. 2; for my $minion (@{ $_[0] }) { next if $minion->[2] < $top3[2][2]; my $idx = first { $top3[$_][2] <= $minion->[2] } 0 .. 2; splice @top3, $idx, 0, $minion; splice @top3, 3; } @{ $_[0] } = @top3; }
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: sorting array of arrays reference
by Kenosis (Priest) on Oct 25, 2013 at 17:55 UTC

    Perhaps the following will be helpful:

    use strict; use warnings; use Data::Dumper; my @masterArray = ( [ "this", "that", 12563, "something", "else" ], [ "this", "that", 10, "something", "else" ], [ "this", "that", 1, "something", "else" ], [ "this", "that", 125638, "something", "else" ], [ "this", "that", 300000, "something", "else" ], ); subDiscarder( \@masterArray ); print Dumper \@masterArray; sub subDiscarder { my ($arrayRef) = @_; my %nums = map $_ => 1, ( sort { $b <=> $a } map $_->[2], @$arrayRef )[ 0 .. 2 ]; @$arrayRef = grep exists $nums{ $_->[2] }, @$arrayRef; }

    Output:

    $VAR1 = [ [ 'this', 'that', 12563, 'something', 'else' ], [ 'this', 'that', 125638, 'something', 'else' ], [ 'this', 'that', 300000, 'something', 'else' ] ];

    The subroutine sorts and then takes the top three largest 'minion' numbers to populate a hash, then greps through the 'master' array to allow only the three 'minion' arrays pass which contain one of those three top values.

Re: sorting array of arrays reference
by Anonymous Monk on Oct 25, 2013 at 17:32 UTC
Re: sorting array of arrays reference
by salva (Canon) on Oct 26, 2013 at 16:46 UTC
    A concise O(N) solution:
    my @top = @masterArray[0..2]; @top = (sort { $b->[2] <=> $a->[2] } @top, $masterArray[$_])[0..2] for + 3..$#masterArray;
    Though, I have no idea how fast it would be.
Re: sorting array of arrays reference
by Laurent_R (Canon) on Oct 25, 2013 at 17:58 UTC

    If you need only the three top elements, then sorting the array of arrays is overkill and will be inefficient if the master array is large. Not good if speed is really important.

    I would go for a selection algorithm could be done in two different ways: (1) the simpler (but slower) is to walk through the array once and memorize the array ref of the top element; then walk again through and find the next to top element, and a third time; (2) slightly more complicated is to walk through the array only once and to maintain as you go an auxiliary array of the three top elements; maintaining this auxiliary array is somewhat complicated, but nothing insurmountable.

      how is this auxiliary array going to be maintained?

        OK, a bit more time now, this is one possible way of doing it:

        use strict; use warnings; use Data::Dumper; my @masterArray = ( ["this", "that", 12563, "something", "else"], ["this", "that", 10, "something", "else"], ["this", "that", 1, "something", "else"], ["this", "that", 125638, "something", "else"], ["this", "that", 300000, "something", "else"], ); my @top3 = sort {$b->[2] <=> $a->[2]} @masterArray[0..2]; my $min_top = $top3[2][2]; for my $sub_aref (@masterArray [3..$#masterArray]) { next if $sub_aref <= $min_top; @top3 = (sort {$b->[2] <=> $a->[2]} @top3, $sub_aref)[0..2]; $min_top = $top3[2][2]; } print Dumper @top3;

        This yields the following result:

        $ perl subdiscard.pl $VAR1 = [ 'this', 'that', 300000, 'something', 'else' ]; $VAR2 = [ 'this', 'that', 125638, 'something', 'else' ]; $VAR3 = [ 'this', 'that', 12563, 'something', 'else' ];

        A more general solution might be like this:

        use strict; use warnings; use Data::Dumper; my $nb_elements = shift; chomp $nb_elements ; my @masterArray; push @masterArray, ["", "", int rand (1e7), ""] for 1..$nb_elements; # print Dumper \@masterArray; my @top3 = sort {$b->[2] <=> $a->[2]} @masterArray[0..2]; my $min_top = $top3[2][2]; $nb_elements--; for my $sub_aref (@masterArray [3..$nb_elements]) { next if $sub_aref->[2] <= $min_top; @top3 = (sort {$b->[2] <=> $a->[2]} @top3, $sub_aref)[0..2]; $min_top = $top3[2][2]; } print Dumper \@top3;

        With one million records, the execution time is about 2.5 seconds:

        $ time perl subdiscard2.pl 1000000 $VAR1 = [ [ '', '', 9999996, '' ], [ '', '', 9999993, '' ], [ '', '', 9999990, '' ] ]; real 0m2.497s user 0m2.386s sys 0m0.108s

        Sorting the original array and taking the first 3 elements takes about 3 times longer:

        $ time perl subdiscard3.pl 1000000 $VAR1 = [ [ '', '', 9999980, '' ], [ '', '', 9999955, '' ], [ '', '', 9999944, '' ] ]; real 0m7.605s user 0m7.518s sys 0m0.093s

        But, in fact, in the 2.5 seconds taken by the program above, most of it (more than 2.2 seconds) is used for populating the array with random values, so that the difference between the algorithm presented above and a pure sort is much larger than it appears, probably at least a factor of 10. I'll do a real benchmark later if I can find the time.

        Nothing really complicated, But I just can't do it right now. No time.
      Think of this as the first three passes of a bubble sort. I like it.
      Bill
Re: sorting array of arrays reference
by hdb (Monsignor) on Oct 25, 2013 at 21:48 UTC
Re: sorting array of arrays reference
by Laurent_R (Canon) on Oct 26, 2013 at 21:03 UTC

    OK, now I have benchmarked the sort and the walk through only once solution (two variants):

    use strict; use warnings; use Benchmark qw(timethese); my $nb_elements = 2e6; my @top3; my @startArray; push @startArray, ["", "", int rand (1e7)] for 1..$nb_elements; sub discard_sort{ my @masterArray = @startArray; my @top3 = sort {$b->[2] <=> $a->[2]} @masterArray[0..2]; my $min_top = $top3[2][2]; $nb_elements--; for my $sub_aref (@masterArray [3..$nb_elements]) { next if $sub_aref->[2] <= $min_top; @top3 = (sort {$b->[2] <=> $a->[2]} $sub_aref, @top3)[0..2]; $min_top = $top3[2][2]; } } sub discard_manual{ my @masterArray = @startArray; @top3 = sort {$b->[2] <=> $a->[2]} @masterArray[0..2]; my $min_top = $top3[2][2]; $nb_elements--; for my $sub_aref (@masterArray [3..$nb_elements]) { next if $sub_aref->[2] <= $min_top; if ($sub_aref->[2] > $top3[0][2]) { unshift @top3, $sub_aref; pop @top3; } elsif ( $sub_aref->[2] > $top3[1][2]) { $top3[2] = $top3[1]; $top3[1] = $sub_aref; } else { pop @top3; push @top3, $sub_aref; } $min_top = $top3[2][2]; } } sub sort_strategy { my @masterArray = @startArray; my @top3 = (sort {$b->[2] <=> $a->[2]} @masterArray)[0..2]; } timethese (5, { 'walk once with sort' => \&discard_sort, 'walk once manual insert' => \&discard_manual, 'sort' => \&sort_strategy});

    The results show pretty clearly that the walk-through once solution which I originally proposed is much much faster than a sort on the full array:

    Benchmark: timing 5 iterations of sort, walk once manual insert, walk +once with sort... sort: 60 wallclock secs (60.37 usr + 0.02 sys = 60.39 CPU) @ 0.08/s +(n=5) walk once manual insert: 3 wallclock secs ( 2.79 usr + 0.00 sys = 2 +.79 CPU) @ 1.79/s (n=5) walk once with sort: 3 wallclock secs ( 2.76 usr + 0.00 sys = 2.76 +CPU) @ 1.81/s (n=5)

    On a dataset of 2 million records, the walk through once solution is about 21 times faster than the solution sorting the full array. For the walk through once solution, I tried two variants: using sort to maintain the auxiliary array and doing a manual insertion sort on this auxiliary array. It turns out there is very little difference between these two solutions, but the simpler sort function solution is actually slightly better, so it was not worth the effort of trying a manual insertion sort.

    The larger the dataset, the larger the advantage of the walk through once strategy: with 5 million records, the walk-through once solution is 24.6 faster than the full array sort:

    Benchmark: timing 5 iterations of sort, walk once manual insert, walk +once with sort... sort: 170 wallclock secs (169.96 usr + 0.03 sys = 169.99 CPU) @ 0.03 +/s (n=5) walk once manual insert: 7 wallclock secs ( 6.90 usr + 0.03 sys = 6 +.93 CPU) @ 0.72/s (n=5) walk once with sort: 7 wallclock secs ( 6.86 usr + 0.00 sys = 6.86 +CPU) @ 0.73/s (n=5)

    Note that, in order to produce realistic benchmark figures for the sort solution, I had to make a copy of the original array, so that the sort would not work on an already sorted array. Part of the recorded durations is therefore wasted making a copy of the array. If we were to subtract this part from the recorded duration, the advantage of the walk through once solution would be even larger. The copying of the 5-million record array takes about 2 seconds, so that the corrected figures would be 168 sec. for the sort, and 5 sec. for the walk through once solutions, meaning that the later is actually about 33 times faster than the sort.

      hats off Laurent. I really thank you all for your replies and time. laurent, i ll suggest my supervisor to send you a check or bitcoins :) <3
Re: sorting array of arrays reference
by Anonymous Monk on Oct 25, 2013 at 23:15 UTC

    Trading space for time (note that delete() on an array is deprecated):

    my @sort; $sort[$_->[2]] = $_ for @masterArray; my @top = map {delete $sort[-1]} 1..3; say Dumper \@top;

    Output:

    $VAR1 = [ [ 'this', 'that', 300000, 'something', 'else' ], [ 'this', 'that', 125638, 'something', 'else' ], [ 'this', 'that', 12563, 'something', 'else' ] ];

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1059713]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2024-03-29 07:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found