kimlid2810 has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks. Your wisdom is needed! :)
i have an array of many many way too many arrays. each element of every minion array is of various types, but the third element of every minion array in the master array is always a positive integer. this master array, i pass it to a function as a reference. i desperately need you to recommend the fastest way, with which, i can keep these three minion arrays of the master array, that contain the elements with the highest integers and discard the rest of the minion arrays... i.e.:
my @masterArray = (
["this", "that", 12563, "something", "else"],
["this", "that", 10, "something", "else"],
["this", "that", 1, "something", "else"],
["this", "that", 125638, "something", "else"],
["this", "that", 300000, "something", "else"],
);
subDiscarder(\@masterArray);
#
# AND THEN @masterArray should become:
#
@masterArray = (
["this", "that", 300000, "something", "else"],
["this", "that", 125638, "something", "else"],
["this", "that", 12563, "something", "else"],
);
being sorted is not of great importance. What is crucial, is the speed. I could figure out ways with foreaches and constant assignments of every element, but if there is a trully fast way to do it, or a module that can do this, i would appreciate it if someone could give me a hint towards there. thank you for your time. :)
Re: sorting array of arrays reference
by salva (Canon) on Oct 25, 2013 at 17:38 UTC
|
use Sort::Key::Top qw(rikeytopsort);
my @sorted = rikeytopsort { $_->[2] } 3 => @masterArray;
# rikeytopsort means...
# r => reverse (descending) order
# i => the key is an integer
# key => use a key extraction block
# top => get the top n elements
# sort => sorted
| [reply] [Watch: Dir/Any] [d/l] |
Re: sorting array of arrays reference
by choroba (Cardinal) on Oct 26, 2013 at 06:38 UTC
|
Just keep the best three results so far:
use List::Util qw{ first };
sub subDiscarder {
my @top3 = map [0, 0, -inf], 0 .. 2;
for my $minion (@{ $_[0] }) {
next if $minion->[2] < $top3[2][2];
my $idx = first { $top3[$_][2] <= $minion->[2] } 0 .. 2;
splice @top3, $idx, 0, $minion;
splice @top3, 3;
}
@{ $_[0] } = @top3;
}
| [reply] [Watch: Dir/Any] [d/l] |
Re: sorting array of arrays reference
by Kenosis (Priest) on Oct 25, 2013 at 17:55 UTC
|
use strict;
use warnings;
use Data::Dumper;
my @masterArray = (
[ "this", "that", 12563, "something", "else" ],
[ "this", "that", 10, "something", "else" ],
[ "this", "that", 1, "something", "else" ],
[ "this", "that", 125638, "something", "else" ],
[ "this", "that", 300000, "something", "else" ],
);
subDiscarder( \@masterArray );
print Dumper \@masterArray;
sub subDiscarder {
my ($arrayRef) = @_;
my %nums =
map $_ => 1,
( sort { $b <=> $a } map $_->[2], @$arrayRef )[ 0 .. 2 ];
@$arrayRef = grep exists $nums{ $_->[2] }, @$arrayRef;
}
Output:
$VAR1 = [
[
'this',
'that',
12563,
'something',
'else'
],
[
'this',
'that',
125638,
'something',
'else'
],
[
'this',
'that',
300000,
'something',
'else'
]
];
The subroutine sorts and then takes the top three largest 'minion' numbers to populate a hash, then greps through the 'master' array to allow only the three 'minion' arrays pass which contain one of those three top values. | [reply] [Watch: Dir/Any] [d/l] [select] |
Re: sorting array of arrays reference
by Anonymous Monk on Oct 25, 2013 at 17:32 UTC
|
| [reply] [Watch: Dir/Any] |
Re: sorting array of arrays reference
by salva (Canon) on Oct 26, 2013 at 16:46 UTC
|
my @top = @masterArray[0..2];
@top = (sort { $b->[2] <=> $a->[2] } @top, $masterArray[$_])[0..2] for
+ 3..$#masterArray;
Though, I have no idea how fast it would be. | [reply] [Watch: Dir/Any] [d/l] |
Re: sorting array of arrays reference
by Laurent_R (Canon) on Oct 25, 2013 at 17:58 UTC
|
If you need only the three top elements, then sorting the array of arrays is overkill and will be inefficient if the master array is large. Not good if speed is really important.
I would go for a selection algorithm could be done in two different ways: (1) the simpler (but slower) is to walk through the array once and memorize the array ref of the top element; then walk again through and find the next to top element, and a third time; (2) slightly more complicated is to walk through the array only once and to maintain as you go an auxiliary array of the three top elements; maintaining this auxiliary array is somewhat complicated, but nothing insurmountable.
| [reply] [Watch: Dir/Any] |
|
how is this auxiliary array going to be maintained?
| [reply] [Watch: Dir/Any] |
|
use strict;
use warnings;
use Data::Dumper;
my @masterArray = (
["this", "that", 12563, "something", "else"],
["this", "that", 10, "something", "else"],
["this", "that", 1, "something", "else"],
["this", "that", 125638, "something", "else"],
["this", "that", 300000, "something", "else"],
);
my @top3 = sort {$b->[2] <=> $a->[2]} @masterArray[0..2];
my $min_top = $top3[2][2];
for my $sub_aref (@masterArray [3..$#masterArray]) {
next if $sub_aref <= $min_top;
@top3 = (sort {$b->[2] <=> $a->[2]} @top3, $sub_aref)[0..2];
$min_top = $top3[2][2];
}
print Dumper @top3;
This yields the following result:
$ perl subdiscard.pl
$VAR1 = [
'this',
'that',
300000,
'something',
'else'
];
$VAR2 = [
'this',
'that',
125638,
'something',
'else'
];
$VAR3 = [
'this',
'that',
12563,
'something',
'else'
];
A more general solution might be like this:
use strict;
use warnings;
use Data::Dumper;
my $nb_elements = shift;
chomp $nb_elements ;
my @masterArray;
push @masterArray, ["", "", int rand (1e7), ""] for 1..$nb_elements;
# print Dumper \@masterArray;
my @top3 = sort {$b->[2] <=> $a->[2]} @masterArray[0..2];
my $min_top = $top3[2][2];
$nb_elements--;
for my $sub_aref (@masterArray [3..$nb_elements]) {
next if $sub_aref->[2] <= $min_top;
@top3 = (sort {$b->[2] <=> $a->[2]} @top3, $sub_aref)[0..2];
$min_top = $top3[2][2];
}
print Dumper \@top3;
With one million records, the execution time is about 2.5 seconds:
$ time perl subdiscard2.pl 1000000
$VAR1 = [
[
'',
'',
9999996,
''
],
[
'',
'',
9999993,
''
],
[
'',
'',
9999990,
''
]
];
real 0m2.497s
user 0m2.386s
sys 0m0.108s
Sorting the original array and taking the first 3 elements takes about 3 times longer:
$ time perl subdiscard3.pl 1000000
$VAR1 = [
[
'',
'',
9999980,
''
],
[
'',
'',
9999955,
''
],
[
'',
'',
9999944,
''
]
];
real 0m7.605s
user 0m7.518s
sys 0m0.093s
But, in fact, in the 2.5 seconds taken by the program above, most of it (more than 2.2 seconds) is used for populating the array with random values, so that the difference between the algorithm presented above and a pure sort is much larger than it appears, probably at least a factor of 10. I'll do a real benchmark later if I can find the time.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Nothing really complicated, But I just can't do it right now. No time.
| [reply] [Watch: Dir/Any] |
|
Think of this as the first three passes of a bubble sort. I like it.
| [reply] [Watch: Dir/Any] |
Re: sorting array of arrays reference
by hdb (Monsignor) on Oct 25, 2013 at 21:48 UTC
|
| [reply] [Watch: Dir/Any] |
Re: sorting array of arrays reference
by Laurent_R (Canon) on Oct 26, 2013 at 21:03 UTC
|
OK, now I have benchmarked the sort and the walk through only once solution (two variants):
use strict;
use warnings;
use Benchmark qw(timethese);
my $nb_elements = 2e6;
my @top3;
my @startArray;
push @startArray, ["", "", int rand (1e7)] for 1..$nb_elements;
sub discard_sort{
my @masterArray = @startArray;
my @top3 = sort {$b->[2] <=> $a->[2]} @masterArray[0..2];
my $min_top = $top3[2][2];
$nb_elements--;
for my $sub_aref (@masterArray [3..$nb_elements]) {
next if $sub_aref->[2] <= $min_top;
@top3 = (sort {$b->[2] <=> $a->[2]} $sub_aref, @top3)[0..2];
$min_top = $top3[2][2];
}
}
sub discard_manual{
my @masterArray = @startArray;
@top3 = sort {$b->[2] <=> $a->[2]} @masterArray[0..2];
my $min_top = $top3[2][2];
$nb_elements--;
for my $sub_aref (@masterArray [3..$nb_elements]) {
next if $sub_aref->[2] <= $min_top;
if ($sub_aref->[2] > $top3[0][2]) {
unshift @top3, $sub_aref;
pop @top3;
}
elsif ( $sub_aref->[2] > $top3[1][2]) {
$top3[2] = $top3[1];
$top3[1] = $sub_aref;
} else {
pop @top3;
push @top3, $sub_aref;
}
$min_top = $top3[2][2];
}
}
sub sort_strategy {
my @masterArray = @startArray;
my @top3 = (sort {$b->[2] <=> $a->[2]} @masterArray)[0..2];
}
timethese (5, {
'walk once with sort' => \&discard_sort,
'walk once manual insert' => \&discard_manual,
'sort' => \&sort_strategy});
The results show pretty clearly that the walk-through once solution which I originally proposed is much much faster than a sort on the full array:
Benchmark: timing 5 iterations of sort, walk once manual insert, walk
+once with sort...
sort: 60 wallclock secs (60.37 usr + 0.02 sys = 60.39 CPU) @ 0.08/s
+(n=5)
walk once manual insert: 3 wallclock secs ( 2.79 usr + 0.00 sys = 2
+.79 CPU) @ 1.79/s (n=5)
walk once with sort: 3 wallclock secs ( 2.76 usr + 0.00 sys = 2.76
+CPU) @ 1.81/s (n=5)
On a dataset of 2 million records, the walk through once solution is about 21 times faster than the solution sorting the full array. For the walk through once solution, I tried two variants: using sort to maintain the auxiliary array and doing a manual insertion sort on this auxiliary array. It turns out there is very little difference between these two solutions, but the simpler sort function solution is actually slightly better, so it was not worth the effort of trying a manual insertion sort.
The larger the dataset, the larger the advantage of the walk through once strategy: with 5 million records, the walk-through once solution is 24.6 faster than the full array sort:
Benchmark: timing 5 iterations of sort, walk once manual insert, walk
+once with sort...
sort: 170 wallclock secs (169.96 usr + 0.03 sys = 169.99 CPU) @ 0.03
+/s (n=5)
walk once manual insert: 7 wallclock secs ( 6.90 usr + 0.03 sys = 6
+.93 CPU) @ 0.72/s (n=5)
walk once with sort: 7 wallclock secs ( 6.86 usr + 0.00 sys = 6.86
+CPU) @ 0.73/s (n=5)
Note that, in order to produce realistic benchmark figures for the sort solution, I had to make a copy of the original array, so that the sort would not work on an already sorted array. Part of the recorded durations is therefore wasted making a copy of the array. If we were to subtract this part from the recorded duration, the advantage of the walk through once solution would be even larger. The copying of the 5-million record array takes about 2 seconds, so that the corrected figures would be 168 sec. for the sort, and 5 sec. for the walk through once solutions, meaning that the later is actually about 33 times faster than the sort.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
hats off Laurent. I really thank you all for your replies and time. laurent, i ll suggest my supervisor to send you a check or bitcoins :) <3
| [reply] [Watch: Dir/Any] |
Re: sorting array of arrays reference
by Anonymous Monk on Oct 25, 2013 at 23:15 UTC
|
my @sort;
$sort[$_->[2]] = $_ for @masterArray;
my @top = map {delete $sort[-1]} 1..3;
say Dumper \@top;
Output:
$VAR1 = [
[
'this',
'that',
300000,
'something',
'else'
],
[
'this',
'that',
125638,
'something',
'else'
],
[
'this',
'that',
12563,
'something',
'else'
]
];
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
|