The Perl equivalent to a GROUP BY clause in SQL is a hash. You use the columns listed in the GROUP BY ... clause as the keys of the hash, and accumulate the aggregated value(s) of the rows in the values of the hash.
A rough Perl implementation for a single aggregate (sum) could look like this:
use List::MoreUtils qw(zip); # I'm lazy

sub sum_group_by {
    my ($key_columns, $aggregate_columns, $rows) = @_;
    my %result;
    for my $input_row (@$rows) {
        # Create the synthetic key, assuming that no value will contain a \0:
        my @key_values = @{ $input_row }{ @$key_columns };
        my $key = join "\0", @key_values;
        # If this is a new key, create a new result row for it that contains the key values:
        $result{ $key } ||= { zip @$key_columns, @key_values };
        for my $aggregate (@$aggregate_columns) {
            $result{ $key }->{ $aggregate } += $input_row->{ $aggregate };
        }
    }
    # Return the result as new rows, ordered asciibetically (because I'm lazy) by their key:
    map { $result{ $_ } } sort keys %result;
}
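Purely as an illustration, calling the function could look like this (the column names and data are made up):

my @rows = (
    { dept => 'toys',  region => 'north', sales => 100 },
    { dept => 'toys',  region => 'north', sales =>  50 },
    { dept => 'books', region => 'south', sales =>  75 },
);

# Equivalent to: SELECT dept, region, SUM(sales) ... GROUP BY dept, region
my @grouped = sum_group_by( [ 'dept', 'region' ], [ 'sales' ], \@rows );

for my $row (@grouped) {
    print join( "\t", @{ $row }{ qw(dept region sales) } ), "\n";
}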
If you can arrange for all your operations to return lists sorted by their keys, or can allow an intermediate step that (pre)sorts all inputs by their keys, you can make the aggregation much simpler: aggregate into a single value and output a new row whenever the key (by which you sorted) changes. I have seen no paper or analysis that measures which is better, pre-sorting the input followed by pipelined aggregation, or aggregation into a hash followed by (potentially sorted) output.
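A minimal sketch of that pipelined variant, assuming the rows already arrive sorted by the key columns, could look like this:

# Assumes @$rows is already sorted by the key columns.
sub sum_group_by_sorted {
    my ($key_columns, $aggregate_columns, $rows) = @_;
    my @output;
    my ($current_key, $current_row);
    for my $input_row (@$rows) {
        my $key = join "\0", @{ $input_row }{ @$key_columns };
        # When the key changes, emit the finished group and start a new one:
        if ( !defined $current_key or $key ne $current_key ) {
            push @output, $current_row if $current_row;
            $current_key = $key;
            $current_row = { map { $_ => $input_row->{ $_ } } @$key_columns };
        }
        $current_row->{ $_ } += $input_row->{ $_ } for @$aggregate_columns;
    }
    push @output, $current_row if $current_row;
    return @output;
}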
Update: Also consider just using SQL through either DBD::SQLite or DBD::AnyData as a way to manipulate your data that is very efficient for you to write, and likely better tested than a hand-rolled aggregation.
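A minimal sketch with DBD::SQLite and an in-memory database, reusing the made-up @rows from the example above, might be:

use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do( 'CREATE TABLE sales ( dept TEXT, region TEXT, sales REAL )' );

my $insert = $dbh->prepare( 'INSERT INTO sales ( dept, region, sales ) VALUES ( ?, ?, ? )' );
$insert->execute( @{ $_ }{ qw(dept region sales) } ) for @rows;

# Let SQL do the grouping and the sorting:
my $grouped = $dbh->selectall_arrayref(
    'SELECT dept, region, SUM(sales) AS sales FROM sales GROUP BY dept, region ORDER BY dept, region',
    { Slice => {} },
);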