Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

RFC: Array::GroupBy

by kyle (Abbot)
on Jan 06, 2009 at 03:19 UTC ( [id://734334]=perlmeditation: print w/replies, xml ) Need Help??

After reading of bradcathey's plight in Ways to group elements of an AoH, I thought this task might be worth wrapping up in a CPAN module, so I wrote Array::GroupBy. I present it here for the consideration of the monks so that I may grow from your wisdom, creativity, and tomfoolery.

package Array::GroupBy; $VERSION = '0.0.1'; use strict; use warnings; use Carp; use List::Util qw( min max sum ); use Moose; use Moose::Util::TypeConstraints; # These get old value first, then added value my %code_for; { no warnings 'uninitialized'; %code_for = ( sum => { itemproc => sub { shift() + shift() } }, min => { itemproc => sub { defined $_[0] ? min( shift, shift ) : $_ +[1] } }, max => { itemproc => sub { defined $_[0] ? max( shift, shift ) : $_ +[1] } }, avg => { listproc => sub { sum(@_) / scalar @_ } }, count => { itemproc => sub { 1 + shift } }, ); } subtype 'Array::GroupBy::TC::GrouperHow' => as 'HashRef' => where { 0 == grep { !defined || !( # Allow a bare code reference ref $_ eq 'CODE' # allow a hash ref with listproc/itemproc => code ref || ( ref $_ eq ref {} && ( ref $_->{listproc} eq 'CODE' || ref $_->{itemproc} eq 'CODE' ) ) # Allow the name of an item in %code_for || ( ref $_ eq q{} && exists $code_for{ lc $_ } ) ) } values %{$_}; }; has 'group_by' => ( is => 'rw', isa => 'ArrayRef', required => 1, trigger => sub { croak 'empty group_by' if ! @{ $_[1] } }, ); has 'group_how' => ( is => 'rw', isa => 'Array::GroupBy::TC::GrouperHow', trigger => \&_normalize_methods, ); # This is called when group_how is set. # It turns every value into a hash ref with listproc/itemproc sub _normalize_methods { my ( $self, $group_spec ) = @_; foreach my $method ( values %{$group_spec} ) { if ( ref $method eq q{} && exists $code_for{ lc $method } ) { $method = $code_for{ lc $method }; } elsif ( ref $method eq 'CODE' ) { $method = { itemproc => $method }; } } return; } sub group { my $self = shift; my @group_keys = @{ $self->group_by() }; # This is a H(oH)* where each level is a key from group_by # e.g., When group_by = [ 'first', 'last' ] # $grouped{'fred'}{'flintstone'} is a reference to a record with # those keys' aggregates my %grouped; # These are the output records. They're also in the leaves of %gr +ouped my @out; foreach my $item (@_) { # walk out to the right leaf for this record my $leaf = \%grouped; foreach my $gk (@group_keys) { $leaf = ( $leaf->{$item->{$gk}} ||= {} ); } # if this is a new key set, put the new agg. record in @out if ( !keys %{$leaf} ) { push @out, $leaf; } $self->_summarize( $leaf, $item ); } # Apply each list processor to their respective fields my %group_how = %{ $self->group_how() }; FIELD: foreach my $field ( keys %group_how ) { my $listproc = $group_how{$field}->{listproc}; next FIELD if !$listproc; foreach my $group (@out) { $group->{$field} = $listproc->( @{ $group->{$field} || [] +} ); } } return @out; } sub _summarize { my ( $self, $leaf, $item ) = @_; my %group_how = %{ $self->group_how() }; my %group_by = map { $_ => 1 } @{ $self->group_by() }; # fields in either the time or the %group_how my @fields = keys %{{ map { $_ => 1 } keys %{$item}, keys %group_h +ow }}; foreach my $field ( @fields ) { if ( exists $group_by{$field} ) { $leaf->{$field} ||= $item->{$field}; } elsif ( !exists $group_how{$field} ) { $leaf->{$field}++; } elsif ( exists $group_how{$field}->{itemproc} ) { my $proc = $group_how{$field}->{itemproc}; $leaf->{$field} = $proc->( $leaf->{$field}, $item->{$field +} ); } elsif ( exists $group_how{$field}->{listproc} ) { push @{ $leaf->{$field} }, $item->{$field}; } else { croak 'programmer error: impossible condition'; } } return; } 1; # Magic true value required at end of module __END__ =head1 NAME Array::GroupBy - Group an array of records by some key (as with SQL GR +OUP BY) =head1 VERSION This document describes Array::GroupBy version 0.0.1 =head1 SYNOPSIS use Array::GroupBy; my $grouper = Array::GroupBy->new( group_by => [ 'key1', 'key2' ], group_how => { field1 => { listproc => \&list_aggregator }, field2 => 'sum', field3 => \&item_aggregator, field4 => { itemproc => \&item_aggregator }, }, ); my @grouped = $grouper->group( @AoH ); =head1 DESCRIPTION Use Array::GroupBy to turn a list of records into records that have th +eir values aggregated according to some function, grouped by some key. This is meant to be similar to how the "GROUP BY" feature in SQL operates. This documentation assumes some familiarity with that feature. =head1 METHODS =head2 group This is an instance method that takes a list of hashes and returns a list of hashes with the field data grouped according to the configurat +ion in the object's attributes. =head2 group_how This is an accessor for the grouping definition. With no arguments, this returns the grouping definition (a hash refere +nce). With a hash reference argument, this sets the grouping definition. B<Note> that the definition you set is normalized, so it may not come +back out the same way it went in. =head3 Format The format of C<group_how> is a hash where each key is a field name fr +om the input and the keys values define how that field is to be aggregated. The aggregation value can be one of: =over =item A code reference This will be called for each record in the input. It's given two arguments, the old value of the field, and the value to + be added into the aggregate. The return value will be stored as the aggregate value. The aggregator should be ready to get an undef value for either argume +nt. The first call will always have undef as the first argument. The seco +nd argument could be undef if that's the value of some field in the input +. As an example, this would select the longest field value: sub { no warnings 'uninitialized'; if ( defined $_[0] && length $_[0] > length $_[1] ) { return $_[0]; } else { return $_[1]; } } =item The name of a prewritten aggregator These aggregators are already written and can be referenced by name. =over =item B<sum> Sum of the input values. =item B<min> Minimum numeric value of the field. =item B<max> Maximum numeric value of the field. =item B<avg> The average (arithmetic mean) of input values. =item B<count> The number of records with the group_by key fields. =back =item A hash reference with a code reference This aggregator definition looks like this: { listproc => \&code_reference } This is to allow an aggregator that runs over the whole list of field +values found rather than aggregating them one at a time. In this case, Array::GroupBy will build up an array of all the field values and pass them to the aggregator for each group found after scanning the input r +ecords. You can also specify an item-by-item processor this way. { itemproc => \&item_processor } =back =head3 Defaults The default behavior for any field that's not in the C<group_by> attri +bute and does not have any aggregator defined is 'count'. Every aggregator is called for every record. If you define an aggrega +tor for a field that does not appear in any record, it will get C<undef> values on every call, and the field I<will> appear in the output recor +ds. This can be a convenient way of getting a "count(*)" field. my $grouper = Array::GroupBy->new( group_by => [ 'name' ], group_how => { counter => 'count' } ); That aggregator would work similarly to this SQL query: SELECT name,count(*) AS counter FROM t GROUP BY name; =head2 group_by This is an accessor for the list of keys used for grouping. With no arguments, this returns a reference to an array of keys. Call it with a reference to an array of strings to set the attribute. =head2 meta This is a method added by Moose which provides access to the current class's metaclass. =head1 DIAGNOSTICS =over =item empty group_by This is the error you get if you try to set the group_by attribute to an empty array. =item Attribute (I<x>) does not pass the type constraint This error is generated by Moose, so it may vary with different versio +ns of Moose. This is the error you get when you try to set attribute I<x +> to a value that's not allowed by the type constraints on it. =back =head1 DEPENDENCIES =over =item Moose I used Moose 0.17 when writing this. I expect any later version shoul +d work. =back Every other module I used is a core module. =head1 BUGS AND LIMITATIONS No bugs have been reported. Please report any bugs or feature requests to C<bug-array-groupby@rt.cpan.org>, or through the web interface at L<http://rt.cpan.org>. =over =item There's no way to aggregate one field multiple ways For example, if you want the min I<and> max of some field, the interfa +ce doesn't allow a simple way to get them both. =item If you catch the exception thrown by a mutator, the object may b +e in an odd state. I'd expect attributes to hold their previous values when setting fails + this way, but apparently they don't. This behavior is dependent on Moose, +so maybe it's different in later versions. =back =head1 AUTHOR Kyle Hasselbacher C<< <kyleha@gmail.com> >> =head1 LICENCE AND COPYRIGHT Copyright (c) 2009, Kyle Hasselbacher C<< <kyleha@gmail.com> >>. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See L<perlartistic>. =head1 DISCLAIMER OF WARRANTY BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WH +EN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. TH +E ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Things that particularly concern me:

  • The name. Having written Array::GroupBy, I think it should be called Array::Grouper. Then again, maybe "array" isn't right since it operates more specifically on an AoH, and maybe the monks have an even better idea.
  • The documentation. In my brief search for a way to describe what GROUP BY does, one of the better "descriptions" said it's hard to describe and it's best to learn by example. I punted on that, and when writing the docs I often got the feeling I wasn't describing things very well. Any direction here would be appreciated.
  • My Moose use. This is the first thing I've done with Moose besides play.

Thank you for your thoughts!

Replies are listed 'Best First'.
Re: RFC: Array::GroupBy
by jdporter (Paladin) on Jan 06, 2009 at 04:37 UTC

      Can you show an example of how you'd use those modules to do what mine does? From the documentation I'm looking at, it appears DBD::Sponge is used for making some data appear to be from some SQL, not for doing any kind of transformation.

      use DBI; my $sponge = DBI->connect( 'dbi:Sponge:', '', '', { RaiseError => 1 } +); my $sth = $sponge->prepare( 'SELECT name,count(*) FROM ??? GROUP BY na +me', { NAME => [ 'name', 'count' ], rows => [ [ 'fred ', 10 ], [ 'john ', 20 ], [ 'jdporter', 30 ], [ 'jdporter', 40 ], ] } ); while ( my @row = $sth->fetchrow_array ) { print join ', ', @row; print "\n"; } __END__ fred , 10 john , 20 jdporter, 30 jdporter, 40

        You're right, my bad. Despite allegedly supporting SQL::Statement::Syntax, in fact DBD::Sponge essentially ignores the statement and does a plain ol' SELECT *.... and the only use I can envision for that is as a bridge between your perl data and some black-box module which requires a statement handle.

        The documentation of DBD::Sponge's prepare() is pretty unclear on this issue:

        The $statement here is an arbitrary statement or name you want to provide as identity of your data. If you're using DBI::Profile it will appear in the profile data.

        Generally it's expected that you are preparing a statement handle as if a select statement happened.
        huh?

        Anyway, sorry for the red herring.

        Update: I checked into a couple of DBD modules which allow to access in-memory perl data structures as sources for DBIDBD::AnyData and DBD::RAM (both, like SQL::Statement and other parts of the DBIverse, by jZed). Alas, neither supports GROUP BY!

        Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.
Re: RFC: Array::GroupBy (Synopsis)
by lodin (Hermit) on Jan 06, 2009 at 10:58 UTC

    It looks useful, but after glancing through your module I still don't know how to use it. I don't know how SQL GROUP BY works, and you say that's assumed, so I can't really complain. :-)

    However, I think it's a good idea to have a full example in the synopsis. With full example I mean that it should compile and run, and some sensible output/results should also be included when applicable. Usually I get a good idea of what the code does and how I can use it by just looking at the input/transform/output triplet.

    This is particularly important for this kind of module that transforms one data structure to another. You show the transform, but you don't show what goes in and what comes out. What is @AOH, and how does "key1" and "field1" relate to @AOH?

    While the synopsis should just highlight what the module does, I often appreciate one or two slightly more elaborate (and possibly realistic) examples of the code at the end (under e.g. =head1 EXAMPLES).

    Returning to including output in the synopsis; I often use __END__ to separate the code from the output (see examples at CPAN). I've written a test module for that which I've pondered for release. Basically, the synopsis becomes a part of the test suite. This makes sure that there are no typos in the synopsis and that it is up to date. If anyone shows any interest I might write up an RFC.

    lodin

      I think an examples section is a good idea, but I wouldn't want a full working example in the synopsis. The data structures it's working on tend to be large on the page, and I want to keep the synopsis short. For example, here's the shortest of the functional test cases I have so far:

Re: RFC: Array::GroupBy
by grinder (Bishop) on Jan 06, 2009 at 13:39 UTC

    Interesting. What does using Moose buy you in this particular case? Or to put it another way, how much make-work code would you have to add if you didn't use it?

    This is more than an idle question, because if I really needed your module for the very specific problem it solves, I think I'd be a bit upset if I discovered that it pulled in a comparatively huge OO framework. This is in no way meant as disrepect to Moose. It's more a question of ratios: the Moose tarball is 250Kb, not counting its twenty or so dependencies. Sledgehammer, meet egg.

    Would it take twenty lines to implement this as a dependency-free module? A hundred lines? Would it be painful? I guess it comes down to whether one considers Moose as a reasonable building block for small, special-purpose modules.

    • another intruder with the mooring in the heart of the Perl

      Sledgehammer, meet egg.

      Ok.... But how many more such eggs will you need to see before you decide that this is one sledgehammer worth having in your toolbox?

        Why, a dozen of course!

        less flippant update: whether I think Moose is worthy of my toolbox (and I do) is beside the point. The question is whether you can yet decide that it is for everyone else.

        • another intruder with the mooring in the heart of the Perl

      I'll start by saying that I expect Moose to be ubiquitous. If it's not now, it will be. Also, anyone installing something as obscure as Array::GroupBy is sure to have something as mainstream as Moose.

      Anyway, what I got in this particular case was mostly Moose::Util::TypeConstraints. At object creation, and any time an attribute is set, the attributes are checked against the constraints I set. If I weren't using Moose, I'd be using Params::Validate. Aside from that, I get the usual OO framework stuff: a constructor and attribute accessors.

      The make-work code I'm saved from writing is, again, mostly the input validation stuff. My version of Moose is a ways behind the current, so I wound up doing some of this myself in order to get the detail I wanted. I found it pretty tedious, and the results challenge comprehension. With a newer version of Moose, this was easier to do and easier to read. Without Moose (or some module), it could have been more work than writing the actual functionality of the module.

        You know, I'm happy to point Moose out to non-Perl hackers, it usually shuts them up about how much of a mess Perl's OO is. They tend to glance at an example and say "Oh this is nice! This actually makes sense". On the other hand "anyone installing something as obscure as Array::GroupBy is sure to have something as mainstream as Moose" sounds a lot like wishful thinking.

        If you do release this to CPAN, I think Params::Validate as a prerequisite is a much more viable proposition. The fact that there's a very useful analog over in Moose land makes me think of someone saying "Hmm. I need to put my cup down somewhere. I know, I'll go out to my car and use the cup-holder."

        • another intruder with the mooring in the heart of the Perl

        I'll start by saying that I expect Moose to be ubiquitous. If it's not now, it will be.
        Moose has a huge loading time that makes it unsuitable for many tasks. On my (somewhat old) computer...
        $ time perl -e 'package foo; use Moose' real 0m1.052s user 0m0.880s sys 0m0.044s $ time perl -e 'package foo' real 0m0.008s user 0m0.000s sys 0m0.008s
        I see no problem using Moose inside big long lived applications or frameworks like Catalyst and alike. But using it in a generic CPAN module unnecessarily limits where that module can be used.

        BTW...

        $ time perl -e 'package foo; use Params::Validate' real 0m0.066s user 0m0.028s sys 0m0.008s
Re: RFC: Array::GroupBy
by Jenda (Abbot) on Jan 06, 2009 at 16:26 UTC

    I do not see a reason to disallow empty group_by.

    What about passing the whole records to the custom (code reference) agregators for some specially named group_how options? Eg.

    ... group_how => { '&foo' => sub { my ( $result, $rec) = @_; return $result + $rec->{a} * $rec->{b} }, ...
    producing a new attribute named 'foo' in the resulting records.

Re: RFC: Array::GroupBy
by jdporter (Paladin) on Jan 07, 2009 at 16:14 UTC

    Not to belittle your efforts in any way, but IMHO your solution is far too complicated and hard to comprehend, and I believe this is because it tries to do too much. IMHO, the grouping function should be separate from any processing/analysis of the groups. If you simply return the groups to the caller, she can do whatever processing on them in her own way.

    Here is how I would probably approach the problem:

    sub find_groups(&\@) { my( $keymap, $rows_ar ) = @_; my %groups; for ( @$rows_ar ) { push @{ $groups{ &$keymap } }, $_; } \%groups }
    Example: Show a count(*) of distinct values of the 'name' column:
    my $groups_hr = find_groups { $_->{'name'} } @rows; for my $k ( sort keys %$groups_hr ) { print "$k\t" . @{$groups_hr->{$k}} . "\n"; }
    Example: Group by a key consisting of multiple columns concatenated:
    find_groups { join $;, @{$_}{qw( name game score )} } @rows;
    Example: Group by a key derived from the row by an arbitrarily complex function:
    find_groups { $_->{'name'} eq 'foo' ? 1 : $_->{'game'} =~ /bar/ ? 2 : $_->{'score'} >= 500 ? 3 : 0 } @rows;
    Example: Find numeric mean within each group of the values in the 'score' field:
    use List::Util qw( sum ); my $groups_hr = find_groups { $_->{'name'} } @rows; for my $k ( sort keys %$groups_hr ) { my $avg = sum( map { $_->{'score'} } @{$groups_hr->{$k}} ) / @{$gr +oups_hr->{$k}}; print "$k\t$avg\n"; }
    Sure, you could argue that
    map { $_->{'score'} } @rows
    is an ugly way of extracting a column from a AoH table... but at least it's perlish. You could easily wrap that in a function if you wanted to.

    Between the mind which plans and the hands which build, there must be a mediator... and this mediator must be the heart.

      Not being perlish is surely the whole point! This module seems to be aimed at programmers who are specifically looking for SQLish behaviour. Which is exactly what it provides.

      Related to which, I'd say definitely keep the GroupBy name: it serves as a nice warning to SQL-haters, and a nice beacon to SQL-lovers, as to the type of solution this module offers. :)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://734334]
Approved by planetscape
Front-paged by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (2)
As of 2024-10-07 16:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    The PerlMonks site front end has:





    Results (44 votes). Check out past polls.

    Notices?
    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.