Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Get top N rows in each group with DBIx::Class

by 1nickt (Canon)
on Nov 25, 2017 at 15:23 UTC ( [id://1204242]=perlquestion: print w/replies, xml ) Need Help??

1nickt has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am seeking wisdom from those experienced in the deep magic of DBIx::Class. What I need to do is return a subset of rows from an SQL query, being the top N results in each group.

This problem is actually part of adding a feature to the CPAN Testers API, but for the sake of argument, let's use the standard DBIC metaphor of a music collection. Having started with all CDs, and filtered for musical genre, I want to receive only the most recent N albums by each artist in the result set.

In SQL I can do this with row_number(), or, for MySQL, a workaround with user variables. Something like:

set @artist = ''; set @num = 1; select * from ( select c.name, c.artist, @num := if(@artist = c.artist, @num + 1, 1) as row_number, @artist := s.artist as dummy from cds as c ) as results where results.row_number <= 5;

With DBIC I have already produced a result set spec (but not queried the DB yet) with something like:

my $rs = $schema->resultset('cds'); $rs = $rs->search({ genre => 'rock' });
... and I now would like to do something like:
$rs = $rs->limit_per_artist( 5 );
... to retrieve the most recent 5 CDs for each rocker.

I can see a couple of different ways to do it that involve multiple passes over the table(s), but I really need a single-pass solution like the SQL shown. Any pointers much appreciated.


The way forward always starts with a minimal test.

Replies are listed 'Best First'.
Re: Get top N rows in each group with DBIx::Class
by NetWallah (Canon) on Nov 26, 2017 at 01:35 UTC
    Try using the 'rows' attribute, like this:
    my $rs = $schema->resultset('Artist')->search( undef, { rows => 10, } );
    The Documentation says:

    rows
    Value: $rows

    Specifies the maximum number of rows for direct retrieval or the number of rows per page if the page attribute or method is used.

                    All power corrupts, but we need electricity.

      Hi NetWallah, thanks for your reply. If I understand correctly rows basically implements a LIMIT on the query, which isn't quite what I need here. I'm looking for a limit per artist, or, maybe a better example, I want to query for all tracks in the library but limit the results returned to N per CD.

      On a related note, can you say when it's best to use the rows attribute on a search() call, and when it's best to use a slice() call instead?


      The way forward always starts with a minimal test.
Re: Get top N rows in each group with DBIx::Class
by shadowsong (Pilgrim) on Nov 26, 2017 at 18:30 UTC

    Hi 1nickt,

    Until one of the of the other monks finds us a more elegant solution – try Arbitrary SQL through a custom ResultSource. See e.g. below

    My::Schema

    package My::Schema; use base qw ( DBIx::Class::Schema ); # load My::Schema::Result[Set]::* classes __PACKAGE__->load_namespaces(); 1;

    My::Schema::Result::CD

    package My::Schema::Result::CD; use strict; use warnings; use base qw ( DBIx::Class::Core ); __PACKAGE__->table_class( 'DBIx::Class::ResultSource::View' ); # specify components __PACKAGE__->load_components( qw/ Ordered / ); __PACKAGE__->position_column( 'title' ); # specify table for Result class __PACKAGE__->table( 'cd_view' ); # add columns to 'this' class __PACKAGE__->add_columns( qw/ RN cdid artistid title year / ); # specify primary key(s) __PACKAGE__->set_primary_key( qw/ cdid artistid/ ); # # specify relationships etc.. # # do not attempt to deploy() this view __PACKAGE__->result_source_instance->is_virtual(1); __PACKAGE__->result_source_instance ->view_definition(q[ SELECT ROW_NUMBER() OVER (PARTITION BY artistid ORDER BY cdid) AS RN ,cdid ,artistid ,title ,year FROM cd ]); 1;

    My::Schema::ResultSet::CD

    package My::Schema::ResultSet::CD; use strict; use warnings; use base qw ( DBIx::Class::ResultSet ); 1;

    Demo
    Notice the { 'RN' => { '<=', 2 } } parameter in the search call below? You would change this to 5…

    #!perl -sl use lib qw ( c:\code\perl ); use strict; use warnings; use My::Schema; use constant { DBI_DSN => 'dbi:ODBC:driver={SQL Server};Server=(local);database +=example;Trusted Connection=yes', DBI_USER => undef, DBI_PASS => undef, }; my $eg_schema = My::Schema->connect(DBI_DSN,DBI_USER,DBI_PASS, { RaiseError=>1, AutoCommit=> 1 }); my $cd_rs = $eg_schema->resultset('CD') ->search({ 'RN' => { '<=', 2 } }); print $_->RN.",".$_->cdid.",".$_->title while $_ = $cd_rs->next; __END__

    It'll get the job done but I wasn't able to get too creative and start using CTEs and such; your mileage may vary...

    Cheers,
    Shadowsong

      Hi Shadowsong, thanks very much for your reply and demo. I had seen that part of the doc and it looks promising, but the problem I am facing is that this filter ('N per "group"') must be applied after some other search calls on the RS.

      Continuing the metaphor, this would be something like:

      1. Restrict Artists to given musical genre
      2. Restrict Albums to those release since given date
      3. Restrict results to maximum N Albums per Artist
      In my app I have the first two steps handled with methods in My::Schema::ResultSet::CD:
      sub by_genre { $_[0]->search({ 'me.genre' => $_[1] }); sub since { $_[0]->search({ 'me.released' => { '>=', $_[1] } });
      So the consuming code can do:
      my $rs = $schema->resultset('CD'); if ( my $genre = param->{'genre'} ) { $rs = $rs->by_genre( $genre ); } if ( my $since = param->{'since'} ) { $rs = $rs->since( $since ); }
      I need to add a method that can act upon the existing ResultSet,
      if ( my $limit = param->{'limit'} ) { $rs = $rs->limit_per_artist( $limit ); }
      It's not clear to me that the doc you linked to and demoed can be adapted to search an existing RS, but I'll keep experimenting. Thanks again.


      The way forward always starts with a minimal test.
Re: Get top N rows in each group with DBIx::Class
by chacham (Prior) on Nov 27, 2017 at 12:40 UTC

    I'm not understanding what you are trying to do. Why not use SQL? If you do not want to use a fancy SQL statement in your perl, why not create a view, and just query that instead. BTW, another way to do it, is with UNION ALL and subqueries. That is (simplified):

    select group, (select member where order = 1) union all select group, (select member where order = 2) union all ...

    For only 5 of each, this is a pretty simple way that should work in any database. It can be done without UNION ALL by using sub-queries for your tables, and so on.

      Hi chacham, thank you for your reply.

      I'm not understanding what you are trying to do

      I'm trying to query a ResultSet (i.e. a table) that contains many rows for each "group":

      id | artist | album -------------------- 01 | AA | aa 02 | AA | ab 03 | AA | ac 04 | BB | ba 05 | BB | bb 06 | BB | bc 07 | CC | ca 08 | DD | da 09 | DD | db --------------------
      I want to be able to pass a value for "limit_per_artist" and get as a new ResultSet, e.g. with value "2":
      id | artist | album -------------------- 01 | AA | aa 02 | AA | ab 04 | BB | ba 05 | BB | bb 07 | CC | ca 08 | DD | da 09 | DD | db --------------------

      This is needed because in the real DB there are up to hundreds of rows per "artist" and therefore the vast majority of rows are not wanted.

      Why not use SQL?

      The application (CPAN Testers API) is already well established as a set of DBIC classes, and the preference is to stick to that if possible. I showed a MySQL solution in my OP, and Shadowsong showed another SQL technique in his/her reply.

      why not create a view, and just query that instead?

      That seems to be the most promising path, although as I explained, this query is to take place on a RS after a number of other queries. So I think I would have to create a view for each scenario: i.e. one view for the case where the user wants the RS to be limited to a certain Artist, another when the user wants the RS limited to a certain Genre, etc. (If it were me I would implement all that directly in SQL::Abstract, but as I said this project is built in DBIC.)

      another way to do it, is with UNION ALL and subqueries

      I have seen this technique, but I don't believe it scales to a large number of "groups", and also, it's unknown what the "groups" are and I am trying to avoid doing multiple passes over the table. This seems like it would be a good way to go, constructing the UNION query in Perl, if it was acceptable to first get a list of "groups" with select distinct name from artists or similar.

      I'm coming to think that this situation is an excellent example of the Corion Doctrine ("ORMs are great until you need to do something complicated with them, then they are just an obstacle "), and that I'll have to implement something "clunky" like multiple views, or a multiple-pass solution. It doesn't seem that DBIC is going to be able handle it (and there has been resounding silence on the IRC channel where I've asked for help twice so far).

      I'd be glad to find I've missed something or misunderstood something, please let me know if so. Thanks!


      The way forward always starts with a minimal test.

        I think I would have to create a view for each scenario: i.e. one view for the case where the user wants the RS to be limited to a certain Artist, another when the user wants the RS limited to a certain Genre, etc.

        Being a view is not generated, there is no problem in writing a complex view that would have some insane query plan if run without a where clause, and then querying for each small case needed, resulting in a very quick runtime.

        If the groups are small (artist, genre, etc), just create one view to order them all complete with row number, a column to identify which group it is, and perhaps what ordering is being used. Then, querying it for a specific group, ordering, or limit is simple.

        select Lucutus from Borg where grouping in ('artist', 'genre') and ordering = 'popularity' and row_number <= 5;

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1204242]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (2)
As of 2024-04-24 23:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found