Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Too Many IDs

by The_Dj (Sexton)
on Jan 09, 2020 at 06:39 UTC ( #11111226=perlquestion: print w/replies, xml ) Need Help??

The_Dj has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Monks.

For reasons probably best described as DBI, I have data like this:

%dat_by_id = ( 1=>{ id=>1, sn=>'a', more=>'foo' }, 2=>{ id=>2, sn=>'b', more=>'bar' } );

You get the idea.

Now, for reasons best described as argh! I also do:

%dat_by_sn=(map { $_->{sn} => $_ } values %dat_by_id);

Sadly I do need to reference entries be either field throughout the rest of the code.

Is there a better way? (My dataset runs to a few million records. Big but not too big.)

I don't (currently) need a 3rd,4th,...nth field, but it's only a matter of tiem untill someone adds a UID field, or something else to compicate my life

Replies are listed 'Best First'.
Re: Too Many IDs
by kcott (Bishop) on Jan 09, 2020 at 08:21 UTC

    G'day The_Dj,

    Although not stated, I'm assuming all id, sn, etc. values are unique. If that's not the case, neither your current solution nor my alternative suggestion will work properly.

    Instead of recreating the entire hash multiple times, consider just having a single hash with all the data and then simple mappings of sn to id (extending for future requirements).

    Here's a quick example:

    #!/usr/bin/env perl use strict; use warnings; my %dat_by_id = ( 1 => {id=> 1, sn => 'a', more => 'foo'}, 2 => {id=> 2, sn => 'b', more => 'bar'}, ); my %map_sn_to_id = map +($_->{sn} => $_->{id}), values %dat_by_id; print "SN for ID[1]: $dat_by_id{1}{sn}\n"; print "ID for SN[b]: $dat_by_id{$map_sn_to_id{b}}{id}\n"; print "MORE for SN[a]: $dat_by_id{$map_sn_to_id{a}}{more}\n"; # Subsequent requirements, e.g. my %map_more_to_id = map +($_->{more} => $_->{id}), values %dat_by_id; print "ID for MORE[foo]: $dat_by_id{$map_more_to_id{foo}}{id}\n"; print "SN for MORE[bar]: $dat_by_id{$map_more_to_id{bar}}{sn}\n";

    Output:

    SN for ID[1]: a ID for SN[b]: 2 MORE for SN[a]: foo ID for MORE[foo]: 1 SN for MORE[bar]: b

    Having a single data source will reduce the chances of errors and should make maintenance and debugging (if necessary) easier.

    I see you've used "map BLOCK LIST" and I'm aware that's considered a Best Practice; however, "map EXPR, LIST" is faster and may make a difference, especially when you're dealing with millions of data elements. Use Benchmark to test. See map for more on these two forms as well as an explanation of the unary plus, "map +(...", I used (if you're unfamiliar with that syntax).

    I've only shown a barebones technique. For production usage, I'd suggest setting up a series of functions, e.g. get_id_for_sn($sn), instead of having to continually hard-code an equivalent $dat_by_id{$map_sn_to_id{$sn}}{id}.

    — Ken

      Hi, Ken.

      Yeah, I should have mentioned that both sn and id are* unique keys.

      And thanks for the benchmark pointers. I'll keep this in mind when I get to optimizing.

      Sadly I tend to loose most of my runtime to DB calls and some 3rd party .exe's, but I will give map a close scrutiny.

      * Except one isn't really, but is treated as such for other reasons. My job can be 'fun'

        If parts of your application (DB, EXEs, etc.) are taking seconds or milliseconds to run, spending time optimising map to save a few micro- or nanoseconds is unlikely to be worth the effort.

        Another consideration is whether this is a short-lived application that's run multiple times or a long-lived application that's run with multiple iterations.

        There are some JIT (just-in-time) possibilities that may be worth consideration. If you end up with a lot of get_X_for_Y() subs, and some are only called infrequently, you can create mappings just when they're needed; something like this using state (which requires v5.10):

        { my %dat_by_id = ...; sub get_X_for_Y { my ($Y) = @_; state $map_Y_to_id = { map +($_->{Y} => $_->{id}), values %dat_by_id }; return $dat_by_id{$map_Y_to_id->{$Y}}{X}; } sub get_X_for_Z { ... } sub get_V_for_W { ... } ... }

        Note that the anonymous block gives %dat_by_id lexical scope such that it is only visible to the subs; while the subs themselves are visible to, and accessible from, the entire script. This prevents inadvertent changes to %dat_by_id which could introduce bugs which are hard to track down.

        Also note that the placement of the above code within your script would be important. The assignment to %dat_by_id should occur before any calls to get_?_for_?() are made. An alternative to this would be to use some combination of BEGIN, INIT, etc. blocks; I don't know enough about your code to comment further on this; see perlmod: BEGIN, UNITCHECK, CHECK, INIT and END for more about these. There would, no doubt, be other ways to handle this but, as previously stated, I don't know enough about your code to offer further advice.

        — Ken

Re: Too Many IDs
by choroba (Archbishop) on Jan 09, 2020 at 08:08 UTC
    Update: This answer isn't really helpful, sorry. Your solutions stores the same references under the sn's, so it should occupy memory comparable to my solution.

    If your concern is memory, you can store just the "sn to id" map instead of storing all the data twice.

    my %sn_to_id = map { $dat_by_id{$_}{sn} => $_ } keys %dat_by_id;

    You then need to nest the references to get the value:

    my $sn = 'b'; my $more = $dat_by_id{ $sn_to_id{$sn} }{more};

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      Where is the benefit to the OP's code, where the values are just the identical hash refs and no flat copies?

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: Too Many IDs
by LanX (Archbishop) on Jan 09, 2020 at 11:34 UTC
    > Is there a better way?

    Depends, there are certainly cases where it was certainly better to keep the raw table in the DB and you query ID or SN individually via SQL.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      "Depends" is about as good of an answer as I could think of, too. (Whenever someone uses a comparative like "better", I usually insist on "better, for what?" if I'm going to spend time answering, and start thinking "is this an X-Y question?", etc.) I would say that I don't usually use selectall_hashref like that. Often I'd go with selectall_arrayref($sql, {Slice => {}}, @bind) so you iterate over the hrefs directly to build the aggregations (hashes). And end up, like you say, splitting things up and doing subqueries. Depends..

        Yep.

        I will traverse all the data

        The data lives on another server

        Pulling it all at once is just faster*

        I acutlly have both selectall_hashref and selectall_arrayref($sql, {Slice => {}}) in my code, each where it is best (I believe)

        * I should probably benchmark that

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11111226]
Approved by marto
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2020-01-27 18:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?