http://www.perlmonks.org?node_id=938038

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I have a hash of several hundred arrays. The keys are integers, the arrays can be up to several thousand items each.

Eg,

%hash = ( 3 => [ ... ], 4 => [ ... ], 6 => [ ... ], 7 => [ ... ], 8 => [ ... ], 11 => [ ... ], ... );

The keys need to be processed in order -- say small to large -- in groups of a fixed integer intervals.

Ie. If the interval were 4, then first all the items in the arrays keyed by 3,4,6 & 7 must be processed together. Then all those in the arrays keyed by 4, 6, 7 & 8; then those keyed by 6,7 & 8; then 7, 8, 11.

But, the processing within each group involves processing every item against every other item within the group. And when the first group (3,4,6,7) has been processed, all those have been processed against one another already, so once the 3 array is removed and the 8 array is added, all that need to be processed is the 8 array against those in 4, 6 & 7 arrays.

Then the 4 array gets dropped and there is nothing to add, so no work to do. Then the 6 array gets dropped and the 11 array need to be processed against the 7 & 8 arrays.

I hope that is a clear explanation.

The question is, how to code that?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: How to code this?
by GrandFather (Saint) on Nov 15, 2011 at 00:09 UTC

    Something like this:

    #!/usr/lib/perl use strict; use warnings; my $kPoolSize = 2; my %hash = ( 3 => [ 1 .. 2 ], 4 => [ 2 .. 2 ], 6 => [ 4 .. 5 ], 7 => [ 2 .. 6 ], 8 => [ 1 .. 3 ], 11 => [ 5 .. 10 ], ); my @indexes = sort {$a <=> $b} keys %hash; my @pool = splice @indexes, 0, $kPoolSize; while (@pool == $kPoolSize) { my $poolSum; for my $poolEntry (@pool) { $poolSum += $_ for @{$hash{$poolEntry}}; } print "Pool @pool sum: $poolSum\n"; shift @pool; push @pool, shift @indexes if @indexes; }

    Prints:

    Pool 3 4 sum: 5 Pool 4 6 sum: 11 Pool 6 7 sum: 29 Pool 7 8 sum: 26 Pool 8 11 sum: 51
    True laziness is hard work

      The interval is not the number of arrays within the group, it is the interval covered by the integer keys to the arrays in the group.

      Therefore, with an interval of 2, 4 & 6 are not in the same group. Neither are 8 & 11.

      Also, to avoid redoing work already done, when a new array is added to the group, it is necessary not only to discard any old arrays that are no longer a part of the group (Ie. no longer covered by the interval), but distinguish between those arrays that remain (from the previous group), and the new one that has been added.

      It is quite tricky to even describe, let alone code!


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Oh, I missed that! Something like this then:

        #!/usr/lib/perl use strict; use warnings; my $kPoolSpan = 2; my %hash = ( 3 => [1 .. 2], 4 => [2 .. 2], 6 => [4 .. 5], 7 => [2 .. 6], 8 => [1 .. 3], 11 => [5 .. 10], ); my @indexes = sort {$a <=> $b} keys %hash; my @pool; while (@indexes) { my $oldPoolSize = @pool; push @pool, shift @indexes if !@pool; push @pool, shift @indexes while @indexes && $pool[0] + $kPoolSpan >= $indexes[0]; my $poolSum; my @new = @pool[$oldPoolSize .. $#pool]; for my $poolEntry (@pool) { $poolSum += $_ for @{$hash{$poolEntry}}; } print "Pool @pool sum (added @new): $poolSum\n"; shift @pool while @indexes && @pool && $pool[0] + $kPoolSpan < $in +dexes[0]; }

        Prints:

        Pool 3 4 sum (added 3 4): 5 Pool 4 6 sum (added 6): 11 Pool 6 7 8 sum (added 7 8): 35 Pool 11 sum (added 11): 45
        True laziness is hard work

        It seems to me, therefore, that the event of “a new array is added to the group” is .. “somehow, quite significant.”   That is to say, the definition of “the problem that is to be solved” is in some way dynamic.   (Perhaps it is in some way reliant upon what has already been solved?)

        I cordially suggest that this might well prove to be a crux of the problem:   that the most-desirable solution to this problem is not one that can be resolved statically, but rather, one that is in some way either “dependent on,” or “rather profoundly affected by,” adding-and-removing.   Perhaps this is what makes the problem “difficult to describe.”

        Is this problem, one of such a nature that you could describe “an optimal solution” without considering “(therefore...?) what happens next?”   If the answer to that question is, “no,” then ... aye, the problem is most difficult to describe.   In that case, the outcome of a particular solution is very heavily weighted by what its consequences will be, and that becomes an altogether different class of problem.

      No. The size of @pool changes - what's given is the $pool[-1] - $pool[0] <= 4.
Re: How to code this?
by CountZero (Bishop) on Nov 14, 2011 at 23:09 UTC
    Perhaps an Array of Arrays is a better kind of storage?

    There will be "holes" in your AoA of course, but you can use a sliding window of (for example) length 4 and it will be easy to see what needs to be dropped or added. The low limit needs to be dropped and the upper limit needs to be added and it is easy to check if they contain anything or not.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Howdy!

      Given subsequent clarifications, I think this may be a very effective approach. If you look at the first n items in the array, you will get the entire group each time. Once you have handled a set, shift the first element off and any subsequent empty elements. Lather, rinse, repeat.

      yours,
      Michael
Re: How to code this?
by aaron_baugher (Curate) on Nov 15, 2011 at 14:18 UTC

    I thought I understood the requirements, but some of the later comments and the talk of "dropping" arrays are making me wonder again. You said "the processing within each group involves processing every item against every other item within the group." So if an interval happens to span 6,7,8, you want to process 6&7, 6&8, and 7&8, right? But you want to make sure no pair is processed more than once?

    If that's correct, then I'd picture it as a sliding window of size $interval+1, and I'd use a hash to keep track of which pairs of items have been processed. In pseudo-code, something like this:

    create a hash to hold processed pairs, with a key like "3 4" to indicate that 3 has been processed against 4 get minimum and maximum key values (3 & 11 in this case) loop from minimum to maximum minus interval if interval contains more than one item for each pair of items in interval check hash to see if the pair has been processed if not process the pair mark the pair in the hash

    No need to "drop" anything that I can see, since the sliding interval will move along to the next set. In your example, with an interval of 4, this would happen:

    minimum = 3, maximum = 11, so loop from 3 to 7 first interval is 3-7, so includes 3,4,6,7 process 3 & 4 process 3 & 6 process 3 & 7 process 4 & 6 process 4 & 7 process 6 & 7 next interval is 4-8, so includes 4,6,7,8 skip 4 & 6, already done skip 4 & 7, already done process 4 & 8 skip 6 & 7, already done process 6 & 8 process 7 & 8 next interval is 5-9, so includes 6,7,8 skip 6 & 7, already done skip 6 & 8, already done skip 7 & 8, already done next interval is 6-10, so includes 6,7,8 skip 6 & 7, already done skip 6 & 8, already done skip 7 & 8, already done next interval is 7-11, so includes 7,8,11 skip 7 & 8, already done process 7 & 11 process 8 & 11

    Sound about right? Or am I completely missing something that makes this harder than I'm thinking?

    Aaron B.
    My Woefully Neglected Blog, where I occasionally mention Perl.

Re: How to code this?
by choroba (Cardinal) on Nov 15, 2011 at 01:26 UTC
Re: How to code this?
by sundialsvc4 (Abbot) on Nov 14, 2011 at 23:38 UTC

    I’ll assume that the volume of data in question is not such that storing the data “in memory” is not a potential problem, and therefore that the possible suggestion of using SQL JOINs is not useful or necessary.

    I would start by organizing the data into a tree, by group-number.   So you’d have something like this:

    %tree = ( 1 => { 3 => [ ... ], }, 4 => { 4 => [ ... ], 6 => [ ... ], 7 => [ ... ], }, 8 => { 8 => [ ... ], 11 => [ ... ], } ... );

    I admit that the requirement is not fully clear to me, especially when you start saying things like “the 3 array is removed and the 8 array is added.”   But even so, the general strategy ought to work, thanks in part to “references” making it possible for the same list to (if necessary) literally occupy more than one place in the tree at the same time.   You’d just wind up with more than one arrayref pointing to the same array-object, at different places within the tree, and Perl would see to it that the arrayref didn’t get garbage-collected until the last reference was consumed.

Re: How to code this?
by Anonymous Monk on Nov 15, 2011 at 01:30 UTC

      And your suggestion is?

      Update: 2 hours later and I see you couldn't think of anything better either. Vindication.

        How about, "Manipulate groups of arrays in sequence".

        I am never Anonymous Monk, but I, too, was shocked to see your name associated with this vague title.

      I appreciate the thought fellow anonymous, but since the problem is quite tricky to even describe, let alone code! , how would you improve the title?

Re: How to code this?
by sundialsvc4 (Abbot) on Nov 15, 2011 at 12:00 UTC

    Since I will politely assume that you do want “another set of eyes” looking with you at this problem, and given that (as you say) the problem is difficult to describe, let me try again.

    This appears to me to fundamentally be a problem involving groups, the members of which happen to be “very large arrays” and which are identified by a numerical key that falls within the appointed (but arbitrary) boundaries of that group.   The membership of each group changes constantly, and it is significant to the problem that the status of each group-member (“recently added”) is known.   You said that the fact that a new item has been added to a group is significant in deciding when, whether, and how to process it.

    (What I am calling “an item” consists of:   an arrayref which is the list of items (and which may or may not be the only such reference that exists); a key-value which identifies it; perhaps a group-number; and a status-indicator e.g. “recently added.”)

    Storage management is an issue, of course, because you don’t need large things taking up memory when you no longer have any use for them, but I see this as an issue that Perl already knows how to take care of for itself.     I also observed how this enables an arrayref to be in multiple places within the structure at one time.

    Perhaps the boundaries of a group are not fixed.   Perhaps with an arbitrary group-size of, let us say, “4,” a “group” is defined by all elements in the inclusive range [n ... n-(4-1)] for any key n that we happen to stumble-upon in the data.   But, once a group has been identified and its membership has been determined, so far it appears that the subsequent processing that takes place always happens within that one group.

    I think that what you’re saying is that, once you have identified a group and determined what’s in it, the actual processing of that group is easy to describe.

    Am I still on the right track?

    Is this the sort of problem that might be usefully described as involving a “work to-do list?”   Say, when you “add a new group,” this establishes it as work-to-do and so you throw it into the bucket (perhaps it is a sorted list ...) and that is part of what determines what the algorithm is to do next.

    I know I’m shooting in the dark here, but ... “HTH.”