Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses

Assemble times into ranges

by Dominus (Parson)
on May 05, 2001 at 22:31 UTC ( #78271=perlmeditation: print w/replies, xml ) Need Help??

A guy posted a question to p5p today. It was in the wrong place (p5p is not for questions about programming in Perl) so I sent him away, but it was still an interesting question.

He has a hash %interval where the keys are start times and the values are corresponding end times. He also has an array @time of interesting times. He want to know which elements of @time occur inside one of the intervals.

Here's something like his original code:

while (($start, $end) = each %interval) { for $time (@time) { if ($time >= $start && $time < $end) { + push @match, $time; } } }

He has a problem: This is too slow. %interval contains 10,000 intervals, and @time contains 5,000 times. He needs it to be faster.

I can see that improvements might be possible on two levels. One could improve the code without changing the data structures. For example, you could sort keys %interval and @time and then go through the @time items in sorted order, ignoring the start-times that were too late. You could replace the inner loop with grep.

Probably a better approach would be to improve the data structures. I can't think offhand of a good way to store the information, however. (One thing that does come to mind is that he should preprocess the intervals, and merge any intervals that overlap into one large interval. Then he could make a single pass over both lists.)

Mark Dominus
Perl Paraphernalia

Replies are listed 'Best First'.
Re (tilly) 1: Assemble times into ranges
by tilly (Archbishop) on May 06, 2001 at 01:15 UTC
    I think a better data structure is the answer.

    He should use DB_File with a BTREE and specify a custom sort order if it makes sense. In there he should put the times that he changes from inside to outside of any interval, and what he did. He can then use the OO interface, and use the seq method (with the R_CURSOR flag) to find the next action efficiently, and from that he can figure out his current state.

    Here is example code demonstrating this that goes one step further. Note that the API I am coding to has some things I don't like (sorry, I don't like destructive methods) but I am more than willing to put up with it for the functionality provided:

    #!/usr/bin/perl -w use strict; use DB_File; # The original data structure my %interval = qw( 5 50 20 30 70 90 120 130 ); # The times my @times = qw(4 9 30 98 125 500); # Let us rearrange the interval information # into events consisting of a time/action. my @events = map {[$_, 1]} keys %interval; push @events, map {[$_, -1]} values %interval; # Let us turn this array into something we can # do a fast lookup on... # First we need a useful comparison order: $DB_BTREE->{'compare'} = sub {$_[0] <=> $_[1]}; # Next we need a hash to store information in my %interval_end; tie( %interval_end, 'DB_File', undef, O_RDWR|O_CREAT, 0640, $DB_BTREE ) or die "Cannot tie hash: $!"; # Now store the information. my $nesting = 0; foreach my $event (sort {$a->[0] <=> $b->[0]} @events) { my ($time, $action) = @$event; # Careful, multiple intervals may share a boundary... unless (exists $interval_end{$time}) { $interval_end{$time} = $nesting; } $nesting += $action; } # Now we get the tied object so we can do our fast lookup. my $interval_depth = tied %interval_end; # And now let us get our answers: foreach my $time (@times) { my $ended = $time; my $depth = 0; $interval_depth->seq($ended, $depth, R_CURSOR); print "$time is in an interval ending at $ended of depth $depth\n"; } __END__ Output: 4 is in an interval ending at 5 of depth 0 9 is in an interval ending at 20 of depth 1 30 is in an interval ending at 30 of depth 2 98 is in an interval ending at 120 of depth 0 125 is in an interval ending at 130 of depth 1 500 is in an interval ending at 500 of depth 0
    Note that the final entry gave a somewhat non-intuitive answer. And there could be more error checking. Plus all intervals have been made open on the left and closed on the right. But oh well.

    (This should be efficient on his original data set.)

    There is more than a little irony in my pointing out the capabilities of BTrees to Dominus. The first place that I heard of them was this article. (Which I enjoyed quite a bit.)

Re (tilly) 1 (duh): Assemble times into ranges
by tilly (Archbishop) on May 06, 2001 at 04:11 UTC
    The BTree answer was horrible overkill.

    It is a great idea if you needed to process the timepoints as they come in. But a naive sorting solution is much simpler and gets better control. After all we have to sort the beginning and end of the intervals Just toss in the timepoints as well. For instance if you want to get closed intervals, you can do it like this:

    #!/usr/bin/perl -w use strict; # The original data structure my %interval = qw( 5 50 20 30 70 90 120 130 ); # The times my @times = qw(4 9 30 98 125 500); # Current nesting depth my $depth = 0; # Let us rearrange the interval information # into events consisting of a time/action my @events = map {[$_, 0, sub {++$depth}]} keys %interval; push @events, map {[$_, 2, sub {--$depth}]} values %interval; push @events, map { my $time = $_; [$_, 1, sub {print "Time $time is at depth $depth\n"}] } @times; # Sort it and read out the answer: foreach my $event ( sort { $a->[0] <=> $b->[0] or $a->[1] <=> $b->[1] } @events ) { $event->[2]->(); }
    (And, of course, you can make the whole thing much cleaner, this is just proof of concept code.)
      More out of curiousity's sake: with this structure, is it possible to determine *which* intervals the events are in (as the original problem seems to ask?). You definitely can determine how many intervals each event falls into via the depth? As per my suggestion, I think you need at least another data structure as you walk the @events array to store which intervals are active.

      Dr. Michael K. Neylon - || "You've left the lens cap of your mind on again, Pinky" - The Brain

        The depth variable tells you how many events fall in it. If you want to instead know the intervals, you could use a hash there instead. Click for code if you want an example...
Re: Assemble times into ranges
by Masem (Monsignor) on May 05, 2001 at 22:58 UTC
    This idea requires a bit of set up, but it only goes through the lists twice (eg O(n+m), instead of O(n*m)) with one sort (O(nlogn)).

    For each iterval, add 2 items to a hash, one with key:value of the start time and a unique interger (eg $i++); the second with the end time and the negative of that integer.

    For each even, add one item to the same hash, key the event time, and value 0.

    Grab the keys of the hash, and sort on time, with secondary sort on the ascending order of the absolute value of the hash value Eg, for three values with teh same key, the order would be something like 0, -3, 5.

    Now, just iterate along the keys. If the value of a key is a positive integer, then in another hash (interval_hash), set that key to some value (eg, if the integer is 3, the interval_hash would be key 3, value 1). If a negative interger, undef that value from the interval_hash. If zero, then in a third hash (occured_hash), for each key currently in interval_hash, push the time onto an array for that interval. So, if time is currently 10, and interval_hash contains keys 3, 4, and 7, then for occured_hash keys 3,4, and 7, you'd push 10 onto the arrays that are referenced by this.

    When you are all done, then the values for a given interval key in occured_hash will be an array with all the times that fall into those events.

    Again, this should be faster based on the order calculations, but you'd have to try it out and make sure that the little things like hash manipulations don't slow it down too much (it shouldn't).

    UPDATE: ok, the trick on the secondary sorting (dealing with intervals and events that occur at the same time), partially has to do if you want inclusive or exclusive intervals. If you want inclusive, then the secondary sort should be in decending, non-absolute order (eg 5, 0, -3 ), such that you 'initiate' new intervals before you process the events, and then 'deactivate' any intervals after processing. If you want 'exclusive', then sort in increasing order (-3, 0, 5) to delete intervals before dealing with events, then creating new ones.

    Dr. Michael K. Neylon - || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Assemble times into ranges
by dws (Chancellor) on May 05, 2001 at 23:26 UTC
    Probably a better approach would be to improve the data structures.

    How about a brute force space-for-time tradeoff?

    A 3.8Mb bitmap can represent an entire year's worth of seconds. Allocate and clear the bitmap, then turn on bits for second within intervals. The probe to determine whether a given time occurs within at least one interval is O(1).

    This merges intervals, so it's not useful if you're trying to determine which intervals a given time occurs within. A tree structure is more appropriate for that.

    "It's not the Zen way, but it gets the job done." --Garrison Keillor

Re: Assemble times into ranges
by greenFox (Vicar) on May 06, 2001 at 18:01 UTC

    This may be a naive answer but I would treat it as a math problem. If the intervals are of a constant size then finding the appropriate interval is a matter of division. If the intervals are of variable length then I would fundge it. I would create another structure keyed by constant time intervals and have the value be a list of the appropriate variable length intervals. Find the position in the coarser constant length structure by math and then hand to a search routine to find the position in the variable length structure. Then I would only be searching through a small number for each time.

    I hope I explained that OK :)

    my $chainsaw = 'Perl';

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://78271]
Approved by root
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (2)
As of 2019-04-20 18:43 GMT
Find Nodes?
    Voting Booth?
    I am most likely to install a new module from CPAN if:

    Results (110 votes). Check out past polls.

    • (Sep 10, 2018 at 22:53 UTC) Welcome new users!