A guy posted a question to p5p today.
It was in the wrong place (p5p is not for questions about
programming in Perl) so I sent him away, but it was still
an interesting question.
He has a hash %interval where the keys are
start times and the values are corresponding end times.
He also has an array @time of interesting times.
He want to know which elements of @time
occur inside one of the intervals.
Here's something like his original code:
while (($start, $end) = each %interval) {
for $time (@time) {
if ($time >= $start && $time < $end) {
+
push @match, $time;
}
}
}
He has a problem: This is too slow. %interval contains
10,000 intervals, and @time contains 5,000 times.
He needs it to be faster.
I can see that improvements might be possible on two levels.
One could improve the code without changing the data structures.
For example, you could sort keys %interval and
@time and then go through the @time items
in sorted order, ignoring the starttimes that were too late.
You could replace the inner loop with grep.
Probably a better approach would be to improve the data
structures. I can't think offhand of a good way to store
the information, however. (One thing that does come to mind
is that he should preprocess the intervals, and
merge any intervals that overlap into one large interval.
Then he could make a single pass over both lists.)

Mark Dominus
Perl Paraphernalia
Re (tilly) 1: Assemble times into ranges
by tilly (Archbishop) on May 06, 2001 at 01:15 UTC

I think a better data structure is the answer.
He should use DB_File with a BTREE and specify a custom
sort order if it makes sense. In there he should put the
times that he changes from inside to outside of any
interval, and what he did. He can then use the OO
interface, and use the seq method (with the R_CURSOR flag)
to find the next action efficiently, and from that
he can figure out his current state.
Here is example code demonstrating this that goes one
step further. Note that the API I am coding to has some
things I don't like (sorry, I don't like destructive
methods) but I am more than willing to put up with it for
the functionality provided:
#!/usr/bin/perl w
use strict;
use DB_File;
# The original data structure
my %interval = qw(
5 50
20 30
70 90
120 130
);
# The times
my @times = qw(4 9 30 98 125 500);
# Let us rearrange the interval information
# into events consisting of a time/action.
my @events = map {[$_, 1]} keys %interval;
push @events, map {[$_, 1]} values %interval;
# Let us turn this array into something we can
# do a fast lookup on...
# First we need a useful comparison order:
$DB_BTREE>{'compare'} = sub {$_[0] <=> $_[1]};
# Next we need a hash to store information in
my %interval_end;
tie(
%interval_end, 'DB_File', undef,
O_RDWRO_CREAT, 0640, $DB_BTREE
) or die "Cannot tie hash: $!";
# Now store the information.
my $nesting = 0;
foreach my $event (sort {$a>[0] <=> $b>[0]} @events) {
my ($time, $action) = @$event;
# Careful, multiple intervals may share a boundary...
unless (exists $interval_end{$time}) {
$interval_end{$time} = $nesting;
}
$nesting += $action;
}
# Now we get the tied object so we can do our fast lookup.
my $interval_depth = tied %interval_end;
# And now let us get our answers:
foreach my $time (@times) {
my $ended = $time;
my $depth = 0;
$interval_depth>seq($ended, $depth, R_CURSOR);
print "$time is in an interval ending at $ended of depth $depth\n";
}
__END__
Output:
4 is in an interval ending at 5 of depth 0
9 is in an interval ending at 20 of depth 1
30 is in an interval ending at 30 of depth 2
98 is in an interval ending at 120 of depth 0
125 is in an interval ending at 130 of depth 1
500 is in an interval ending at 500 of depth 0
Note that the final entry gave a somewhat nonintuitive
answer. And there could be more error checking. Plus all
intervals have been made open on the left and closed on the
right. But oh well.
(This should be efficient on his original data set.)
UPDATE
There is more than a little irony in my pointing out the
capabilities of BTrees to Dominus. The first place
that I heard of them was
this article. (Which I
enjoyed quite a bit.)  [reply] [d/l] 
Re (tilly) 1 (duh): Assemble times into ranges
by tilly (Archbishop) on May 06, 2001 at 04:11 UTC

The BTree answer was horrible overkill.
It is a great idea if you needed to process the timepoints
as they come in. But a naive sorting solution is much
simpler and gets better control. After all we have to sort
the beginning and end of the intervals Just toss in the
timepoints as well. For instance if you want
to get closed intervals, you can do it like this:
#!/usr/bin/perl w
use strict;
# The original data structure
my %interval = qw(
5 50
20 30
70 90
120 130
);
# The times
my @times = qw(4 9 30 98 125 500);
# Current nesting depth
my $depth = 0;
# Let us rearrange the interval information
# into events consisting of a time/action
my @events = map {[$_, 0, sub {++$depth}]} keys %interval;
push @events, map {[$_, 2, sub {$depth}]} values %interval;
push @events, map {
my $time = $_;
[$_, 1, sub {print "Time $time is at depth $depth\n"}]
} @times;
# Sort it and read out the answer:
foreach my $event (
sort {
$a>[0] <=> $b>[0] or $a>[1] <=> $b>[1]
} @events
) {
$event>[2]>();
}
(And, of course, you can make the whole thing much cleaner,
this is just proof of concept code.)  [reply] [d/l] 

 [reply] 

The depth variable tells you how many events fall in it.
If you want to instead know the intervals, you could use a
hash there instead. Click for code if you want an
example...
 [reply] [d/l] 
Re: Assemble times into ranges
by Masem (Monsignor) on May 05, 2001 at 22:58 UTC

This idea requires a bit of set up, but it only goes through the lists twice (eg O(n+m), instead of O(n*m)) with one sort (O(nlogn)).
For each iterval, add 2 items to a hash, one with key:value of the start time and a unique interger (eg $i++); the second with the end time and the negative of that integer.
For each even, add one item to the same hash, key the event time, and value 0.
Grab the keys of the hash, and sort on time, with secondary sort on the ascending order of the absolute value of the hash value Eg, for three values with teh same key, the order would be something like 0, 3, 5.
Now, just iterate along the keys. If the value of a key is a positive integer, then in another hash (interval_hash), set that key to some value (eg, if the integer is 3, the interval_hash would be key 3, value 1). If a negative interger, undef that value from the interval_hash. If zero, then in a third hash (occured_hash), for each key currently in interval_hash, push the time onto an array for that interval. So, if time is currently 10, and interval_hash contains keys 3, 4, and 7, then for occured_hash keys 3,4, and 7, you'd push 10 onto the arrays that are referenced by this.
When you are all done, then the values for a given interval key in occured_hash will be an array with all the times that fall into those events.
Again, this should be faster based on the order calculations, but you'd have to try it out and make sure that the little things like hash manipulations don't slow it down too much (it shouldn't).
UPDATE: ok, the trick on the secondary sorting (dealing with intervals and events that occur at the same time), partially has to do if you want inclusive or exclusive intervals. If you want inclusive, then the secondary sort should be in decending, nonabsolute order (eg 5, 0, 3 ), such that you 'initiate' new intervals before you process the events, and then 'deactivate' any intervals after processing. If you want 'exclusive', then sort in increasing order (3, 0, 5) to delete intervals before dealing with events, then creating new ones.
Dr. Michael K. Neylon  mneylonpm@masemware.com

"You've left the lens cap of your mind on again, Pinky"  The Brain
 [reply] 
Re: Assemble times into ranges
by dws (Chancellor) on May 05, 2001 at 23:26 UTC

Probably a better approach would be to improve the data structures.
How about a brute force spacefortime tradeoff?
A 3.8Mb bitmap can represent an entire year's worth of seconds. Allocate and clear the bitmap, then turn on bits for second within intervals. The probe to determine whether a given time occurs within at least one interval is O(1).
This merges intervals, so it's not useful if you're trying to determine which intervals a given time occurs within. A tree structure is more appropriate for that.
"It's not the Zen way, but it gets the job done." Garrison Keillor
 [reply] 
Re: Assemble times into ranges
by greenFox (Vicar) on May 06, 2001 at 18:01 UTC

This may be a naive answer but I would treat it as a math problem. If the intervals are of a constant size then finding the appropriate interval is a matter of division. If the intervals are of variable length then I would fundge it. I would create another structure keyed by constant time intervals and have the value be a list of the appropriate variable length intervals. Find the position in the coarser constant length structure by math and then hand to a search routine to find the position in the variable length structure. Then I would only be searching through a small number for each time. I hope I explained that OK :)
 my $chainsaw = 'Perl';
 [reply] 

