baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Well i have a question for you guys. so what i need is, i need to do some interval searching. Example: I have a set of random (randomly distributed) intervals (all related in the following way a<=b, where a and b are start and stop positions of intervals)
now what i need to do is to compute the coverage of a certain position in my interval. so if i have :

```1..10 - interval
and

2..5 - subinterval
4..7 - subinterval

my coverage of the 1-10 would be :
```1->0
2->1
3->1
4->2
5->2
6->1
7->1
8->0
9->0
10->0
the naive solution to do this would be to create an array and then for every interval position ++ the value indexed by the position. BUT this is not an option since i don't have enough memory on my machine (the intervals are as long ad 10^9 positions). Now what i could do is something like this:
```
#!/usr/bin/perl

use Bit::Vector;
use Data::Dumper;
use strict;

my @vecArray;
my \$x=0;
while(\$x<4){
\$vecArray[\$x] =  Bit::Vector->new(21);
my \$int = int(rand(21));
my \$min = \$int;
my \$max = \$int<=20-5 ? \$int+5 : \$int;
\$vecArray[\$x]->Interval_Fill(\$min,\$max);
\$x++;
}
my \$set1=  Bit::Vector->new(21);
\$set1->Intersection(\$vecArray,\$vecArray);
my \$start = 0;

while (\$start < \$set1->Size() and my (\$min, \$max) = \$set1->Interval_Sc
+an_inc(\$start)){
print \$min . "-" . \$max . "\n";
\$start = \$max+1 ;
}

so, create a vector (or should i say several of them) and fill it with subinterval positions... Oh by the way every set of subintervals contains only non-overlapping intervals, meaning there cannot be a case as stated above , but the subintervals from different sets CAN overlap. so in the above case one subinterval would be from one set and the other from a different set(The number of intervals in both subinterval sets can be as high as interval/avg(length of subinterval)).
So this approach is very cheap when it comes to memory and computation and is very practical since i can pinpoint the location of two intersecting sets,but how to do a multi intersection and to do it efficiently, meaning no pairwise comparisons(it would PROBABLY be unacceptable since the number of subsets can rise to the size of the underlying interval), is is a completely different thing. So this is the question i cannot answer or am i just too stupid and tired to see the solution (I had a feeling that, that last beer was a little too much, dammit :)). Summary:
```
|interval| = X
|subinterval| = X/n  (n- number of intervals)
Y=X  (Y - nubmber of subinterval sets)

O(X^2)   - not good if implemented naively

Does someone has a suggestion , tip, alternative approach ??

Cheers

UPDATE: Thanks ppl, these three replays gave me quite a lot to think about. so, to give a quick global thank you and a few answers to monks that have replied so far.
I know you guys are not mind readers but you are really good at it :). so:

zek152 : This is a nice idea and as soon as you posted it i started to work on it, actually i knew i read about it somewhere before (never had a need to actually do any implementation though) so when i saw your post i started working on it straight away. but my original idea was the same as the one the BrowserUk posted, though i didn't realised that it can be achieved it the way he did it. so the answer to BrowserUk's question is : yes. the thing i had in mind when i was writing this post was that i have to do a pair-wise comparison for every two possible pairs of vectors then we are talking about "for within a for loop" and that is O(x^2) but as you pointed out looping the intersect through vectors does the same thing essentially (meaning useful to me) but in the O(x) time. so once more a fresh pair of eyes see better then sleepy old ones.

Of course searching through start and stop indexes is a good solution but unfortunately my intervals are rather short (sorry for not stating that) which means too large index arrays... But thanx.

So thank you guys once more and this is definitely going to be a good reference post for me , of course if anyone else has an alternative strategy he/she is more then welcome to post it :)

Cheers

Replies are listed 'Best First'.
Re: searching a vector array
by Grimy (Pilgrim) on Jun 09, 2011 at 17:17 UTC
If there's many more numbers than intervals, iterating over the number is indeed a bad idea. But if I understand your demand correctly, the desired can be achieved by iterating over intervals (or rather, over their min and max values: 2..5 and 4..7 is the same as 2..7 and 4..5 regarding coverage).

So you would have to create two arrays (e.g. @min and @max) holding respectively the lower and higher bounds of all intervals. Then:

```my \$count = 0;
foreach \$min (@min) {
\$count++ if (\$_ >= \$min)
}
foreach \$max (@max) {
\$count-- if (\$_ > \$max)
}
return \$count;
Would give the coverage of \$_. Of course, it can be optimized further by sorting @min and @max beforehand and using dichotomy, which would get you an O(ln(X)).
Re: searching a vector array
by BrowserUk (Pope) on Jun 09, 2011 at 19:12 UTC
,but how to do a multi intersection and to do it efficiently,

Finding the intersection of multiple bit vectors is just a case of bitwise-anding them together:

```#! perl -slw
use strict;

our \$V //= 20;
our \$L //= 72;

my @vecs = map {
pack 'C*', 0xf0, map rand( 256 ), 1 .. ((\$L-8) / 8)+1;
} 1 .. \$V;

printf "%2d : %s\n", \$_, unpack 'b*', \$vecs[ \$_ ]
for 0 .. \$#vecs;

my \$intersect = \$vecs[ 0 ];
\$intersect &= \$vecs[ \$_ ] for 1 .. \$#vecs;

printf "** : %s\n", unpack 'b*', \$intersect;

__END__
C:\test>908952 -L=64 -V=40
0 : 00001111101000110111011000000010111100010110001010101011010111010
+0011011
1 : 00001111110100001011011011110111100101000011000101101101000101000
+1110011
2 : 00001111001111110100011011101111000101010000111001011110011010110
+0001111
3 : 00001111001000110011101001010110001010000100010010011101110100011
+1101100
4 : 00001111101010011111111001111101000101100110100111101011100100101
+0010000
5 : 00001111111101111101100001000001101011100110011110001111100010010
+0001110
6 : 00001111110010001001011000101100100001101101001110111010100100110
+1100111
7 : 00001111100100010011101101001100011000111101110001111011010000110
+0010011
8 : 00001111011010110100010000011011000111100001010001000110010110100
+1100110
9 : 00001111000000101001110101010101000001110011101001101110001000101
+0101101
10 : 00001111010000111101010000001101110000001100111000100010110101110
+0000101
11 : 00001111110010001000011100110111011110000011111101100010100111100
+0011111
12 : 00001111100110111100010100010111110101110110110111100001100111111
+1100000
13 : 00001111100011111011010101001110111111101001011110001001000001010
+1111101
14 : 00001111111110111001011010110110101100111110011100101111101101000
+1010001
15 : 00001111100110101110001110010011101000110110010100001110110011101
+1111101
16 : 00001111111011110001110011001001011010000100000110111101001111111
+1110101
17 : 00001111001111111010001010001100010010010100100000101111001100111
+0011010
18 : 00001111011000100101001101001101100101001011010001011001000100100
+1011101
19 : 00001111110110100111001101000010101100011000000000100010001111001
+1011101
20 : 00001111010010011101111110100011111011100110101011100111111111111
+1100011
21 : 00001111011001101110110110100001111111101100011100111101011011011
+0011111
22 : 00001111111110110000011110110000011001011100101101010011001111010
+1110111
23 : 00001111111110001000010001011000010011100111101010001000011001001
+0101100
24 : 00001111010010000000111110001101000011010010000101110010111001101
+0000110
25 : 00001111100111000000010000101010101010111000111111010110011010001
+0011101
26 : 00001111001010011000010010011010111111111000111111000001110010110
+0110001
27 : 00001111011011011100101111011010011010010010100110110001110100100
+0111000
28 : 00001111010101110010000011001011111100010001011101000101000001100
+1101100
29 : 00001111100100010001000111010011010100101110101000110111010101101
+0111010
30 : 00001111101110101011011111001101010010001100010010010100001110011
+0000101
31 : 00001111011011000011101101101010001000100010000011101011111100010
+0010000
32 : 00001111110010010011000011110111100100001100011110111111000001110
+0010101
33 : 00001111100011101101101110000110001111011101110101001001100111111
+0001010
34 : 00001111011011000111000100100001100101111000100110010001010011110
+1000111
35 : 00001111111111010000010010100101101101110110100111101110101110010
+0011100
36 : 00001111101100010100001101001000010111000000011011111011001001010
+0000011
37 : 00001111001010001000111101001011110011000001001000011000100010111
+1100101
38 : 00001111001001000100110000111110001011011001011100011011101011101
+0110101
39 : 00001111000111101110011001010100101101001101110110111010111001110
+0101001
** : 00001111000000000000000000000000000000000000000000000000000000000
+0000000

And that's an O(N) (and very efficient O(N)) process. Or have I misread you?

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: searching a vector array
by zek152 (Pilgrim) on Jun 09, 2011 at 16:30 UTC

I am not quite clear if this fulfills the requirements but if all you need to do is to see if a single value is covered you could implement a binary search tree.

The binary search tree could be arranged based on the starting location. Then if the point you are considering is less than or equal to the starting point of a subinterval you go down the left branch. If it is greater than the starting value you check to see if it is less (or equalto) than the ending value. If so then the point is covered. If not then you continue traversing the tree in a recursive manner. if you make it to a null link then the point is not covered.

Hopefully this made sense. The search complexity would be O(log(#subintervals))

```interval (1,20]
subintervals (1,3],(5,10],(11,12],(15,17]
A (sort of) balanced tree would look like:
(5,10]
(1,3]  (15,17]
(11,12]

Update: I didn't have time to finish the example. I also made a correction to the traversal algorithm (in italics)

```If you consider the number 4 with the above tree
4<=5 so you traverse down the left side.
4>1 && 4>3 so you take the right child
the right child is null so 4 is not covered.
Considering 12.
12>10 so go down right branch
12<=15 so go down left branch
12>11 and 12<=12 so 12 is covered.<

I plan on working out an efficient algorithm for finding and removing subsets of subsets (removing (1,3] when (1,5] is a subset). I will post when complete.

After doing some research I have found the notion of an Interval_tree. This seems like a great data structure for your problem.

Re: searching a vector array
by johngg (Canon) on Jun 09, 2011 at 23:42 UTC

If you have your sub-intervals in an array of hashes (minimum and maximum values) sorted in ascending minimum value, you can just work your way through the interval from beginning to end. You can shift off sub-intervals once you have passed their maximum value so save wasted tests. This method will not take up much memory as you are not having to build large vectors, only store minimum and maximum values.

```#!/usr/bin/perl
#
use strict;
use warnings;

my \$intervalStr = q{1..25};

my @subIntervalStrs = qw{
3..12
7..14
11..19
12..17
};

my ( \$begin, \$end ) = split m{\.\.}, \$intervalStr;
my @subIntervals =
sort { \$a->{ min } <=> \$b->{ min } }
map
{
my ( \$min, \$max ) = split m{\.\.}, \$_;
{ min => \$min, max => \$max };
}
@subIntervalStrs;

for my \$value ( \$begin .. \$end )
{
my \$count = 0;

shift @subIntervals while
@subIntervals
and
\$value > \$subIntervals[ 0 ]->{ max };

for my \$subInterval ( @subIntervals )
{
if ( \$value < \$subInterval->{ min } )
{
last;
}
elsif ( \$value > \$subInterval->{ max } )
{
next;
}
else
{
\$count ++;
}
}

printf qq{%4d => %d\n}, \$value, \$count;
}

The output.

```   1 => 0
2 => 0
3 => 1
4 => 1
5 => 1
6 => 1
7 => 2
8 => 2
9 => 2
10 => 2
11 => 3
12 => 4
13 => 3
14 => 3
15 => 2
16 => 2
17 => 2
18 => 1
19 => 1
20 => 0
21 => 0
22 => 0
23 => 0
24 => 0
25 => 0