baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

Well i have a question for you guys. so what i need is, i need to do some interval searching. Example: I have a set of random (randomly distributed) intervals (all related in the following way a<=b, where a and b are start and stop positions of intervals)
now what i need to do is to compute the coverage of a certain position in my interval. so if i have :

1..10 - interval and 2..5 - subinterval 4..7 - subinterval
my coverage of the 1-10 would be :
1->0 2->1 3->1 4->2 5->2 6->1 7->1 8->0 9->0 10->0
the naive solution to do this would be to create an array and then for every interval position ++ the value indexed by the position. BUT this is not an option since i don't have enough memory on my machine (the intervals are as long ad 10^9 positions). Now what i could do is something like this:
#!/usr/bin/perl use Bit::Vector; use Data::Dumper; use strict; my @vecArray; my $x=0; while($x<4){ $vecArray[$x] = Bit::Vector->new(21); my $int = int(rand(21)); my $min = $int; my $max = $int<=20-5 ? $int+5 : $int; $vecArray[$x]->Interval_Fill($min,$max); $x++; } my $set1= Bit::Vector->new(21); $set1->Intersection($vecArray[0],$vecArray[2]); my $start = 0; while ($start < $set1->Size() and my ($min, $max) = $set1->Interval_Sc +an_inc($start)){ print $min . "-" . $max . "\n"; $start = $max+1 ; }
so, create a vector (or should i say several of them) and fill it with subinterval positions... Oh by the way every set of subintervals contains only non-overlapping intervals, meaning there cannot be a case as stated above , but the subintervals from different sets CAN overlap. so in the above case one subinterval would be from one set and the other from a different set(The number of intervals in both subinterval sets can be as high as interval/avg(length of subinterval)).
So this approach is very cheap when it comes to memory and computation and is very practical since i can pinpoint the location of two intersecting sets,but how to do a multi intersection and to do it efficiently, meaning no pairwise comparisons(it would PROBABLY be unacceptable since the number of subsets can rise to the size of the underlying interval), is is a completely different thing. So this is the question i cannot answer or am i just too stupid and tired to see the solution (I had a feeling that, that last beer was a little too much, dammit :)). Summary:
|interval| = X |subinterval| = X/n (n- number of intervals) Y=X (Y - nubmber of subinterval sets) O(X^2) - not good if implemented naively
Does someone has a suggestion , tip, alternative approach ??

Cheers

UPDATE: Thanks ppl, these three replays gave me quite a lot to think about. so, to give a quick global thank you and a few answers to monks that have replied so far.
I know you guys are not mind readers but you are really good at it :). so:

zek152 : This is a nice idea and as soon as you posted it i started to work on it, actually i knew i read about it somewhere before (never had a need to actually do any implementation though) so when i saw your post i started working on it straight away. but my original idea was the same as the one the BrowserUk posted, though i didn't realised that it can be achieved it the way he did it. so the answer to BrowserUk's question is : yes. the thing i had in mind when i was writing this post was that i have to do a pair-wise comparison for every two possible pairs of vectors then we are talking about "for within a for loop" and that is O(x^2) but as you pointed out looping the intersect through vectors does the same thing essentially (meaning useful to me) but in the O(x) time. so once more a fresh pair of eyes see better then sleepy old ones.

Of course searching through start and stop indexes is a good solution but unfortunately my intervals are rather short (sorry for not stating that) which means too large index arrays... But thanx.

So thank you guys once more and this is definitely going to be a good reference post for me , of course if anyone else has an alternative strategy he/she is more then welcome to post it :)

Cheers

Replies are listed 'Best First'.
Re: searching a vector array
by Grimy (Pilgrim) on Jun 09, 2011 at 17:17 UTC
    If there's many more numbers than intervals, iterating over the number is indeed a bad idea. But if I understand your demand correctly, the desired can be achieved by iterating over intervals (or rather, over their min and max values: 2..5 and 4..7 is the same as 2..7 and 4..5 regarding coverage).

    So you would have to create two arrays (e.g. @min and @max) holding respectively the lower and higher bounds of all intervals. Then:

    my $count = 0; foreach $min (@min) { $count++ if ($_[0] >= $min) } foreach $max (@max) { $count-- if ($_[0] > $max) } return $count;
    Would give the coverage of $_[0]. Of course, it can be optimized further by sorting @min and @max beforehand and using dichotomy, which would get you an O(ln(X)).
Re: searching a vector array
by BrowserUk (Pope) on Jun 09, 2011 at 19:12 UTC
    ,but how to do a multi intersection and to do it efficiently,

    Finding the intersection of multiple bit vectors is just a case of bitwise-anding them together:

    #! perl -slw use strict; our $V //= 20; our $L //= 72; my @vecs = map { pack 'C*', 0xf0, map rand( 256 ), 1 .. (($L-8) / 8)+1; } 1 .. $V; printf "%2d : %s\n", $_, unpack 'b*', $vecs[ $_ ] for 0 .. $#vecs; my $intersect = $vecs[ 0 ]; $intersect &= $vecs[ $_ ] for 1 .. $#vecs; printf "** : %s\n", unpack 'b*', $intersect; __END__ C:\test>908952 -L=64 -V=40 0 : 00001111101000110111011000000010111100010110001010101011010111010 +0011011 1 : 00001111110100001011011011110111100101000011000101101101000101000 +1110011 2 : 00001111001111110100011011101111000101010000111001011110011010110 +0001111 3 : 00001111001000110011101001010110001010000100010010011101110100011 +1101100 4 : 00001111101010011111111001111101000101100110100111101011100100101 +0010000 5 : 00001111111101111101100001000001101011100110011110001111100010010 +0001110 6 : 00001111110010001001011000101100100001101101001110111010100100110 +1100111 7 : 00001111100100010011101101001100011000111101110001111011010000110 +0010011 8 : 00001111011010110100010000011011000111100001010001000110010110100 +1100110 9 : 00001111000000101001110101010101000001110011101001101110001000101 +0101101 10 : 00001111010000111101010000001101110000001100111000100010110101110 +0000101 11 : 00001111110010001000011100110111011110000011111101100010100111100 +0011111 12 : 00001111100110111100010100010111110101110110110111100001100111111 +1100000 13 : 00001111100011111011010101001110111111101001011110001001000001010 +1111101 14 : 00001111111110111001011010110110101100111110011100101111101101000 +1010001 15 : 00001111100110101110001110010011101000110110010100001110110011101 +1111101 16 : 00001111111011110001110011001001011010000100000110111101001111111 +1110101 17 : 00001111001111111010001010001100010010010100100000101111001100111 +0011010 18 : 00001111011000100101001101001101100101001011010001011001000100100 +1011101 19 : 00001111110110100111001101000010101100011000000000100010001111001 +1011101 20 : 00001111010010011101111110100011111011100110101011100111111111111 +1100011 21 : 00001111011001101110110110100001111111101100011100111101011011011 +0011111 22 : 00001111111110110000011110110000011001011100101101010011001111010 +1110111 23 : 00001111111110001000010001011000010011100111101010001000011001001 +0101100 24 : 00001111010010000000111110001101000011010010000101110010111001101 +0000110 25 : 00001111100111000000010000101010101010111000111111010110011010001 +0011101 26 : 00001111001010011000010010011010111111111000111111000001110010110 +0110001 27 : 00001111011011011100101111011010011010010010100110110001110100100 +0111000 28 : 00001111010101110010000011001011111100010001011101000101000001100 +1101100 29 : 00001111100100010001000111010011010100101110101000110111010101101 +0111010 30 : 00001111101110101011011111001101010010001100010010010100001110011 +0000101 31 : 00001111011011000011101101101010001000100010000011101011111100010 +0010000 32 : 00001111110010010011000011110111100100001100011110111111000001110 +0010101 33 : 00001111100011101101101110000110001111011101110101001001100111111 +0001010 34 : 00001111011011000111000100100001100101111000100110010001010011110 +1000111 35 : 00001111111111010000010010100101101101110110100111101110101110010 +0011100 36 : 00001111101100010100001101001000010111000000011011111011001001010 +0000011 37 : 00001111001010001000111101001011110011000001001000011000100010111 +1100101 38 : 00001111001001000100110000111110001011011001011100011011101011101 +0110101 39 : 00001111000111101110011001010100101101001101110110111010111001110 +0101001 ** : 00001111000000000000000000000000000000000000000000000000000000000 +0000000

    And that's an O(N) (and very efficient O(N)) process. Or have I misread you?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: searching a vector array
by zek152 (Pilgrim) on Jun 09, 2011 at 16:30 UTC

    I am not quite clear if this fulfills the requirements but if all you need to do is to see if a single value is covered you could implement a binary search tree.

    The binary search tree could be arranged based on the starting location. Then if the point you are considering is less than or equal to the starting point of a subinterval you go down the left branch. If it is greater than the starting value you check to see if it is less (or equalto) than the ending value. If so then the point is covered. If not then you continue traversing the tree in a recursive manner. if you make it to a null link then the point is not covered.

    Hopefully this made sense. The search complexity would be O(log(#subintervals))

    interval (1,20] subintervals (1,3],(5,10],(11,12],(15,17] A (sort of) balanced tree would look like: (5,10] (1,3] (15,17] (11,12]

    Update: I didn't have time to finish the example. I also made a correction to the traversal algorithm (in italics)

    If you consider the number 4 with the above tree 4<=5 so you traverse down the left side. 4>1 && 4>3 so you take the right child the right child is null so 4 is not covered. Considering 12. 12>10 so go down right branch 12<=15 so go down left branch 12>11 and 12<=12 so 12 is covered.<

    I plan on working out an efficient algorithm for finding and removing subsets of subsets (removing (1,3] when (1,5] is a subset). I will post when complete.

      After doing some research I have found the notion of an Interval_tree. This seems like a great data structure for your problem.

Re: searching a vector array
by johngg (Canon) on Jun 09, 2011 at 23:42 UTC

    If you have your sub-intervals in an array of hashes (minimum and maximum values) sorted in ascending minimum value, you can just work your way through the interval from beginning to end. You can shift off sub-intervals once you have passed their maximum value so save wasted tests. This method will not take up much memory as you are not having to build large vectors, only store minimum and maximum values.

    #!/usr/bin/perl # use strict; use warnings; my $intervalStr = q{1..25}; my @subIntervalStrs = qw{ 3..12 7..14 11..19 12..17 }; my ( $begin, $end ) = split m{\.\.}, $intervalStr; my @subIntervals = sort { $a->{ min } <=> $b->{ min } } map { my ( $min, $max ) = split m{\.\.}, $_; { min => $min, max => $max }; } @subIntervalStrs; for my $value ( $begin .. $end ) { my $count = 0; shift @subIntervals while @subIntervals and $value > $subIntervals[ 0 ]->{ max }; for my $subInterval ( @subIntervals ) { if ( $value < $subInterval->{ min } ) { last; } elsif ( $value > $subInterval->{ max } ) { next; } else { $count ++; } } printf qq{%4d => %d\n}, $value, $count; }

    The output.

    1 => 0 2 => 0 3 => 1 4 => 1 5 => 1 6 => 1 7 => 2 8 => 2 9 => 2 10 => 2 11 => 3 12 => 4 13 => 3 14 => 3 15 => 2 16 => 2 17 => 2 18 => 1 19 => 1 20 => 0 21 => 0 22 => 0 23 => 0 24 => 0 25 => 0

    I hope I have understood what you are trying to achieve and that this is helpful.

    Update: Corrected a logic error that where the upper end of a sub-interval was not correctly detected.

    Cheers,

    JohnGG