Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number

Range overlap (with a lot of ranges)

by BioLion (Curate)
on May 24, 2010 at 11:22 UTC ( #841368=perlquestion: print w/replies, xml ) Need Help??
BioLion has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

This is a bioinformatics problem, but I think it is generally applicable, so I hope someone has some input:

The problem: Given a test range X, and N previously defined ranges, define which range (if any) X overlaps with.

Range overlap has been done to death already ( overlap between ranges, Stack overflow, Number::Range, etc... ) but I hoped someone would have some input on doing it on a large scale:

The Issue:

    My main problem is that I have a lot of pre-defined ranges (~10^6) and a lot of tests to carry out (>10^7), so speed is desirable...
    Because I have so many ranges to test against, storing them in memory requires keeping things Lite...

Potential Solutions:

    The ranges can all be sorted, and are fairly evenly distributed, so I could do some sort of 'guess then home in' algorithm.
    I could 'stringify' the ranges i.e.
    # range A : 1-3 # range B : 5-6 my $range_string = '-AAA-BB---';
    Then simply lookup which Range fits my test. Alternately a lookup table / DB approach could be used to the same ends.
    Alternate solutions welcomed...

So my question is whether anyone has come up against a similar problem, or tested any of these solutions? I think I will go with the 'lookup' method, but I wanted to check if anyone had any input. Obviously I would like to do things properly and benchmark proper comparisons, but i am a bit pushed for time.

Thanks in advance,

Update for those interested:

Turns out the 'giant array' idea is *very* memory hungry (please don't point out that this was obvious, we all live in hope...). So I am left with storing ordered ranges and doing a binary search (Thanks Marshall). I wanted to check on the various claims for memory usage made below, so here is what I came up with:

use warnings; use strict; use Devel::Size qw/total_size/; use Benchmark; ++$|; print ">>Strings:\n"; my $t0 = Benchmark->new; for (10e3, 10e4, 10e5,){ my %r; my($i,$l,$g) = (0) x 3; for( 1 .. $_ ) { $r{ int( $i / 10e3 ) } .= qq[ $i -@{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ] +} ]; $i+=$l + 25 + int( rand 25 ); } print "$_ : " . (total_size(\%r) ) . "\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "String took:",timestr($td),"\n"; print ">>AoA:\n"; $t0 = Benchmark->new; for (10e3, 10e4, 10e5,){ my %r; my($i,$l,$g) = (0) x 3; for( 1 .. $_ ) { push @{ $r{ int( $i / 10e3 ) } }, [ $i, @{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ]} +]; $i+=$l + 25 + int( rand 25 ); } print "$_ : " . (total_size(\%r) ) . "\n"; } $t1 = Benchmark->new; $td = timediff($t1, $t0); print "Array took:",timestr($td),"\n"; print system('perl -v');
Outputting :
>>Strings: 10000 : 274788 100000 : 2949952 1000000 : 32010214 String took: 5 wallclock secs ( 5.55 usr + 0.06 sys = 5.61 CPU) >>AoA: 10000 : 1597360 100000 : 15972002 1000000 : 160299622 Array took:23 wallclock secs (21.97 usr + 0.30 sys = 22.27 CPU) This is perl, v5.8.8 built for MSWin32-x86-multi-thread (with 18 registered patches, see perl -V for more detail) Binary build 822 [280952] provided by ActiveState http://www.ActiveSta Built Jul 31 2007 19:34:48 Haha. I am fully aware how old this is, but you try convincing my work + that..
Clearly the string method is a lot faster to pre-compute and more memory efficient, and will hopefully make up for the taime taken to subsequently parse the string. It also provides a simple 'jumping' mechanism for the binary search. SO big thanks to BrowserUk. THe method is also nicely 'tunable' in that you can make the hash bigger (use more memory, but reduce the lookup window) or vice versa depending on your needs.

I will post more once i also test the lookup method, and also salva's suggestions. Thanks all round - I may actually get this done today.

Just a something something...

Replies are listed 'Best First'.
Re: Range overlap (with a lot of ranges)
by moritz (Cardinal) on May 24, 2010 at 11:30 UTC

    I haven't read the references you posted earlier, so please forgive me if that one was already suggested.

    You could store for each location an array of ranges that contain it:

    # range A : 1-3 # range B : 5-6 # leads to my @ranges = [ [], # 0 ['1-3'], # 1 ['1-3'], # 2 ['1-3'], # 3 [], # 4 ['5-6'], # 5 ['5-6'], # 6 ];

    Once this table is built, for every range it's very fast to test which other ranges overlap.

    To save memory, you can also use a string in each location, so that if you have ranges 1-5 and 4-6, the entry for for 5 would be 1-5,4-6.

    One million strings should fit into memory. If the array is rather sparse, you could also use a hash instead.

    Perl 6 - links to (nearly) everything that is Perl 6.

      I think this is similar to my 'stringify' idea (I intended to use substr to pull out each ID), and one I did try (using Number::Range objects, to populate the array). This proved too heavyweight, but your simple approach might be a winner. Thanks for the advice.

      Just a something something...
Re: Range overlap (with a lot of ranges)
by salva (Abbot) on May 24, 2010 at 14:26 UTC
    You can look up all the test points in parallel.

    Some code showing how to do it for a simplified version of your problem where the ranges cover all the domain (there are no gaps):

    #!/usr/bin/perl use strict; use warnings; my @ranges = 0; push @ranges, $ranges[-1] + 1 + int rand 200 for 1..10000; my @tests = map int rand $ranges[-1], 0..1000000; match (\@ranges, \@tests); sub div { my ($border, $tests) = @_; my ($lt, $ge) = ([], []); push @{$_ < $border ? $lt : $ge}, $_ for @$tests; ($lt, $ge); } sub match { my ($ranges, $tests) = @_; if (@$ranges == 1) { if (@$tests) { print "tests in range $ranges->[0]:\n", join(", ", @$tests +), "\n"; } else { print "range $ranges->[0] is empty\n"; } } else { my $pivot = int((@$ranges + 1)/ 2); my ($lt, $ge) = div($ranges->[$pivot], $tests); match([@{$ranges}[0..$pivot-1]], $lt); match([@{$ranges}[$pivot..$#$ranges]], $ge); } }
    And a version optimized to not copy data around but to use indexes inside the main arrays:
    #!/usr/bin/perl use strict; use warnings; my @ranges = 0; push @ranges, $ranges[-1] + 1 + int rand 200 for 1..1_000_000; my @tests = map int rand $ranges[-1], 0..10_000_000; match(0, scalar @ranges, 0, scalar @tests); sub div { my ($border, $start, $end) = @_; while ($end > $start) { if ($tests[$start] < $border) { $start++; } else { if (--$end > $start) { @tests[$start, $end] = @tests[$end, $start]; } } } $end; } sub match { my ($r_start, $r_end, $t_start, $t_end) = @_; if ($r_end - $r_start <= 1) { if ($t_end - $t_start > 0) { print "tests in range $ranges[$r_start]:\n", join(", ", @t +ests[$t_start..$t_end-1]), "\n"; } else { print "range $ranges[$r_start] is empty\n"; } } else { my $r_pivot = int(($r_start + $r_end)/ 2); my $t_pivot = div($ranges[$r_pivot], $t_start, $t_end); match($r_start, $r_pivot - 1, $t_start, $t_pivot - 1); match($r_pivot, $r_end, $t_pivot, $t_end); } }
    This algorithm, for N ranges and M test points where M > N has complexity O(MlogN) and does not require extra memory besides that used to store the ranges and the test points.

    On my not so new computer, with the print sentences commented out, it runs in a couple of minutes.

      I'll have to spend some time looking through this, but it looks like another very viable option. Thanks for your input!

      Just a something something...
Re: Range overlap (with a lot of ranges)
by Marshall (Abbot) on May 24, 2010 at 12:19 UTC
    This looks like a similar problem to one of my posts, Re: match with elements in array. Recently, I remember running some benchmarks with 100 million hash keys. Takes about 90 seconds on my not so fast machine to create a hash that big and about 25 seconds on a more up-to-date machine.

    If the hash table will fit into memory without paging, running 10^7 queries will be real fast once you have the table built. For the simplest approach to work, the number of unique values is what matters. Of course some encoding of the ranges is possible (problem above just had non-overlapping ranges, but hash value could be an array of range ID #'s or a single ID that represents multiple ranges.). Curious, how many unique values do you have that fit within one or more of the valid ranges?

Re: Range overlap (with a lot of ranges)
by BrowserUk (Pope) on May 24, 2010 at 12:33 UTC

    Are, as implied, your predefined ranges integer bounded and non-overlapping? What is the overall range 0 .. max?

Re: Range overlap (with a lot of ranges)
by BioLion (Curate) on May 24, 2010 at 12:53 UTC

    BrowserUk and Marshall - The ranges I am testing for overlap against are integer bounded, non-overlapping, dense (very few gaps) and span around 2.5 x 10^9 values, each range covering around 2000 elements. The ranges I am testing against this are smaller (50 - 100 elements). One simplification I thought of would be to condense this 50 span range to a single integer value, or to 2 integers (start/end). Hope this helps.

    Just a something something...

      As "both" of your ranges are contingous, I think you can get away by testing whether the start and end points are "close enough" to a start/end point of a range. I would look at the "inversion lists" that the Unicode consortium proposes as an efficient way to implement (tests against) Unicode character ranges.

      Some features that inversion lists provide, like fast negation, are not useful to you, so maybe you can get by with even less than what they provide. Maybe even just storing pairs (offset, length) for all your ranges and then doing a binary search for the range at hand already is enough. Maybe there even is a better data structure that allows you to simulatneously track start and end of your range and find the intervals they fall into, but I'm not aware of that.

        Cool, thanks - I'll follow this up and post anything I find useful...

        Just a something something...

      Building a flat list of 10e6 spans requires 1.9 GB as an AoAs; or 1.1 GB as an array of strings. So trying to build a tree structure using Perl's built-in data structures, or any tree or graph module that uses them is likely to run you out of memory unless you are using 64-bit and have gobs of ram.

      With a 2.5e9 range and 10e6 ranges, you're even more out of luck using a linear stick approach.

      A potentially viable solution would be to use a hash of (partitioned) strings and then a split and simple search. Storing the same 10e6 ranges this way only requires 55 MB

      my %r; my($i,$l,$g) = (0) x 3; for( 1 .. 10e6 ) { $r{ int( $i / 10e3 ) } .= qq[ $i -@{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ]} ]; $i+=$l + 25 + int( rand 25 ); };; print total_size \%r;; 55801653 print $r{ 1 };; 10972-12197 12230-13623 13659-15412 15445-17288 17319-18829 18870-2064 +5 print $r{ 100 };; 1001332-1002376 1002412-1003426 1003466-1005074 1005103-1006637 100667 +2-1007719 1007753-1009667 1009694-1011094 print $r{ 1000 };; 10000503-10002289 10002319-10003488 10003529-10004952 10004980-1000645 +9 10006492-10008122 10008165-10010163

      As you can see, partitioning the dataset this way allows you to zero in on a very small subset of ranges by a simple integer division and hash lookup. After that, you have only half a dozen or so ranges to consider which should be fast enough that you don't need to do anything sophisticated.

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Thanks for the info - As you say narrowing the field quickly is the key bottleneck, and I think you approach is a good one - I might follow this approach to speed up the binary search. If I get the time to properly compare the methods, I will definitely include this idea. Cheers!

        Just a something something...
      So, it appears to me that to represent the ranges to be tested against, you could make an array based structure (AoA) where first element is the start number of a range and 2nd element is size or end of that range.

      To query this structure, do binary search based upon start of range (wind up at same point as if you were "inserting" something into this array before the range), then look forward in the array until it is no longer possible for the sought for range to be in the current range of possible ranges (your start is > start of current 1st element in this "row". So binary search and some further number of comparisons to decide that it is no longer possible to match. This assumes that the ranges can be sorted by "start of range", which it appears that they can be.

      so you will have 1 million pointers to array (AoA), each "row" has 2 elements, start and end of that range.

        you could make an array based structure (AoA) where first element is the start number of a range and 2nd element is size or end of that range

        That's not very inefficient as it would use a lot of memory unnecessarily. It's better to just use a pair of arrays (@start, @end):

        DB<2> use Devel::Size 'total_size' DB<3> total_size([map [rand, rand], 1..1000]) DB<4> use Devel::Size 'total_size' DB<5> x total_size([map [rand, rand], 1..1000]) 0 224176 DB<6> x total_size([[map rand, 1..1000], [map rand, 1..1000]]) 0 80464

        So far a simple lookup based on a sorted array (each element storing the id of the range that covers it) hasn't thrashed my computer, but I am still testing. If this fails I will try a binary search (what I meant by 'guess and home in'... no CS background...!). Thanks again.

        Just a something something...
Re: Range overlap (with a lot of ranges)
by Marshall (Abbot) on May 25, 2010 at 02:46 UTC
    I ran your benchmark code:

    >>Strings: String took: 8 wallclock secs ( 8.06 usr + 0.03 sys = 8.09 CPU) >>AoA: Array took: 8 wallclock secs ( 8.33 usr + 0.13 sys = 8.45 CPU) This is perl, v5.10.1 built for MSWin32-x86-multi-thread (with 2 registered patches, see perl -V for more detail)
    It is not clear that your benchmarking code represents what the application would really do.
Re: Range overlap (with a lot of ranges)
by happy.barney (Pilgrim) on May 25, 2010 at 10:36 UTC
    Do you require to process tests one-by-one or all at once? For all at once, you can store ranges and tests in sorted files and than use idea stolen from merge sort - read range/test when current range/test is out of scope.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://841368]
Approved by moritz
Front-paged by planetscape
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (6)
As of 2017-06-28 05:42 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (624 votes). Check out past polls.