http://www.perlmonks.org?node_id=841368

BioLion has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

This is a bioinformatics problem, but I think it is generally applicable, so I hope someone has some input:

The problem: Given a test range X, and N previously defined ranges, define which range (if any) X overlaps with.

Range overlap has been done to death already ( overlap between ranges, Stack overflow, Number::Range, etc... ) but I hoped someone would have some input on doing it on a large scale:

The Issue:

    My main problem is that I have a lot of pre-defined ranges (~10^6) and a lot of tests to carry out (>10^7), so speed is desirable...
    Because I have so many ranges to test against, storing them in memory requires keeping things Lite...

Potential Solutions:

    The ranges can all be sorted, and are fairly evenly distributed, so I could do some sort of 'guess then home in' algorithm.
    I could 'stringify' the ranges i.e.
    # range A : 1-3 # range B : 5-6 my $range_string = '-AAA-BB---';
    Then simply lookup which Range fits my test. Alternately a lookup table / DB approach could be used to the same ends.
    Alternate solutions welcomed...

So my question is whether anyone has come up against a similar problem, or tested any of these solutions? I think I will go with the 'lookup' method, but I wanted to check if anyone had any input. Obviously I would like to do things properly and benchmark proper comparisons, but i am a bit pushed for time.

Thanks in advance,

Update for those interested:

Turns out the 'giant array' idea is *very* memory hungry (please don't point out that this was obvious, we all live in hope...). So I am left with storing ordered ranges and doing a binary search (Thanks Marshall). I wanted to check on the various claims for memory usage made below, so here is what I came up with:

use warnings; use strict; use Devel::Size qw/total_size/; use Benchmark; ++$|; print ">>Strings:\n"; my $t0 = Benchmark->new; for (10e3, 10e4, 10e5,){ my %r; my($i,$l,$g) = (0) x 3; for( 1 .. $_ ) { $r{ int( $i / 10e3 ) } .= qq[ $i -@{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ] +} ]; $i+=$l + 25 + int( rand 25 ); } print "$_ : " . (total_size(\%r) ) . "\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "String took:",timestr($td),"\n"; print ">>AoA:\n"; $t0 = Benchmark->new; for (10e3, 10e4, 10e5,){ my %r; my($i,$l,$g) = (0) x 3; for( 1 .. $_ ) { push @{ $r{ int( $i / 10e3 ) } }, [ $i, @{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ]} +]; $i+=$l + 25 + int( rand 25 ); } print "$_ : " . (total_size(\%r) ) . "\n"; } $t1 = Benchmark->new; $td = timediff($t1, $t0); print "Array took:",timestr($td),"\n"; print system('perl -v');
Outputting :
>>Strings: 10000 : 274788 100000 : 2949952 1000000 : 32010214 String took: 5 wallclock secs ( 5.55 usr + 0.06 sys = 5.61 CPU) >>AoA: 10000 : 1597360 100000 : 15972002 1000000 : 160299622 Array took:23 wallclock secs (21.97 usr + 0.30 sys = 22.27 CPU) This is perl, v5.8.8 built for MSWin32-x86-multi-thread (with 18 registered patches, see perl -V for more detail) Binary build 822 [280952] provided by ActiveState http://www.ActiveSta +te.com Built Jul 31 2007 19:34:48 Haha. I am fully aware how old this is, but you try convincing my work + that..
Clearly the string method is a lot faster to pre-compute and more memory efficient, and will hopefully make up for the taime taken to subsequently parse the string. It also provides a simple 'jumping' mechanism for the binary search. SO big thanks to BrowserUk. THe method is also nicely 'tunable' in that you can make the hash bigger (use more memory, but reduce the lookup window) or vice versa depending on your needs.

I will post more once i also test the lookup method, and also salva's suggestions. Thanks all round - I may actually get this done today.

Just a something something...