
Re: Range overlap (with a lot of ranges)

by BioLion (Curate)
on May 24, 2010 at 12:53 UTC ( #841383 )

in reply to Range overlap (with a lot of ranges)

BrowserUk and Marshall - The ranges I am testing for overlap against are integer-bounded, non-overlapping, dense (very few gaps) and span around 2.5 x 10^9 values, each range covering around 2000 elements. The ranges I am testing against this set are smaller (50 - 100 elements). One simplification I thought of would be to condense such a 50-element range to a single integer value, or to 2 integers (start/end). Hope this helps.

Just a something something...

Re^2: Range overlap (with a lot of ranges)
by Corion (Pope) on May 24, 2010 at 12:59 UTC

    As "both" of your ranges are contiguous, I think you can get away with testing whether the start and end points are "close enough" to a start/end point of a range. I would look at the "inversion lists" that the Unicode consortium proposes as an efficient way to implement (tests against) Unicode character ranges.

    Some features that inversion lists provide, like fast negation, are not useful to you, so maybe you can get by with even less than what they provide. Maybe even just storing pairs (offset, length) for all your ranges and then doing a binary search for the range at hand is already enough. Maybe there is even a better data structure that allows you to simultaneously track the start and end of your range and find the intervals they fall into, but I'm not aware of one.
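    The inversion-list idea mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the thread: the list stores the sorted boundaries at which set membership flips, and a point is inside the set exactly when a binary search over the boundaries counts an odd number of boundaries at or below it. The names (@inv, in_set) are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Boundaries where membership flips: each even-indexed value starts a
# range, the following odd-indexed value is the first value *outside* it.
# This @inv encodes the ranges [10,20], [25,40], [50,60].
my @inv = ( 10, 21, 25, 41, 50, 61 );

sub in_set {
    my ( $inv, $x ) = @_;
    my ( $lo, $hi ) = ( 0, scalar @$inv );
    while ( $lo < $hi ) {              # count boundaries <= $x
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $inv->[$mid] <= $x ) { $lo = $mid + 1 }
        else                      { $hi = $mid }
    }
    return $lo % 2;                    # odd count => inside a range
}

print in_set( \@inv, 15 ) ? "inside\n" : "outside\n";   # 15 is in [10,20]
print in_set( \@inv, 22 ) ? "inside\n" : "outside\n";   # 22 falls in a gap
```

    One nice property of this layout is the fast negation Corion mentions: complementing the set is just adding or removing a boundary at each end of the list.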

      Cool, thanks - I'll follow this up and post anything I find useful...

      Just a something something...
Re^2: Range overlap (with a lot of ranges)
by BrowserUk (Pope) on May 24, 2010 at 15:02 UTC

    Building a flat list of 10e6 spans requires 1.9 GB as an AoA, or 1.1 GB as an array of strings. So trying to build a tree structure using Perl's built-in data structures, or any tree or graph module that uses them, is likely to run you out of memory unless you are on 64-bit Perl and have gobs of RAM.

    With a 2.5e9 range and 10e6 ranges, you're even more out of luck using a linear stick approach.

    A potentially viable solution would be to use a hash of (partitioned) strings and then a split and a simple search. Storing the same 10e6 ranges this way requires only 55 MB:

    my %r;
    my( $i, $l, $g ) = ( 0 ) x 3;
    for ( 1 .. 10e6 ) {
        $r{ int( $i / 10e3 ) } .= qq[ $i-@{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ]} ];
        $i += $l + 25 + int( rand 25 );
    };;

    print total_size \%r;;
    55801653

    print $r{ 1 };;
    10972-12197 12230-13623 13659-15412 15445-17288 17319-18829 18870-20645

    print $r{ 100 };;
    1001332-1002376 1002412-1003426 1003466-1005074 1005103-1006637 1006672-1007719 1007753-1009667 1009694-1011094

    print $r{ 1000 };;
    10000503-10002289 10002319-10003488 10003529-10004952 10004980-10006459 10006492-10008122 10008165-10010163

    As you can see, partitioning the dataset this way allows you to zero in on a very small subset of ranges by a simple integer division and hash lookup. After that, you have only half a dozen or so ranges to consider which should be fast enough that you don't need to do anything sophisticated.
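    The lookup side of this scheme can be sketched as follows. This is an illustrative reconstruction, not BrowserUk's code: given a query range, compute the bucket(s) its endpoints fall into by the same integer division, split each bucket's string back into "start-end" tokens, and test each for overlap. The bucket width and the names (%r, overlapping) follow the snippet above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $width = 10e3;    # same partition width as used when building %r

# A tiny stand-in for the big hash built above (bucket 1 only).
my %r = ( 1 => ' 10972-12197 12230-13623 13659-15412 15445-17288 ' );

sub overlapping {
    my ( $r, $qs, $qe ) = @_;
    my @hits;
    for my $bucket ( int( $qs / $width ) .. int( $qe / $width ) ) {
        next unless exists $r->{$bucket};
        for my $span ( split ' ', $r->{$bucket} ) {
            my ( $s, $e ) = split '-', $span;
            push @hits, $span if $s <= $qe && $e >= $qs;   # standard overlap test
        }
    }
    return @hits;
}

print join( ' ', overlapping( \%r, 12000, 13700 ) ), "\n";
# finds 10972-12197, 12230-13623 and 13659-15412
```

    Note that each range is stored only under the bucket of its start, so a query whose window begins just past a bucket boundary should also check the preceding bucket for a range spilling across it.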

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thanks for the info. As you say, narrowing the field quickly is the key bottleneck, and I think your approach is a good one - I might use it to speed up the binary search. If I get the time to properly compare the methods, I will definitely include this idea. Cheers!

      Just a something something...
Re^2: Range overlap (with a lot of ranges)
by Marshall (Abbot) on May 24, 2010 at 13:16 UTC
    So, it appears to me that, to represent the ranges to be tested against, you could make an array-based structure (AoA) where the first element is the start of a range and the second element is the size or end of that range.

    To query this structure, do a binary search based upon the start of the range (you wind up at the same point as if you were "inserting" something into this array just before that range), then look forward in the array until it is no longer possible for the sought-for range to overlap any of the remaining ranges (your start is > the start of the current "row"). So: a binary search plus some further number of comparisons to decide that a match is no longer possible. This assumes that the ranges can be sorted by start of range, which it appears they can be.

    So you will have 1 million references to arrays (an AoA), each "row" holding two elements: the start and end of that range.
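    A minimal sketch of that binary-search-then-scan query, assuming the AoA is sorted by range start and the ranges are non-overlapping (the names @ranges and find_overlaps are illustrative, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sorted, non-overlapping [start, end] rows, as Marshall describes.
my @ranges = ( [10, 20], [25, 40], [50, 60], [70, 95] );

sub find_overlaps {
    my ( $ranges, $qs, $qe ) = @_;
    my ( $lo, $hi ) = ( 0, $#$ranges );
    while ( $lo < $hi ) {              # first row whose end >= query start
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $ranges->[$mid][1] < $qs ) { $lo = $mid + 1 }
        else                            { $hi = $mid }
    }
    my @hits;                          # scan forward until rows start past $qe
    while ( $lo <= $#$ranges && $ranges->[$lo][0] <= $qe ) {
        push @hits, $ranges->[$lo] if $ranges->[$lo][1] >= $qs;
        $lo++;
    }
    return @hits;
}

print "@$_\n" for find_overlaps( \@ranges, 18, 30 );
# the query [18,30] overlaps [10,20] and [25,40]
```

    The forward scan terminates as soon as a row starts beyond the query's end, which is the "no longer possible to match" test described above.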

      you could make an array based structure (AoA) where first element is the start number of a range and 2nd element is size or end of that range

      That's not very efficient, as it would use a lot of memory unnecessarily. It's better to just use a pair of arrays (@start, @end):

      DB<2> use Devel::Size 'total_size'
      DB<5> x total_size([map [rand, rand], 1..1000])
      0  224176
      DB<6> x total_size([[map rand, 1..1000], [map rand, 1..1000]])
      0  80464
        Well, of course two @vars will take less memory than an AoA, because the AoA has at least one additional array: the array of references to the inner arrays!

        So why does this make sense?

        One basic reason why data is collected into combined structures like an array of references to arrays is so that it is clear what data "goes with" other data. This is a huge thing in software engineering!

        Another reason is that when we, say, sort a complex data structure that involves references, we just adjust the reference values in a single array instead of copying the actual data or modifying multiple data structures. We can do manipulations on a single "thing" - again, a huge plus.

        I suggested an AoA, but actually the most analogous thing to a C 2-D array uses even more memory: a Perl AoH, not a Perl AoA!

        I suggested the AoA instead of an AoH because the OP is "trying to get the job done" and there isn't going to be a huge amount of code - I think this could be just one page of Perl. I don't see a need to "super optimize" this.

        Normally, the rough priority of coding in my mind is: 1) clarity, 2) clarity, 3) speed, 4) memory usage. There can of course be reasons, even very good reasons, to violate this priority scheme. Speed and memory usage actually get better (become less important) according to Moore's Law; the clarity of the code is fixed when you write it, and it won't get any clearer even if CPU MIPS and memory prices improve dramatically. If the code will run easily within the memory budget (in this case, yes) and the main objective is clear code that works (in this case, yes), then optimizing the memory usage is not the right thing to do.

      So far a simple lookup based on a sorted array (each element storing the id of the range that covers it) hasn't thrashed my computer, but I am still testing. If this fails I will try a binary search (what I meant by 'guess and home in'... no CS background...!). Thanks again.

      Just a something something...
