Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Re^2: Range overlap (with a lot of ranges)

by BrowserUk (Pope)
on May 24, 2010 at 15:02 UTC ( #841401=note: print w/replies, xml ) Need Help??

in reply to Re: Range overlap (with a lot of ranges)
in thread Range overlap (with a lot of ranges)

Building a flat list of 10e6 spans requires 1.9 GB as an AoAs; or 1.1 GB as an array of strings. So trying to build a tree structure using Perl's built-in data structures, or any tree or graph module that uses them is likely to run you out of memory unless you are using 64-bit and have gobs of ram.

With a 2.5e9 range and 10e6 ranges, you're even more out of luck using a linear stick approach.

A potentially viable solution would be to use a hash of (partitioned) strings and then a split and simple search. Storing the same 10e6 ranges this way only requires 55 MB

my %r; my($i,$l,$g) = (0) x 3; for( 1 .. 10e6 ) { $r{ int( $i / 10e3 ) } .= qq[ $i -@{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ]} ]; $i+=$l + 25 + int( rand 25 ); };; print total_size \%r;; 55801653 print $r{ 1 };; 10972-12197 12230-13623 13659-15412 15445-17288 17319-18829 18870-2064 +5 print $r{ 100 };; 1001332-1002376 1002412-1003426 1003466-1005074 1005103-1006637 100667 +2-1007719 1007753-1009667 1009694-1011094 print $r{ 1000 };; 10000503-10002289 10002319-10003488 10003529-10004952 10004980-1000645 +9 10006492-10008122 10008165-10010163

As you can see, partitioning the dataset this way allows you to zero in on a very small subset of ranges by a simple integer division and hash lookup. After that, you have only half a dozen or so ranges to consider which should be fast enough that you don't need to do anything sophisticated.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: Range overlap (with a lot of ranges)
by BioLion (Curate) on May 24, 2010 at 15:16 UTC

    Thanks for the info - As you say narrowing the field quickly is the key bottleneck, and I think you approach is a good one - I might follow this approach to speed up the binary search. If I get the time to properly compare the methods, I will definitely include this idea. Cheers!

    Just a something something...

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://841401]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (5)
As of 2019-11-17 00:08 GMT
Find Nodes?
    Voting Booth?
    Strict and warnings: which comes first?

    Results (85 votes). Check out past polls.