Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

Hi Monks,

This is a bioinformatics problem, but I think it is generally applicable, so I hope someone has some input:

The problem: Given a test range X, and N previously defined ranges, define which range (if any) X overlaps with.

Range overlap has been done to death already ( overlap between ranges, Stack overflow, Number::Range, etc... ) but I hoped someone would have some input on doing it on a large scale:

The Issue:

    My main problem is that I have a lot of pre-defined ranges (~10^6) and a lot of tests to carry out (>10^7), so speed is desirable...
    Because I have so many ranges to test against, storing them in memory requires keeping things Lite...

Potential Solutions:

    The ranges can all be sorted, and are fairly evenly distributed, so I could do some sort of 'guess then home in' algorithm.
    I could 'stringify' the ranges i.e.
    # range A : 1-3 # range B : 5-6 my $range_string = '-AAA-BB---';
    Then simply lookup which Range fits my test. Alternately a lookup table / DB approach could be used to the same ends.
    Alternate solutions welcomed...

So my question is whether anyone has come up against a similar problem, or tested any of these solutions? I think I will go with the 'lookup' method, but I wanted to check if anyone had any input. Obviously I would like to do things properly and benchmark proper comparisons, but i am a bit pushed for time.

Thanks in advance,

Update for those interested:

Turns out the 'giant array' idea is *very* memory hungry (please don't point out that this was obvious, we all live in hope...). So I am left with storing ordered ranges and doing a binary search (Thanks Marshall). I wanted to check on the various claims for memory usage made below, so here is what I came up with:

use warnings; use strict; use Devel::Size qw/total_size/; use Benchmark; ++$|; print ">>Strings:\n"; my $t0 = Benchmark->new; for (10e3, 10e4, 10e5,){ my %r; my($i,$l,$g) = (0) x 3; for( 1 .. $_ ) { $r{ int( $i / 10e3 ) } .= qq[ $i -@{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ] +} ]; $i+=$l + 25 + int( rand 25 ); } print "$_ : " . (total_size(\%r) ) . "\n"; } my $t1 = Benchmark->new; my $td = timediff($t1, $t0); print "String took:",timestr($td),"\n"; print ">>AoA:\n"; $t0 = Benchmark->new; for (10e3, 10e4, 10e5,){ my %r; my($i,$l,$g) = (0) x 3; for( 1 .. $_ ) { push @{ $r{ int( $i / 10e3 ) } }, [ $i, @{[ $i + ( $l = 1500 + int( -500 + rand 1000 ) ) ]} +]; $i+=$l + 25 + int( rand 25 ); } print "$_ : " . (total_size(\%r) ) . "\n"; } $t1 = Benchmark->new; $td = timediff($t1, $t0); print "Array took:",timestr($td),"\n"; print system('perl -v');
Outputting :
>>Strings: 10000 : 274788 100000 : 2949952 1000000 : 32010214 String took: 5 wallclock secs ( 5.55 usr + 0.06 sys = 5.61 CPU) >>AoA: 10000 : 1597360 100000 : 15972002 1000000 : 160299622 Array took:23 wallclock secs (21.97 usr + 0.30 sys = 22.27 CPU) This is perl, v5.8.8 built for MSWin32-x86-multi-thread (with 18 registered patches, see perl -V for more detail) Binary build 822 [280952] provided by ActiveState http://www.ActiveSta +te.com Built Jul 31 2007 19:34:48 Haha. I am fully aware how old this is, but you try convincing my work + that..
Clearly the string method is a lot faster to pre-compute and more memory efficient, and will hopefully make up for the taime taken to subsequently parse the string. It also provides a simple 'jumping' mechanism for the binary search. SO big thanks to BrowserUk. THe method is also nicely 'tunable' in that you can make the hash bigger (use more memory, but reduce the lookup window) or vice versa depending on your needs.

I will post more once i also test the lookup method, and also salva's suggestions. Thanks all round - I may actually get this done today.

Just a something something...

In reply to Range overlap (with a lot of ranges) by BioLion

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others pondering the Monastery: (8)
    As of 2019-11-18 17:09 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      Strict and warnings: which comes first?



      Results (90 votes). Check out past polls.

      Notices?