Reducing memory footprint when doing a lookup of millions of coordinates

by richardwfrancis (Beadle)
on Feb 27, 2011 at 09:56 UTC

richardwfrancis has asked for the wisdom of the Perl Monks concerning the following question:

Dear Brothers,

I have a rather large file (640MB) of coordinates that each show the position (start,end) of a particular feature on a particular chromosome. Something like:

chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2

Given another set of coordinates and a chromosome number I want to be able to quickly see if there are any features that the coordinates overlap. For example, given:

chromosome = chr2
start      = 160
end        = 210

I'd like the result:

chr2 150 200 feature1

Importantly I need to do a large number of these types of lookups in one go (~1000 or so).

So far I have the following. This works, and I'm quite happy with the performance, but it takes up 1.6GB of memory when I read in the full 640MB file, which contains over 5 million lines of coordinates.

If it helps and/or you really want to you can get this file from HERE but you'll have to change the column numbers to pull out the right fields!

#!/usr/bin/perl
use strict;

my %reps = ();
while (my $line = <DATA>) {
    $line =~ s/[\n\r]//g;
    my @array = split(/\s+/, $line);
    $reps{$array[0]}{$array[1]}{'end'} = $array[2];
    $reps{$array[0]}{$array[1]}{'rep'} = $array[3];
}

my $start = 160;
my $end   = 210;
my $chr   = "chr2";

for my $s (sort { $a <=> $b } keys %{$reps{$chr}}) {
    if ($start <= $reps{$chr}{$s}{'end'}) {
        last if $s >= $end;
        print "$chr $s $reps{$chr}{$s}{'end'} $reps{$chr}{$s}{'rep'}\n";
    }
}

__DATA__
chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2

My questions are:
a) Is there a better way to do this?
b) Can the memory footprint be reduced any?

Many, many thanks for any advice.

Rich

Replies are listed 'Best First'.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by GrandFather (Saint) on Feb 27, 2011 at 10:29 UTC

    You could turn the problem inside out: load the test values into memory then scan the large reference file one line at a time to perform the matching:

#!/usr/bin/perl
use strict;

my $reps = <<REPS;
chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2
REPS

my %tests;
while (my $line = <DATA>) {
    $line =~ s/[\n\r]//g;
    my @array = split /\s+/, $line;
    $tests{$array[0]}{$array[1]}{'end'} = $array[2];
    $tests{$array[0]}{$array[1]}{'rep'} = $array[3];
}

open my $repIn, '<', \$reps;
while (<$repIn>) {
    my ($chr, $start, $end, $rep) = split ' ';
    next if !exists $tests{$chr};
    for my $s (keys %{$tests{$chr}}) {
        if ($start <= $tests{$chr}{$s}{'end'}) {
            last if $s >= $end;
            print "$chr $start $end $rep\n";
        }
    }
}

__DATA__
chr2 160 210
    True laziness is hard work
      Hi GrandFather,

      You won't believe me, but I thought about this after I posted; I just haven't tested it yet! If the database idea proves problematic, I think this is the way to go.

      Many thanks for your help and the code to help me out.

      Rich
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by moritz (Cardinal) on Feb 27, 2011 at 10:16 UTC

    One improvement that should be straightforward to implement is to group everything by chromosome first.

    So, for example, read only those records from the file that relate to chr1. Build your %reps hash from them. Then search for all the coordinates on chr1 that you find interesting.

    Then empty the hash with %reps = () to save memory, and read all chr2 coordinates.
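    A minimal sketch of that one-chromosome-at-a-time loop (the file name and chromosome list here are assumptions for illustration; the cost is one pass over the file per chromosome):

#!/usr/bin/perl
use strict;
use warnings;

my $file = 'coords.txt';              # assumed input file name

for my $chr (qw(chr1 chr2 chr3)) {    # assumed chromosome list
    my %reps;
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        my ($c, $start, $end, $rep) = split ' ', $line;
        next unless $c eq $chr;       # keep only the current chromosome
        $reps{$start} = { end => $end, rep => $rep };
    }
    close $fh;

    # ... run all the queries for $chr against %reps here ...

    %reps = ();                       # release the memory before the next chromosome
}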

    Other possible improvements: instead of using a hash for the coordinates, use two arrays (for begin and end), and then do a binary search in them.
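    A sketch of that two-array idea for a single chromosome, assuming the intervals are sorted by start and non-overlapping (as in the sample data), so the ends are sorted too and a binary search on the ends finds the first candidate:

use strict;
use warnings;

# Parallel, start-sorted arrays for one chromosome (chr2 from the sample
# data). With non-overlapping intervals the ends are sorted as well.
my @starts = (150, 280);
my @ends   = (200, 350);
my @reps   = ('feature1', 'feature1');

sub overlaps {
    my ($qstart, $qend) = @_;

    # binary search: index of the first interval whose end is >= $qstart
    my ($lo, $hi) = (0, scalar @ends);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($ends[$mid] < $qstart) { $lo = $mid + 1 }
        else                       { $hi = $mid }
    }

    # collect from there while intervals still start before the query ends
    my @hits;
    for (my $i = $lo; $i < @starts && $starts[$i] <= $qend; $i++) {
        push @hits, [ $starts[$i], $ends[$i], $reps[$i] ];
    }
    return @hits;
}

print "chr2 @$_\n" for overlaps(160, 210);   # chr2 150 200 feature1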

    Of course, a relational database like SQLite or Postgres can do all of these things for you if you build appropriate indexes over the columns.
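    For example, a minimal DBD::SQLite sketch (the file, table and column names are assumptions for illustration; the composite index does the narrowing by chromosome):

use strict;
use warnings;
use DBI;

# assumed database file name
my $dbh = DBI->connect('dbi:SQLite:dbname=features.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS features (
        chromosome TEXT, starting INTEGER, ending INTEGER, feature TEXT
    )
});
$dbh->do(q{
    CREATE INDEX IF NOT EXISTS idx_chr_start
        ON features (chromosome, starting)
});

# two intervals overlap iff each one starts before the other ends
my $sth = $dbh->prepare(q{
    SELECT chromosome, starting, ending, feature
      FROM features
     WHERE chromosome = ? AND starting <= ? AND ending >= ?
});
$sth->execute('chr2', 210, 160);    # chromosome, query end, query start
while (my @row = $sth->fetchrow_array) {
    print "@row\n";
}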

Re: Reducing memory footprint when doing a lookup of millions of coordinates
by BrowserUk (Patriarch) on Feb 27, 2011 at 10:31 UTC

    First, a question. Is it possible to have two (or more) features that start at the same position in the same chromosome, but end at different positions?

    If so, your current data structure will only record the last one read from the file.

    Assuming that's okay, then moving from using a HoHoHs to a HoHoAs:

#!/usr/bin/perl -slw
use strict;

use constant { END => 0, REP => 1 };

my %reps;
while (<>) {
    chomp;
    my @array = split;
    $reps{ $array[0] }{ $array[1] } = [ $array[2], $array[3] ];
}

my $start = 160;
my $end   = 210;
my $chr   = "chr2";

for my $s ( sort { $a <=> $b } keys %{ $reps{ $chr } } ) {
    if ( $start <= $reps{ $chr }{ $s }[ END ] ) {
        last if $s >= $end;
        print "$chr $s $reps{ $chr }{ $s }[ END ] $reps{ $chr }{ $s }[ REP ]\n";
    }
}

    That will likely save you ~25% of your memory usage and run a little faster.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Cheers BrowserUK,

      I'll definitely give that a go.

      Out of interest, is there an advantage to using the constant for END and REP rather than 0 and 1? I've not used that before.

      Many thanks for your help

      Rich
        is there an advantage to using the constant for END and REP rather than 0 and 1?

        Beyond a little extra clarity, no. I did it to make the two versions visibly comparable.

        There might be some extra memory savings to be had if you could give a clearer idea of the numbers involved.

        I.e. how many chromosomes? And what are the approximate maximum lengths of both the chromosomes and the ranges?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      By the way, in answer to your question: in this case that shouldn't happen, but given the size of the data you're right that it's better to be safe than sorry.

      Rich
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by mellon85 (Monk) on Feb 27, 2011 at 10:24 UTC

    Have you considered using a database instead of loading it all into memory? It may not be as fast as it is now (except for the loading time), but the memory footprint, IMHO, can be heavily reduced.

      Hi mellon85 and moritz,

      Thank you for your replies.

      I definitely like the idea of the database. I think this will work very well in the context of the rest of the program I'm writing.

      I'm thinking I could distribute the SQLite database file, preloaded with the feature data and indexed, with the software and query it when needed!

      Many thanks,

      Rich

        The database is a really good idea. I doubt that SQLite keeps the whole database file in memory itself. You could even put the file on a RAM disk or similar as a speed bonus.

        You can still use the same code to do the lookups (with minor modifications); a sketch follows the list:

        1. Get a unique list of all chr.
        2. For each chr, retrieve all entries (using retrieve_hashref).
        3. Use the existing code (but work on hashrefs).
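        A rough sketch of that flow, assuming a hypothetical features.db with columns (chromosome, starting, ending, feature); those names are illustrative, not part of the thread:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=features.db', '', '',
                       { RaiseError => 1 });

# 1. get a unique list of all chromosomes
my $chrs = $dbh->selectcol_arrayref(
    'SELECT DISTINCT chromosome FROM features');

my $sth = $dbh->prepare(
    'SELECT starting, ending, feature FROM features WHERE chromosome = ?');

for my $chr (@$chrs) {
    # 2. retrieve all entries for this chromosome as hashrefs
    $sth->execute($chr);
    while (my $row = $sth->fetchrow_hashref) {
        # 3. plug $row->{starting}, $row->{ending} and $row->{feature}
        #    into the existing lookup code here
    }
}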
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by JavaFan (Canon) on Feb 27, 2011 at 14:39 UTC
    a) Is there a better way to do this?
    b) Can the memory footprint be reduced any?
    Two very different questions. To answer the second question first: an easy way to reduce the memory footprint is, for each query, to read the file line by line and report any overlaps. I bet you are now saying "but that's too slow". With many problems, there's a trade-off between memory usage and processing time. Reduce the memory usage, and the processing time goes up. Just asking to "reduce the memory usage" without saying anything about processing time may not get the answer you are looking for.

    As for your first question, it depends. What is "better" in your opinion? *My* opinion of better is to reduce query time and invest in memory and preprocessing time - build a segment or interval tree and do queries against that. But that will increase your memory usage, so it's probably not better for you.

Re: Reducing memory footprint when doing a lookup of millions of coordinates
by umasuresh (Hermit) on Feb 27, 2011 at 14:10 UTC
    Hi Rich,
    Take a look at my experience with developing an algorithm to deal with something similar: Re: Algorithms.
    HTH! Uma
    UPDATE:
    In this task, I did exactly what moritz suggests: I grouped the search chromosome-wise and submitted 24 jobs to a cluster!
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by nikosv (Deacon) on Feb 27, 2011 at 16:29 UTC

    Maybe this is a case for Memoization

    In computing, memoization is an optimization technique used primarily to speed up computer programs by having function calls avoid repeating the calculation of results for previously-processed inputs

    The Memoize module makes it easy to use by wrapping around your function.
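    For example (find_overlaps here is a hypothetical expensive lookup; Memoize ships with core Perl):

use strict;
use warnings;
use Memoize;

# hypothetical expensive lookup to be cached
sub find_overlaps {
    my ($chr, $start, $end) = @_;
    # ... scan the coordinate data here ...
    return "result for $chr:$start-$end";
}

memoize('find_overlaps');        # wrap it: repeat calls return cached results

find_overlaps('chr2', 160, 210); # computed once
find_overlaps('chr2', 160, 210); # served from the cache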

    It might increase the memory footprint but make the processing faster, so it's best to benchmark it; and if you do use it after all, I would be interested in hearing the test results.

    Memoization is not suitable for your project. I read your question hastily and misinterpreted it.

      Wow, I've definitely got plenty to work with here. Thank you all for your help.

      Kind Regards,
      Rich

        Be aware that memoisation will not help your problem at all.

        Memoisation speeds up repetitive calculations by caching the results of those calculations to avoid recalculating them. It uses (often prodigious amounts of) memory to gain speed.

        Since your stated goal is to reduce memory usage, and since you have no repetitive calculations, any attempt to use the linked module could not help your performance and would increase your memory footprint, probably to the point of out-of-memory failure.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by tospo (Hermit) on Feb 28, 2011 at 09:50 UTC
    There are quite a few database solutions for this problem. SQL itself isn't that great for range queries, but there are binning schemes that basically extend the idea of doing the query by chromosome and narrow the query down to a region. Check out Bio::DB::GFF for example. The underlying binning scheme is described in this 2002 paper by Lincoln Stein et al.
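    As a sketch of the idea, the smallest-enclosing-bin calculation in that UCSC-style scheme looks roughly like this (the constants follow the standard scheme, but treat the details as an assumption and check the paper):

use strict;
use warnings;

# Standard UCSC-style binning: bin sizes of 128k, 1M, 8M, 64M and 512M,
# with per-level offsets so that every bin gets a unique number.
my @BIN_OFFSETS = (585, 73, 9, 1, 0);
use constant BIN_FIRST_SHIFT => 17;   # log2(128k)
use constant BIN_NEXT_SHIFT  => 3;    # each level is 8x wider

# smallest bin that fully contains [$start, $end) (0-based, half-open)
sub bin_from_range {
    my ($start, $end) = @_;
    my $start_bin = $start     >> BIN_FIRST_SHIFT;
    my $end_bin   = ($end - 1) >> BIN_FIRST_SHIFT;
    for my $offset (@BIN_OFFSETS) {
        return $offset + $start_bin if $start_bin == $end_bin;
        $start_bin >>= BIN_NEXT_SHIFT;
        $end_bin   >>= BIN_NEXT_SHIFT;
    }
    die "range $start-$end too large for this binning scheme\n";
}

# store each feature with its bin; a query then only needs to inspect the
# handful of bins whose ranges can overlap the query interval
print bin_from_range(150, 200), "\n";   # a small feature lands in a small bin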
      Postgres may not be as portable as SQLite, but Postgres has geometric data types and functions, and if you treat begin and end as a box with no height on the x axis, Postgres's box operators will work. For example, a record with begin=100 and end=200 becomes the box ((100,0),(200,0)).

      http://www.postgresql.org/docs/8.2/static/functions-geometry.html

      Your sample data of 5301598 records took me an hour to load into a default installation of Postgres, but it gives very quick responses.

      ###SQL###
drop table chromosomes;

create table chromosomes(
     score        integer
    ,div          numeric(6,1)
    ,del          numeric(6,1)
    ,ins          numeric(6,1)
    ,q_sequence   text
    --,begin1     integer
    --,end1       integer
    ,pos          box
    ,left_paren   text
    ,repeat1      text
    ,repeat2      text
    ,class_family text
    ,begin2       text
    ,end2         text
    ,left_paren2  text
    ,ID           integer
);

create index idx_chromosomes_1 on chromosomes(q_sequence);
create index idx_chromosomes_2 on chromosomes using gist(pos);
      ###PERL###
use strict;
use warnings;
use DBI;

my($dbh);
system('date');
$dbh = DBI->connect('dbi:Pg:dbname=test1','','',{AutoCommit=>0})
    or die $dbh->errstr;
#populate($dbh);
select_test($dbh);
$dbh->disconnect;
system('date');
exit 0;

sub populate{
    my($dbh)=@_;
    my($sth,$sql,$cnt);
    $cnt=0;
    $sql="insert into chromosomes values (?,?,?,?,?,?,?,?,?,?,?,?,?,?)";
    $sth=$dbh->prepare($sql) or die $dbh->errstr;
    eval{
        open my $fh , "<", "hg19.fa.out" or die $!;
        while(my $line=<$fh>){
            $cnt++;
            next if $cnt <= 3;
            chomp $line;
            $line =~ s/^\s+//;
            my @vals=split(/\s+/, $line);
            #begin and end to box type
            $vals[5]="(($vals[5],0),($vals[6],0))";
            splice(@vals,6,1);
            $sth->execute(@vals);
        }
        close $fh;
        $sth->finish;
    };
    if($@){
        $dbh->rollback;
    }else {
        $dbh->commit;
    }
}

sub select_test{
    my($dbh)=@_;
    my($sth,$sql);
    #SQL for overlaps
    #$sql="select * from chromosomes where pos && box '((10000,0),(20000,0))'";
    #SQL for include
    $sql="select * from chromosomes where pos @ box '((10000,0),(20000,0))'";
    $sth=$dbh->prepare($sql) or die $dbh->errstr;
    $sth->execute or die $dbh->errstr;
    while(my $href= $sth->fetchrow_hashref){
        print $href->{pos} . "\n";
    }
    $sth->finish;
}
        Interesting, I wasn't aware of these (I'm not using PostgreSQL very much).
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by BrowserUk (Patriarch) on Mar 01, 2011 at 03:24 UTC
    If it helps and/or you really want to you can get this file from HERE but you'll have to change the column numbers to pull out the right fields!

    Which field or fields in the linked file constitute the "feature" field in your example?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reducing memory footprint when doing a lookup of millions of coordinates
by repellent (Priest) on Mar 04, 2011 at 06:59 UTC
    On my measly VM (256MB memory) running PostgreSQL 8.2 on FreeBSD 6.4, it took less than 8 minutes to load the entire file into a table. Then, given:
chromosome = chr2
starting   = 21000
ending     = 23200
    returns the overlaps in less than half a second.

$ time perl -alne 'print join q{,} => @F[4,5,6,9] if $. > 3' hg19.fa.out | psql -qc "
CREATE TABLE tt (chromosome text, starting integer, ending integer, feature text);
COPY tt (chromosome, starting, ending, feature) FROM STDIN WITH CSV;
CREATE INDEX idx_tt_chromosome ON tt (chromosome);
"
perl -alne 'print join q{,} => @F[4,5,6,9] if $. > 3' hg19.fa.out  67.31s user 82.26s system 71% cpu 3:29.33 total
psql -qc  0.27s user 5.68s system 2% cpu 4:15.92 total

pgsql=> EXPLAIN ANALYZE
pgsql-> SELECT *
pgsql-> FROM
pgsql->     tt
pgsql-> WHERE
pgsql->     chromosome = 'chr2'
pgsql->     AND 21000 <= ending
pgsql->     AND starting <= 23200;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tt  (cost=516.43..38178.20 rows=2945 width=72) (actual time=129.490..341.722 rows=2 loops=1)
   Recheck Cond: (chromosome = 'chr2'::text)
   Filter: ((21000 <= ending) AND (starting <= 23200))
   ->  Bitmap Index Scan on idx_tt_chromosome  (cost=0.00..515.69 rows=26508 width=0) (actual time=128.165..128.165 rows=414622 loops=1)
         Index Cond: (chromosome = 'chr2'::text)
 Total runtime: 341.955 ms

pgsql=> SELECT *
pgsql-> FROM
pgsql->     tt
pgsql-> WHERE
pgsql->     chromosome = 'chr2'
pgsql->     AND 21000 <= ending
pgsql->     AND starting <= 23200;
 chromosome | starting | ending | feature
------------+----------+--------+---------
 chr2       |    21258 |  21370 | MIRb
 chr2       |    22095 |  23394 | L1PA14
(2 rows)
