You could turn the problem inside out: load the test values into memory, then scan the large reference file one line at a time and match each line against them:

#!/usr/bin/perl

use strict;
use warnings;

# Stand-in for the large reference file; real code would open the file itself.
my $reps = <<REPS;
chr1 100 120 feature1
chr1 200 250 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr3 100 150 feature2
chr3 300 450 feature2
REPS

# Test ranges, keyed by chromosome and then by start position.
my %tests;
while (my $line = <DATA>) {
    $line =~ s/[\n\r]//g;
    my @array = split /\s+/, $line;
    $tests{$array[0]}{$array[1]}{'end'} = $array[2];
    $tests{$array[0]}{$array[1]}{'rep'} = $array[3];
}

open my $repIn, '<', \$reps or die "Can't open reference data: $!";

while (<$repIn>) {
    my ($chr, $start, $end, $rep) = split ' ';

    next if !exists $tests{$chr};

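    for my $s (keys %{$tests{$chr}}) {
        # keys() returns the starts in no particular order, so skip a
        # non-overlapping test range rather than leaving the loop early.
        next if $s >= $end;
        next if $start > $tests{$chr}{$s}{'end'};
        print "$chr $start $end $rep\n";
    }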
}

__DATA__
chr2 160 210
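If there are many test ranges per chromosome, it may be worth sorting them by start once, so the inner loop can stop as soon as a test range starts at or beyond the end of the current reference line. What follows is only a minimal sketch of that variation, not a tuned solution: the `$tests` heredoc, the `@hits` array and the `REP` loop label are names made up for illustration, both heredocs are stand-ins for the real files, and it assumes you only want each matching reference line reported once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the real reference and test files.
my $reps = <<'REPS';
chr1 100 120 feature1
chr2 100 120 feature2
chr2 150 200 feature1
chr2 280 350 feature1
chr2 400 450 feature2
REPS

my $tests = <<'TESTS';
chr2 300 320
chr2 160 210
TESTS

# Group the test ranges by chromosome as [start, end] pairs.
my %tests;
for my $line (split /\n/, $tests) {
    my ($chr, $start, $end) = split ' ', $line;
    push @{$tests{$chr}}, [$start, $end];
}

# Sort each chromosome's ranges by start, once, before scanning.
@$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %tests;

my @hits;
open my $repIn, '<', \$reps or die "Can't open reference data: $!";
REP: while (my $line = <$repIn>) {
    my ($chr, $start, $end, $rep) = split ' ', $line;
    next if !exists $tests{$chr};

    for my $range (@{$tests{$chr}}) {
        last if $range->[0] >= $end;     # sorted: no later range can overlap
        next if $range->[1] < $start;    # test range ends before this line starts
        push @hits, "$chr $start $end $rep";
        next REP;                        # report each reference line once
    }
}
close $repIn;

print "$_\n" for @hits;
```

With the sample data above, only the `chr2 150 200` and `chr2 280 350` lines overlap a test range. The early `last` helps only when the test starts are sorted. The loop still walks past every test range that ends before the current reference line, so for very dense test sets a binary search or an interval tree might be a better fit.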

True laziness is hard work

In reply to Re: Reducing memory footprint when doing a lookup of millions of coordinates by GrandFather
in thread Reducing memory footprint when doing a lookup of millions of coordinates by richardwfrancis

	PerlMonks