Re: Sorting Data By Overlapping Intervals

in reply to Sorting Data By Overlapping Intervals

I'm having trouble understanding... So, the first data snippet (lots of columns, first column is always "0") is supposed to be read in via the "CG" file handle, and the second data snippet (3 columns, first column is always "chrX") is read in via "INTERVAL", right?

And, if the latter data snippet is "realistic", then your goal is to produce nine distinct text files, with lots of overlapping content across those nine files - for example, if a given line from the CG input has a value of 999999 in column 4, it will be included in all nine outputs, right? (Because "999999" falls within the range for all nine intervals in that data snippet.)

If I have that right, then I think you'll be better off reading the interval data first and building a hash that holds, for each interval, its output file handle along with its min and max values. Then read the CG data; as you look at each CG record, loop over the hash of intervals and print the record to every file handle whose range contains it. Something like:

# set up your path strings for the two input files... then:

my %intervals;
open( INTERVALS, '<', $interval_path ) or die "$interval_path: $!\n";
while (<INTERVALS>) {
    chomp;
    my ( $str, $min, $max ) = split;
    next unless ( $min =~ /^\d+$/ and $max =~ /^\d+$/ );

    my $out_path = "....";  # whatever makes a good name for this output...

    open( my $ofh, '>', $out_path ) or die "$out_path: $!\n";
    $intervals{$out_path} = { 'min' => $min, 'max' => $max, 'fh' => $ofh };
}

open( CG, '<', $cg_path ) or die "$cg_path: $!\n";
while (<CG>) {
    my $keyval = ( split )[3];
    for my $output ( keys %intervals ) {
        if ( $keyval >= $intervals{$output}{'min'} and
             $keyval <= $intervals{$output}{'max'} ) {
            print { $intervals{$output}{'fh'} } $_;
        }
    }
}

(not tested)
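
If you want to try the range test on its own before touching any files, here is a minimal sketch. The interval names and bounds are made up for illustration (they are not from your data), and matching_outputs is just a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical intervals; names and bounds are invented for illustration.
my %intervals = (
    'out_a.txt' => { min => 100, max => 500 },
    'out_b.txt' => { min => 400, max => 900 },
    'out_c.txt' => { min => 950, max => 999 },
);

# Return the names of all intervals whose [min, max] range contains $keyval,
# sorted by name so the result is stable.
sub matching_outputs {
    my ( $keyval, $intervals ) = @_;
    my @hits = grep {
        $keyval >= $intervals->{$_}{min} and $keyval <= $intervals->{$_}{max}
    } keys %$intervals;
    return sort @hits;
}

# 450 falls inside both out_a.txt (100-500) and out_b.txt (400-900).
print join( ', ', matching_outputs( 450, \%intervals ) ), "\n";
# prints: out_a.txt, out_b.txt
```

A value that lies in several overlapping ranges comes back once per range, which is the same "one line goes to many outputs" behaviour as the loop above.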

UPDATE: If your "intervals" list is really a lot longer than the 9-line snippet that you showed us, you might not be able to have that many output file handles open at once. In that case, move the open statement out of the first while loop (don't store an 'fh' element in the hash), and put it just before the print statement of the second loop (and change to append mode) - i.e.:

...
while (<CG>)
{
    my $keyval = ( split )[3];
    for my $output ( keys %intervals )
    {
        if ( $keyval >= $intervals{$output}{'min'} and
             $keyval <= $intervals{$output}{'max'} )
        {
            open( my $ofh, '>>', $output ) or die "$output: $!\n";
            print $ofh $_;
        }
    }
}

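Because the file handle is a lexical scalar declared inside the "if" block, it is closed automatically when that block ends, which is what you want in this case. Note that append mode adds to whatever is already in those files, so remove any output left over from an earlier run before starting.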
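
If reopening a file for every matching line turns out to be too slow, one possible middle ground is to collect lines per output path in memory and append them in batches. This is a different technique from the code above, not something from your post, and flush_buffers is a hypothetical helper name. A minimal sketch, using a temporary directory so it can be run safely:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: append every buffered line to its file, then empty
# the buffer. Only one output handle is open at any moment.
sub flush_buffers {
    my ($buffers) = @_;
    for my $path ( keys %$buffers ) {
        next unless @{ $buffers->{$path} };
        open( my $ofh, '>>', $path ) or die "$path: $!\n";
        print {$ofh} @{ $buffers->{$path} };
        close($ofh) or die "$path: $!\n";
    }
    %$buffers = ();
}

# Small demonstration with an invented file name in a throwaway directory.
my $dir = tempdir( CLEANUP => 1 );
my %buffers;
push @{ $buffers{"$dir/demo.txt"} }, "line 1\n", "line 2\n";
flush_buffers( \%buffers );
```

In the CG loop you would push $_ onto @{ $buffers{$output} } instead of opening and printing, call flush_buffers( \%buffers ) every so many input lines, and call it once more after the loop ends. How often to flush is a memory-versus-speed trade-off that depends on your data.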
