Re: Sorting Data By Overlapping Intervals

by graff (Chancellor)
on Oct 31, 2013 at 09:38 UTC

in reply to Sorting Data By Overlapping Intervals

I'm having trouble understanding... So, the first data snippet (lots of columns, first column is always "0") is supposed to be read in via the "CG" file handle, and the second data snippet (3 columns, first column is always "chrX") is read in via "INTERVAL", right?

And, if the latter data snippet is "realistic", then your goal is to produce nine distinct text files, with lots of overlapping content across those nine files - for example, if a given line from the CG input has a value of 999999 in column 4, it will be included in all nine outputs, right? (Because "999999" falls within the range for all nine intervals in that data snippet.)

If I have that right, then I think you'll be better off if you read the interval data first and build a hash that holds, for each interval, an output file handle along with its min and max values. Then read the CG data; as you look at each CG record, loop over the hash of intervals and print the record to every file handle whose range contains it. Something like:

    # set up your path strings for the two input files... then:

    my %intervals;
    open( INTERVALS, $interval_path ) or die "$interval_path: $!\n";
    while (<INTERVALS>) {
        chomp;
        my ( $str, $min, $max ) = split;
        next unless ( $min =~ /^\d+$/ and $max =~ /^\d+$/ );
        my $out_path = "....";  # whatever makes a good name for this output...
        open( my $ofh, '>', $out_path ) or die "$out_path: $!\n";
        $intervals{$out_path} = { 'min' => $min, 'max' => $max, 'fh' => $ofh };
    }
    open( CG, $cg_path ) or die "$cg_path: $!\n";
    while (<CG>) {
        my $keyval = ( split )[3];
        for my $output ( keys %intervals ) {
            if ( $keyval >= $intervals{$output}{'min'} and
                 $keyval <= $intervals{$output}{'max'} ) {
                print { $intervals{$output}{'fh'} } $_;
            }
        }
    }
(not tested)
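
To make the matching step concrete, here is a minimal sketch that runs without any input files. The interval labels, the sample records, and the helper name matching_intervals are all made up for illustration (they are not from the original data), and both interval ends are assumed to be inclusive, as in the >= / <= test above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the labels of every interval containing $pos.
# Each interval is a [label, min, max] triple; both ends are inclusive.
sub matching_intervals {
    my ( $pos, @intervals ) = @_;
    return map  { $_->[0] }
           grep { $pos >= $_->[1] and $pos <= $_->[2] } @intervals;
}

# Made-up sample intervals (assumption, not the poster's data).
my @intervals = (
    [ 'win_1', 1,       1_000_001 ],
    [ 'win_2', 100_001, 1_100_001 ],
    [ 'win_3', 200_001, 1_200_001 ],
);

# Made-up CG-style records; only the 4th whitespace-separated field is used.
my @records = (
    "0 a b 50000 x",
    "0 a b 150000 x",
    "0 a b 1150000 x",
);

my %bucket;    # label => list of records that fall inside that interval
for my $line (@records) {
    my $keyval = ( split ' ', $line )[3];
    push @{ $bucket{$_} }, $line for matching_intervals( $keyval, @intervals );
}

# Report how many records landed in each interval.
for my $label ( sort keys %bucket ) {
    printf "%s: %d record(s)\n", $label, scalar @{ $bucket{$label} };
}
```

In the real script, each bucket would be an output file handle rather than an in-memory array, but the containment test is the same.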

UPDATE: If your "intervals" list is really a lot longer than the 9-line snippet that you showed us, you might not be able to have that many output file handles open at once. In that case, move the open statement out of the first while loop (don't store an 'fh' element in the hash), and put it just before the print statement of the second loop (and change to append mode) - i.e.:

    ...
    while (<CG>) {
        my $keyval = ( split )[3];
        for my $output ( keys %intervals ) {
            if ( $keyval >= $intervals{$output}{'min'} and
                 $keyval <= $intervals{$output}{'max'} ) {
                # $output is the hash key, i.e. the output file's path
                open( my $ofh, '>>', $output ) or die "$output: $!\n";
                print $ofh $_;
            }
        }
    }
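Because the file handle is a lexical scalar, it goes out of scope and is closed automatically at the end of each iteration, which is what you would want in this case. Since the files are opened in append mode, remove any output left over from an earlier run first, or the new lines will be added after the old ones.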

Re^2: Sorting Data By Overlapping Intervals
by ccelt09 (Sexton) on Oct 31, 2013 at 10:55 UTC

    Yes and yes to the first paragraph's questions.

    Since the values in the 4th column of the first data file are all within the first interval specified in my second data file, they would all be printed to the same output file. Once all of those lines have been printed, I wish to move my interval from (1, 1000001) to (100001, 1100001) and repeat the process (in this case with lines not shown here that have 4th column values within the second interval).

    My interval will always be one million in size, but the start and end values increase by one hundred thousand each time.
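
If the intervals really are that regular, the windows that contain a given position can be computed directly instead of looping over every interval. The following is only a sketch under assumptions taken from the description above: window k (counting from 0) runs from 1 + k*100000 to that start plus 1000000, with both ends inclusive. The names window_bounds and windows_containing are hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(floor ceil);

my $SIZE = 1_000_000;    # distance from window start to window end (assumed)
my $STEP = 100_000;      # how far each successive window is shifted (assumed)

# Start and end of window $k, counting from 0: (1, 1000001), (100001, 1100001), ...
sub window_bounds {
    my ($k) = @_;
    my $start = 1 + $k * $STEP;
    return ( $start, $start + $SIZE );
}

# Indices of all windows whose inclusive range contains $pos.
# No upper limit on the window count is applied here.
sub windows_containing {
    my ($pos) = @_;
    return () if $pos < 1;
    my $k_max = floor( ( $pos - 1 ) / $STEP );            # last window starting at or before $pos
    my $k_min = ceil( ( $pos - 1 - $SIZE ) / $STEP );     # first window ending at or after $pos
    $k_min = 0 if $k_min <= 0;
    return ( $k_min .. $k_max );
}

# Show every window that would receive a record at position 1100001.
for my $k ( windows_containing(1_100_001) ) {
    my ( $lo, $hi ) = window_bounds($k);
    print "window $k: ($lo, $hi)\n";
}
```

With a real interval file, the list returned by windows_containing would still need to be capped at the number of intervals actually present.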