http://www.perlmonks.org?node_id=1060539

ccelt09 has asked for the wisdom of the Perl Monks concerning the following question:

I have a data frame that looks like this:

0 25 27 60454 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3 0 19 33 60466 2 2 3 2 2 2 3 3 3 3 1 1 2 1 2 3 3 3 2 1 2 3 2 2 3 3 0 25 27 60692 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3 0 50 2 60727 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 48 4 60814 1 1 1 1 1 1 2 1 1 3 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 0 46 6 60866 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 2 0 25 27 60882 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3 0 48 4 60888 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 0 50 2 60909 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

and it goes on this way for quite a while. I want to separate and reprint lines from the data frame, sorting by the values in 4th column. My sorting parameters are the intervals specified in another data frame:

chrX 1 1000001 chrX 100001 1100001 chrX 200001 1200001 chrX 300001 1300001 chrX 400001 1400001 chrX 500001 1500001 chrX 600001 1600001 chrX 700001 1700001 chrX 800001 1800001

so I'd want all lines with 4th column values within an interval to be printed to it's own text file. I can do this when the intervals don't overlap using the code below, but when the intervals overlap as they do above I lose data from overlapping regions

my $placeholder = 0; my $count = 0; #my $window = "1Mbwindow_overlapping100kb"; my $interval_directory = "/Users/logancurtis-whitchurch/Desktop/IB_Sen +ior_Thesis/galaxy_chrX_data/"; my $input_interval = "$interval_directory"."chrX_"."$window".".txt"; my $cg_input = "/Users/logancurtis-whitchurch/Desktop/IB_Senior_Thesis +/CompleteGenomics/All26.females/CGS.All26.txt"; #my $output_directory = "/Users/logancurtis-whitchurch/Desktop/IB_Seni +or_Thesis/temps/overlapping/1Mb/"; open (INTERVAL, "$input_interval") or die "can't open $input_interval\ +n"; my $interval = <INTERVAL>; my (@find_interval, $start, $end); open (CG, "<$cg_input") or die "can't open $cg_input\n"; my @SNPs = <CG>; close(CG); foreach my $interval (<INTERVAL>){ @find_interval = split(/\t/, $interval); $start = $find_interval[1]; $end = $find_interval[2]; my $tmp = "temp_file_"."$count".".txt"; my $output_file = "$output_directory" . "$tmp"; open (OUT, ">>$output_file"); my $switch = 1; while ($switch == 1) { my @get_SNPs = split(/\t/, $SNPs[$placeholder]); my $position = $get_SNPs[3]; if (($position < $start) && ($position < $end)) { $placeholder++; } if (($position >= $start) && ($position <= $end)) { print OUT "@get_SNPs"; $placeholder++; } else { $switch =! 1; } } $count++; } close(INTERVAL);

Replies are listed 'Best First'.
Re: Sorting Data By Overlapping Intervals
by graff (Chancellor) on Oct 31, 2013 at 09:38 UTC
    I'm having trouble understanding... So, the first data snippet (lots of columns, first column is always "0") is supposed to be read in via the "CG" file handle, and the second data snippet (3 columns, first column is always "chrX") is read in via "INTERVAL", right?

    And, if the latter data snippet is "realistic", then your goal is to produce nine distinct text files, with lots of overlapping content across those nine files - for example, if a given line from the CG input has a value of 999999 in column 4, it will be included in all nine outputs, right? (Because "999999" falls within the range for all nine intervals in that data snippet.)

    If I have that right, then I think you'll be better off if you read the interval data first, create a hash containing file handles for the intervals along with their min and max values. Then read the CG data; as you look at each CG record, loop over the hash of intervals and print to each of the file handles where it belongs. Something like:

    # set up your path strings for the two input files... then: my %intervals; open( INTERVALS, $interval_path ) or die "$interval_path: $!\n"; while (<INTERVALS>) { chomp; my ( $str, $min, $max ) = split; next unless ( $min =~ /^\d+$/ and $max =~ /^\d+$/ ); my $out_path = "...."; # whatever makes a good name for this outp +ut... open( my $ofh, '>', $out_path ) or die "$out_path: $!\n"; $intervals{$out_path} = { 'min' => $min, 'max' => $max, 'fh' => $o +fh }; } open( CG, $cg_path ) or die "$cg_path: $!\n"; while (<CG>) { my $keyval = ( split )[3]; for my $output ( keys %intervals ) { if ( $keyval >= $intervals{$output}{'min'} and $keyval <= $intervals{$output}{'max'} ) { print { $intervals{$output}{'fh'} } $_; } } }
    (not tested)

    UPDATE: If your "intervals" list is really a lot longer than the 9-line snippet that you showed us, you might not be able to have that many output file handles open at once. In that case, move the open statement out of the first while loop (don't store an 'fh' element in the hash), and put it just before the print statement of the second loop (and change to append mode) - i.e.:

    ... while (<CG>) { my $keyval = ( split )[3]; for my $output ( keys %intervals ) { if ( $keyval >= $intervals{$output}{'min'} and $keyval <= $intervals{$output}{'max'} ) { open( my $ofh, '>>', $output ) or die "$out_path: $!\n"; print $ofh $_; } } }
    By using a lexical scalar variable for the file handle, it will be closed automatically at each iteration, which is what you would want in this case.

      yes and yes to the first paragraph's questions

      since the values in the 4th column for the first data file are all within the first interval specified in my second data file, they would all be printed to the same output file, once all of those lines have been printed I wish to increase my interval from (1, 1000001) to (100001, 1100001) and repeat the process (in this case with lines not shown here but that have 4th column values within the second interval.

      my interval will always be one million in size, but the start and end values are increasing by one hundred thousand

Re: Sorting Data By Overlapping Intervals
by hdb (Monsignor) on Oct 31, 2013 at 09:42 UTC

    I would propose two modifications. First, when you load the data from file, already extract the fourth column and store it alongside the lines:

    my @SNPs = map { [ (split /\t/)[3], $_ ] } <CG>;

    So each element of @SNPs is now an array reference, whose first element is the fourth column and the second element is the full line.

    As the second change, in your loop over the intervals pick all elements that fall in this interval using grep and the extract the line from the array reference using map:

    my @inInterval = map { $_->[1] } grep { $start <= $_->[0] and $_->[0] +<= $end } @SNPs;

    All you need then is to print these lines into the relevant file.

    I am not sure whether I explain this well...

      the logic behind this makes sense but once I have each element of  @SNPs stored as an array reference as you explained above i don't understand how to print those falling within the ranges in my second data set to a relevant file

        This is what my second proposal does. If you have the interval boundaries in variables $start and $end, then

        my @inInterval = map { $_->[1] } grep { $start <= $_->[0] and $_->[0] +<= $end } @SNPs;

        will filter all relevant lines for this interval. You would just print OUT @inInterval; where OUT is the file handle for the file corresponding to this interval.

        Something like this:

        open my $CG, "<", $cg_input or die "can't open $cg_input\n"; my @SNPs = map { [ (split /\t/)[3], $_ ] } <$CG>; close($CG); open my $INTERVAL, "<", $input_interval or die "can't open $input_inte +rval\n"; my $interval = <$INTERVAL>; # skip first line foreach (<$INTERVAL>){ chomp; my( $start, $end ) = split /\t/; open my $OUT, ">", $output_directory."temp_file_".$count++.".txt"; + print $OUT map { $_->[1] } grep { $start <= $_->[0] and $_->[0] <= + $end } @SNPs; close $OUT; } close($INTERVAL);
Re: Sorting Data By Overlapping Intervals
by Eily (Monsignor) on Oct 31, 2013 at 09:45 UTC

    The problem is that your $placeholder value isn't set back to 0 when you finish testing a range, so you never go back to test the previous lines. The solution to your problem would be to add a $placeholder = 0 at the end of your foreach loop.

    Now, if you want some advice on your code, first, I would have gone the other way around, have all the ranges in memory, and for each line of your data, test all ranges. And to keep track of the file that needs to be written on, an array of hashes could do the trick. Something like :

    [ { start => 1, end => 1001, file => *FH1 }, { start => 500, end => 2001, file => *FH2 } ]
    Except the content would have been filled by perl instead of by hand like I just did :) . Then the algorithm would have been something like :
    for each line $position = getPosition() for $range in @ranges print $range->{FH} $line if (($position > $range->{start}) && ($ +position < $range->{end}))

    Then, maybe you don't want to rewrite your whole code (that's why I didn't bother much), still, there are some useful things you might like to know. If you want to have a "do something for each value and stop when condition" construct in Perl, you can use the last keyword inside a foreach loop. In your case that would be :

    SNP:for my $snp (@SNPs) { my @get_SNPs = split(/\t/, $snp); my $position = $get_SNPs[3]; last SNP if ($position > $end); # stop reading, we're out of r +ange ! if (($position >= $start) && ($position <= $end)) { print OUT "@get_SNPs"; } }

    Also, instead of

    @array = split /\t/, $string; my $v1 = $array[1]; my $v2 = $array[2];
    you can simply write my (undef, $v1, $v2) = split /\t/, $string;

      The only problem with setting $placeholder=0 is that I would need a way to iteratively increase its value until it is greater than or equal to my next start value before the for loop is completed. Just setting that variable to 0 after the loop means my  $position value is always less than  $start another way of saying that is:

      my position variable resets to the first value of 60454 but $start and $end increase with each loop, so nothing prints after the first output file

        Oh, right, I read an elsif instead of the second if, which meant that you would only have exited the loop when $position is above range. Then resetting $placeholder to 0 would work I guess (untested). But the condition is that your input data has to be sorted (as in ordered), which it seemed to be in your sample.

        Still, you don't check that $placeholder is a valid value, if the last element of @SNP is inside one of the ranges, you'll increase $placeholder and try to access $SNP[last element+1] which would yield undef. I'm not sure you thought of that case.

        In the end, your inner loop reworked would be something like :

        # It would probably be better have # while (my $line = <CG>) # but that would mean rethinking your whole code for my $line (@SNP) { my $position = (split " ", $line)[3]; last unless $position <= $end; print OUT $line if $position > $start; }
        This is of course, completely untested :D.

Re: Sorting Data By Overlapping Intervals
by thezip (Vicar) on Oct 31, 2013 at 14:18 UTC

    Would it be beneficial to use a database, and then compose the appropriate SQL queries to process your data?

    You might benefit from abstracting the file handling part from the data munging part. You'd also be able to tweak the queries and review the outputs until you get them just right.



    What can be asserted without proof can be dismissed without proof. - Christopher Hitchens, 1949-2011
Re: Sorting Data By Overlapping Intervals
by kcott (Archbishop) on Nov 01, 2013 at 09:17 UTC

    G'day ccelt09,

    Here's my take on a solution. I've output to a hash (with possibly unneeded, additional data): easily modified for printing. I've kept the interval data unchanged. I've made slight changes to the line data: most lines match multiple ranges; some match none.

    #!/usr/bin/env perl use strict; use warnings; use Data::Dumper; use Inline::Files; my (@range, %capture); my $line_re = qr[^(?:\d+\s+){3}(\d+)]; while (<INTERVALS>) { my ($min, $max) = (split)[1,2]; push @range => [ $min, $max ]; $capture{range}{$min} = $max; } while (<FRAME>) { chomp; collate($_); } { local $Data::Dumper::Indent = 1; local $Data::Dumper::Sortkeys = 1; print Dumper \%capture; } sub collate { my $line = shift; my ($key) = $line =~ $line_re; for (@range) { last if $_->[0] > $key; next if $_->[1] < $key; push @{$capture{data}{$_->[0]}} => $line; } } __INTERVALS__ chrX 1 1000001 chrX 100001 1100001 chrX 200001 1200001 chrX 300001 1300001 chrX 400001 1400001 chrX 500001 1500001 chrX 600001 1600001 chrX 700001 1700001 chrX 800001 1800001 __FRAME__ 0 25 27 260692 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3 0 19 33 160466 2 2 3 2 2 2 3 3 3 3 1 1 2 1 2 3 3 3 2 1 2 3 2 2 3 3 0 25 27 60454 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3 0 25 27 3260882 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3 0 50 2 460727 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 48 4 860814 1 1 1 1 1 1 2 1 1 3 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 0 46 6 1660866 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 2 0 48 4 6460888 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 0 50 2 60909 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

    Output:

    -- Ken