I have a data frame that looks like this:
0 25 27 60454 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3
0 19 33 60466 2 2 3 2 2 2 3 3 3 3 1 1 2 1 2 3 3 3 2 1 2 3 2 2 3 3
0 25 27 60692 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3
0 50 2 60727 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 48 4 60814 1 1 1 1 1 1 2 1 1 3 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
0 46 6 60866 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 2
0 25 27 60882 2 2 3 2 2 3 3 3 2 3 1 1 2 1 2 2 3 3 2 1 2 1 1 1 2 3
0 48 4 60888 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2
0 50 2 60909 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
and it goes on this way for quite a while. I want to split the lines of this data frame into groups according to the value in the 4th column. The groups are defined by the intervals in another data frame:
chrX 1 1000001
chrX 100001 1100001
chrX 200001 1200001
chrX 300001 1300001
chrX 400001 1400001
chrX 500001 1500001
chrX 600001 1600001
chrX 700001 1700001
chrX 800001 1800001
So I'd want all lines whose 4th column value falls within an interval to be printed to that interval's own text file. I can do this when the intervals don't overlap using the code below, but when the intervals overlap, as they do above, I lose data from the overlapping regions:
use strict;
use warnings;

my $placeholder = 0;
my $count = 0;
my $window = "1Mbwindow_overlapping100kb";
my $interval_directory = "/Users/logancurtis-whitchurch/Desktop/IB_Senior_Thesis/galaxy_chrX_data/";
my $input_interval = $interval_directory . "chrX_" . $window . ".txt";
my $cg_input = "/Users/logancurtis-whitchurch/Desktop/IB_Senior_Thesis/CompleteGenomics/All26.females/CGS.All26.txt";
my $output_directory = "/Users/logancurtis-whitchurch/Desktop/IB_Senior_Thesis/temps/overlapping/1Mb/";

open (INTERVAL, "<", $input_interval) or die "can't open $input_interval: $!\n";
my $interval = <INTERVAL>;    # note: this reads and discards the first line of the interval file
my (@find_interval, $start, $end);

# Load every SNP line into memory
open (CG, "<", $cg_input) or die "can't open $cg_input: $!\n";
my @SNPs = <CG>;
close(CG);

foreach my $interval (<INTERVAL>) {
    chomp $interval;
    @find_interval = split(/\t/, $interval);
    $start = $find_interval[1];
    $end = $find_interval[2];
    my $tmp = "temp_file_" . $count . ".txt";
    my $output_file = $output_directory . $tmp;
    open (OUT, ">>", $output_file) or die "can't open $output_file: $!\n";
    my $switch = 1;
    while ($switch == 1 && $placeholder < @SNPs) {
        my @get_SNPs = split(/\t/, $SNPs[$placeholder]);
        my $position = $get_SNPs[3];
        if ($position < $start) {
            # SNP lies before this interval: skip it
            $placeholder++;
        }
        elsif ($position <= $end) {
            # SNP lies inside this interval: print it
            print OUT "@get_SNPs";
            $placeholder++;
        }
        else {
            # SNP lies past this interval: move on to the next interval.
            # $placeholder only moves forward, so SNPs already printed here
            # are not revisited for the next interval, even when it overlaps.
            $switch = 0;
        }
    }
    close(OUT);
    $count++;
}
close(INTERVAL);
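
For comparison, here is a rough sketch of one way the scan might be made to cope with overlapping windows: keep a marker at the first SNP that can still belong to the current interval, and walk forward from it with a separate index, so the marker is not pushed past SNPs the next interval also needs. This is only an illustration, not a tested replacement for the script above. The sub name bin_snps_by_interval and the [start, end] pair layout are made up for the example, and it assumes that the SNP lines are sorted by position, that the intervals are sorted by start, and that fields are separated by whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper (name and interface are assumptions for this sketch).
# $snps      : array ref of SNP lines, sorted by position (4th field)
# $intervals : array ref of [start, end] pairs, sorted by start
# Returns an array ref holding one array ref of matching lines per interval.
sub bin_snps_by_interval {
    my ($snps, $intervals) = @_;

    # Pull out the position (4th whitespace-separated field) of each line once.
    my @positions = map { (split ' ', $_)[3] } @$snps;

    my @bins;
    my $first = 0;    # index of the first SNP that may still fall in an interval
    for my $interval (@$intervals) {
        my ($start, $end) = @$interval;

        # Starts are sorted, so SNPs before this start are never needed again.
        $first++ while $first < @positions && $positions[$first] < $start;

        # Walk forward with a separate index so $first stays where it is
        # for the next, possibly overlapping, interval.
        my @hits;
        for (my $i = $first; $i < @positions && $positions[$i] <= $end; $i++) {
            push @hits, $snps->[$i];
        }
        push @bins, \@hits;
    }
    return \@bins;
}

# Small made-up example: three SNP lines, two overlapping windows.
my $bins = bin_snps_by_interval(
    [ "0 25 27 60454 2", "0 19 33 150000 2", "0 50 2 1050000 1" ],
    [ [ 1, 1000001 ], [ 100001, 1100001 ] ],
);
printf "interval %d: %d line(s)\n", $_, scalar @{ $bins->[$_] } for 0 .. $#$bins;
```

Each element of @$bins could then be written to its own temp_file_N.txt, as in the loop above. The cost is that SNPs inside an overlap are visited once per window that contains them.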