Building a new file by filtering a randomized old file on two fields
by mbp (Initiate)
on Apr 30, 2014 at 07:17 UTC
mbp has asked for the wisdom of the Perl Monks concerning the following question:
I am working with biological data consisting of locations on chromosomes. I am trying to randomly subsample these data with the following caveats:
1) All locations in the resulting subset must be either on different chromosomes or, if on the same chromosome, at least a specified distance from each other.
2) The subset needs to be a specific size (for example, 500 locations).
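Caveat 1 boils down to a pairwise test on two fields, the chromosome and the position. Below is a minimal sketch of that test, assuming each location has been reduced to an array ref of `[chromosome, position]`. The function name `compatible` and the sample values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if two locations may both appear in the subset.
# Each location is an array ref of the form [ $chromosome, $position ].
sub compatible {
    my ($loc_a, $loc_b, $min_dist) = @_;

    # Locations on different chromosomes never conflict.
    return 1 if $loc_a->[0] ne $loc_b->[0];

    # Same chromosome: require at least $min_dist base pairs between them.
    return abs($loc_a->[1] - $loc_b->[1]) >= $min_dist ? 1 : 0;
}

# Made-up values, only to show the call.
print compatible(['chr1', 100], ['chr1', 110], 20) ? "ok\n" : "too close\n";
print compatible(['chr1', 100], ['chr2', 105], 20) ? "ok\n" : "too close\n";
```

A candidate line is then acceptable only if it is compatible with every line already chosen.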
The data I have look similar to this:
In this case, the first column is the identity of the chromosome, and the second and third columns are the position on the chromosome in base pairs (these two will always be identical for a given line, as shown). The remaining columns are not directly important for this exercise, but they need to be kept in the final output.
Using the above as an example dataset, I might want to randomly select two positions (lines) that are at least 20 base pairs apart from each other or on separate chromosomes. Possible outputs would include lines 1 and 3, 1 and 4, 2 and 3, or 2 and 4. Lines 1 and 2, on the other hand, would not be allowed, because the positions are too close (only 10 base pairs apart on the same chromosome).
I'm afraid that this is one or two levels above my skill set with Perl. So far, I can make a new array from a random subset of lines:
But I am having a difficult time figuring out how to filter on two fields of my lines. I have looked into 2D arrays with some hope (definitely a new area for me), but I am concerned that the memory usage might be prohibitive, as my files can be up to 4-5 GB. For that matter, I suppose reading the entire file into an array might not be the best strategy in the first place...?
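One way to avoid holding the whole file in memory is to index it by byte offset, then visit the lines in random order and keep each one only if it passes the two-field test. The sketch below assumes whitespace-delimited columns, no header line, and chromosome and position in the first two columns. The name `subsample` and its parameters are hypothetical. It is a greedy selection, so it is not guaranteed to be uniform over all valid subsets, and it can return fewer lines than requested if the data cannot supply enough compatible locations.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper, a sketch under the assumptions stated above.
# Pass 1 records the byte offset of every line (one number per line).
# Pass 2 visits the lines in random order and keeps a line only if it is
# at least $min_dist away from every kept line on the same chromosome.
sub subsample {
    my ($path, $want, $min_dist) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!";

    # Pass 1: remember where each line starts.
    my @offsets;
    my $pos = tell $fh;
    while (defined(my $line = <$fh>)) {
        push @offsets, $pos;
        $pos = tell $fh;
    }

    # Pass 2: greedy acceptance in random order.
    my (%kept_pos, @kept_lines);
    for my $offset (shuffle @offsets) {
        last if @kept_lines >= $want;

        seek $fh, $offset, 0 or die "seek failed: $!";
        my $line = <$fh>;
        next unless defined $line && $line =~ /\S/;   # skip blank lines

        # Only the first two fields matter; the full line is kept intact.
        my ($chrom, $start) = split ' ', $line;

        # Reject if too close to any kept position on this chromosome.
        next if grep { abs($_ - $start) < $min_dist }
                @{ $kept_pos{$chrom} || [] };

        push @{ $kept_pos{$chrom} }, $start;
        push @kept_lines, $line;
    }

    close $fh;
    return @kept_lines;
}

# Example call (file name is a placeholder):
# my @subset = subsample('locations.txt', 500, 20);
# print @subset;
```

Only the offsets and the accepted lines stay in memory. For files with a very large number of lines the offset array itself can still be sizeable, so this is a starting point and not a tuned solution.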
If it is of use, my pseudocode would look something like this:
Does that make sense? Hopefully not too confusing, and any advice would be greatly appreciated. Thanks in advance for your thoughts.