http://www.perlmonks.org?node_id=1084445

mbp has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am working with biological data consisting of locations on chromosomes. I am trying to randomly subsample this data with the following caveats:

1) all resulting locations in the new subset must be either on different chromosomes or at least a specified distance from each other if on the same chromosome

2) the subset needs to be a specific size (for example, 500 locations)

The data I have look similar to this:

chromosome1 100 100 . G T several other columns
chromosome1 110 110 . A C several other columns
chromosome1 200 200 . C T several other columns
chromosome2 125 125 . C T several other columns

In this case, the first column is the identity of the chromosome, and the second and third columns are the position on the chromosome in base pairs (these two will always be identical for a given line, as shown). There are then a number of other columns that are not of direct importance for this exercise but need to be maintained in the final output.

Using the above as an example dataset, I might want to randomly select two positions (lines) that are at least 20 base pairs apart from each other or on separate chromosomes. Possible outputs from this would include lines 1 and 3, 1 and 4, 2 and 3, or 2 and 4. On the other hand, lines 1 and 2 would not be allowed, as the positions are too close (only 10 base pairs apart on the same chromosome).

I'm afraid that this is one or two levels above my skill set with Perl. So far, I can build a new array from a random subset of lines:

use strict;
use warnings;
use List::Util qw(shuffle);

my $file = $ARGV[0];
open(my $vcf, '<', $file) or die "Cannot open $file: $!";
my @array = <$vcf>;    # read file into array
close $vcf;

my @newvcf;
for (my $i = 0; $i < 1000000; ++$i) {    # giving script plenty of room to work...
    my $randomline = $array[rand @array];    # pick a random line of the file
    if (scalar @newvcf < 2) {
        push(@newvcf, $randomline);    # build new array/subset of lines
    }
}

But I am having a difficult time figuring out how to filter on two fields of my lines. I have looked into 2D arrays with some hope (definitely a new area for me), but I am concerned that the memory usage might be prohibitive, as my file sizes can be up to 4-5 GB. For that matter, I suppose reading the entire file into an array might not be the best strategy in the first place...?
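On the memory question, one option (a sketch only, not tested on multi-gigabyte input) is to keep just the chromosome, the position, and the byte offset of each line, and to re-read the full line with seek when it is needed. The helper names build_index and fetch_line are made up for illustration, skipping '#' header lines is an assumption about the file format, and the demo reads the four sample lines from an in-memory filehandle instead of a real file.

```perl
use strict;
use warnings;

# Build a lightweight index: one [chromosome, position, byte offset] entry
# per line, instead of holding every full line in memory.
# build_index() is a hypothetical helper name, not part of any module.
sub build_index {
    my ($fh) = @_;
    my @index;
    while (1) {
        my $offset = tell($fh);    # byte position where the next line starts
        my $line   = <$fh>;
        last unless defined $line;
        # Assumption: header lines start with '#'; blank lines are ignored.
        next if $line =~ /^#/ || $line !~ /\S/;
        my ($chrom, $pos) = split ' ', $line, 3;    # only the first two fields are needed
        push @index, [$chrom, $pos, $offset];
    }
    return \@index;
}

# Re-read the full original line for one index entry (hypothetical helper).
sub fetch_line {
    my ($fh, $entry) = @_;
    seek($fh, $entry->[2], 0) or die "seek failed: $!";
    return scalar <$fh>;
}

# Demo on the sample data, using an in-memory filehandle.
my $sample = join '',
    "chromosome1 100 100 . G T several other columns\n",
    "chromosome1 110 110 . A C several other columns\n",
    "chromosome1 200 200 . C T several other columns\n",
    "chromosome2 125 125 . C T several other columns\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";

my $index = build_index($fh);
print fetch_line($fh, $index->[3]);    # prints the chromosome2 line
```

This still stores one small record per line, so for files with a very large number of lines even the index could be sizeable; it only avoids keeping the other columns in memory.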

If it is of use, my pseudocode would look something like this:

either randomize file first (for example with unix 'shuf') and read first line of randomized file
or slurp entire file into array and then randomize

compare first random line with second random line
- IF first field of second line (chromosome) matches first field of first line AND second field of second line (position) minus the second field of first line is both less than X and greater than -X (that is, the two positions are less than X apart), discard second line
- ELSE keep both first and second line

compare third random line to first random line as above, and to second random line if it was not discarded and if third line was not discarded due to comparison with first line

continue until a new collection of a specified number of random lines is generated, with no lines containing positions on the same chromosome and within X distance of one another.
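The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tuned solution: pick_spaced is a hypothetical helper name, the records are built from the four sample lines rather than read from a file, and the loop is a simple greedy pass over a shuffled list.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# pick_spaced() is a hypothetical helper: it walks the records in random
# order and keeps a record only if it is at least $min_dist away from every
# record already kept on the same chromosome.
# Each record is [chromosome, position, original line].
sub pick_spaced {
    my ($records, $want, $min_dist) = @_;
    my %kept_pos;    # chromosome => array ref of positions already kept
    my @picked;
    for my $rec (shuffle @$records) {
        my ($chrom, $pos) = @$rec;
        my $kept = $kept_pos{$chrom} ||= [];
        # Discard if any kept position on this chromosome is too close.
        next if grep { abs($pos - $_) < $min_dist } @$kept;
        push @$kept,  $pos;
        push @picked, $rec;
        last if @picked == $want;
    }
    return @picked;    # may hold fewer than $want records if the data ran out
}

# Sample data from the question.
my @lines = (
    "chromosome1 100 100 . G T several other columns\n",
    "chromosome1 110 110 . A C several other columns\n",
    "chromosome1 200 200 . C T several other columns\n",
    "chromosome2 125 125 . C T several other columns\n",
);

# Split off the two fields used for filtering; keep the whole line for output.
my @records = map { my ($c, $p) = split ' ', $_, 3; [$c, $p, $_] } @lines;

my @subset = pick_spaced(\@records, 2, 20);    # 2 lines, at least 20 bp apart
print $_->[2] for @subset;
```

Only positions on the same chromosome are compared, so lines on different chromosomes never conflict. A greedy pass like this is not guaranteed to reach the requested size even when a valid subset of that size exists, and scanning every kept position is only reasonable for small targets such as 500 locations.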

Does that make sense? Hopefully not too confusing, and any advice would be greatly appreciated. Thanks in advance for your thoughts.