G'day mbp,
Welcome to the monastery.
To get around potential memory issues (due to the 4-5 GB files), you can use Tie::File.
It reads records from disk on demand rather than loading the entire file into memory.
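As a minimal sketch of that behaviour, assuming the script only ever reads the file: the temporary file and its three records are hypothetical stand-ins for the real data, while mode and memory are documented Tie::File options.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use File::Temp 'tempfile';
use Tie::File;

# Hypothetical stand-in for the real 4-5 GB locations file.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "chr1 100\nchr1 250\nchr2 75\n";
close $fh or die "Can't close '$filename': $!";

# mode => O_RDONLY: the script only reads, so don't open the file read-write.
# memory: upper limit (in bytes) on Tie::File's record cache.
tie my @locations, 'Tie::File', $filename,
    mode => O_RDONLY, memory => 2_000_000
    or die "Can't tie '$filename': $!";

# Records are fetched from disk on demand, with the line ending removed.
# Asking for the record count makes Tie::File scan the whole file once.
my $record_count = @locations;
my ($chr, $pos)  = split ' ', $locations[1];
print "$record_count records; record 1 is [$chr] [$pos]\n";

untie @locations;
```

The memory option only caps the cache of record contents; Tie::File also remembers the byte offset of every record it has seen, so some per-record overhead remains after the count is taken.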
In the test script below, I just keep an array of indexes into the file's records.
I'm making a very rough guess (as I don't know the record size) that this array will need ~100 MB.
You give an example of 500 locations for a subset.
Assuming that's a realistic number, the memory required for the other data structures is minimal.
Here's my test script:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Tie::File;

my $subset_size = 4;
my $min_distance = 20;
my (%sample, @keep);

# autodie does not cover tie, so check for failure explicitly.
tie my @locations, 'Tie::File', './pm_1084445_locations.txt'
    or die "Can't tie './pm_1084445_locations.txt': $!";
my $last_index = $#locations;
my @indexes = 0 .. $last_index;

# Draw record indexes at random, without replacement, until the subset is full.
for (0 .. $last_index) {
    my $rand_index = int rand @indexes;
    my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];
    if (is_new_location(\%sample, $min_distance, $chr, $pos)) {
        push @keep, $indexes[$rand_index];
        add_location(\%sample, $chr, $pos);
    }
    splice @indexes, $rand_index, 1;
    last if @keep == $subset_size;
}

if (@keep < $subset_size) {
    warn 'WARNING! Subset size [', scalar @keep,
        "] is less than the required size [$subset_size].\n";
}
else {
    # Print the chosen records in their original file order.
    print "$_\n" for @locations[sort { $a <=> $b } @keep];
}

untie @locations;

# True if $pos is at least $min away from every position kept so far on $chr.
sub is_new_location {
    my ($sample, $min, $chr, $pos) = @_;
    return 1 unless $sample->{$chr};
    for (@{$sample->{$chr}}) {
        return 0 if abs($_ - $pos) < $min;
    }
    return 1;
}

# Insert $pos into the list for $chr, keeping that list in ascending order.
sub add_location {
    my ($sample, $chr, $pos) = @_;
    my $index = 0;
    for (@{$sample->{$chr}}) {
        last if $_ > $pos;
        ++$index;
    }
    splice @{$sample->{$chr}}, $index, 0, $pos;
    return;
}
The code is fairly straightforward; however, if there's something you don't understand, feel free to ask.
Obviously, adjust $subset_size, filename, etc. to suit.
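One thing that may matter at this scale: splicing an element out of the middle of a multi-million element @indexes array shifts everything after it. A possible alternative is to swap the chosen element with the last one and pop it (a partial Fisher-Yates shuffle). This is a sketch, not what the script above does; draw_random_index is a hypothetical helper name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): removes and returns
# one randomly chosen element of @$pool without shifting the other elements.
sub draw_random_index {
    my ($pool) = @_;
    return unless @$pool;
    my $i = int rand @$pool;
    # Swap the chosen element with the last one, then pop it off the end.
    @{$pool}[$i, -1] = @{$pool}[-1, $i];
    return pop @$pool;
}

my @indexes = 0 .. 9;
my @drawn;
while (defined(my $index = draw_random_index(\@indexes))) {
    push @drawn, $index;
}
print "@drawn\n";
```

The swap changes the order of the remaining candidates, which should not matter here because they are only ever picked at random and @keep is sorted before printing.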
The test data I used and some sample runs are in the spoiler below.
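Because add_location() keeps each chromosome's positions in ascending order, the distance check only needs to look at the two neighbours of the candidate position. With ~500 locations the linear scan is unlikely to be a bottleneck, so treat this as an optional sketch for larger subsets; is_new_location_sorted is a hypothetical name, and the sample data is made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical replacement for is_new_location(); relies on add_location()
# keeping each chromosome's positions in ascending order.
sub is_new_location_sorted {
    my ($sample, $min, $chr, $pos) = @_;
    my $positions = $sample->{$chr} or return 1;

    # Binary search for the first stored position that is >= $pos.
    my ($low, $high) = (0, scalar @$positions);
    while ($low < $high) {
        my $mid = int(($low + $high) / 2);
        if ($positions->[$mid] < $pos) { $low  = $mid + 1 }
        else                           { $high = $mid }
    }

    # Only the neighbours on either side of that slot can be too close.
    return 0 if $low < @$positions && $positions->[$low] - $pos < $min;
    return 0 if $low > 0 && $pos - $positions->[$low - 1] < $min;
    return 1;
}

# Made-up sample: three positions already kept on chr1.
my %sample = (chr1 => [100, 200, 300]);
for my $pos (150, 185, 310, 320) {
    my $verdict = is_new_location_sorted(\%sample, 20, 'chr1', $pos)
        ? 'keep' : 'skip';
    print "chr1 $pos: $verdict\n";
}
```

With a minimum distance of 20, positions 150 and 320 are kept, while 185 and 310 are skipped for being within 20 of 200 and 300 respectively.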