http://www.perlmonks.org?node_id=1084445

mbp has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am working with biological data consisting of locations on chromosomes. I am trying to randomly subsample this data with the following caveats:

1) all resulting locations in the new subset must be either on different chromosomes or at least a specified distance from each other if on the same chromosome

2) the subset needs to be a specific size (for example, 500 locations)

The data I have look similar to this:

chromosome1 100 100 . G T several other columns
chromosome1 110 110 . A C several other columns
chromosome1 200 200 . C T several other columns
chromosome2 125 125 . C T several other columns

In this case, the first column is the identity of the chromosome, the second and third columns are the position on the chromosome in base pairs (and always will be identical for a given line, as shown), and then there are a number of other columns that are not of direct importance for this exercise, but need to be maintained for the final output. Using the above as an example dataset, I might want to randomly select out two positions (lines) that are at least 20 base pairs apart from each other or on separate chromosomes. Possible outputs from this would include lines 1 and 3, 1 and 4, 2 and 3, or 2 and 4. On the other hand, lines 1 and 2 would not be allowed as the positions are too close (only 10 base pairs apart on the same chromosome). I'm afraid that this is going one or two levels above my skill set with perl. So far, I am fine to make a new array from a random subset of lines:

use List::Util qw(shuffle);
use strict;

my $file = $ARGV[0];
open (VCF, $file);
my @array = <VCF>;    # read file into array
my @newvcf;

for (my $i = 0; $i < 1000000; ++$i) {    # giving script plenty of room to work...
    my $randomline = $array[rand @array];    # randomize lines of file
    if (scalar @newvcf < 2) {
        push (@newvcf, $randomline);    # build new array/subset of lines
    }
}

But I am having a difficult time figuring out how to filter on two fields of my lines. I have looked into 2d arrays with some hope (definitely a new area for me), but I am concerned that the memory usage might be prohibitive as my file sizes can be up to 4-5 Gb. For that matter, I suppose reading the entire file into an array might not be the best strategy in the first place...?

If it is of use, my pseudocode would look something like this:

either randomize the file first (for example with unix 'shuf') and read the first
line of the randomized file, or slurp the entire file into an array and then
randomize it

compare the first random line with the second random line:
- IF the first field of the second line (chromosome) matches the first field of the
  first line AND the second field of the second line (position) minus the second
  field of the first line is less than X and greater than -X (i.e. the two positions
  are within X of each other), discard the second line
- ELSE keep both the first and the second line

compare the third random line to the first random line as above, and to the second
random line (provided the second line was kept and the third was not already
discarded by the comparison with the first)

continue until a new collection of a specified number of random lines is generated,
with no lines containing positions on the same chromosome and within X distance of
one another.
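The key comparison, if I have it right, would be something along these lines (an untested sketch only; the sub name and arguments are just made up, with the field positions taken from my example above):

# Untested sketch: true if $candidate may join the lines kept so far,
# i.e. it is on a different chromosome than every kept line, or at
# least $min_dist base pairs away from every kept position on the
# same chromosome.
sub is_far_enough {
    my ($candidate, $kept, $min_dist) = @_;
    my ($chr, $pos) = (split ' ', $candidate)[0, 1];
    for my $line (@$kept) {
        my ($kchr, $kpos) = (split ' ', $line)[0, 1];
        next if $kchr ne $chr;                        # different chromosome: no conflict
        return 0 if abs($kpos - $pos) < $min_dist;    # same chromosome, too close
    }
    return 1;
}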

Does that make sense? Hopefully not too confusing, and any advice would be greatly appreciated. Thanks in advance for your thoughts.

Replies are listed 'Best First'.
Re: Building a new file by filtering a randomized old file on two fields
by kcott (Archbishop) on Apr 30, 2014 at 10:58 UTC

    G'day mbp,

    Welcome to the monastery.

    To get around potential memory issues (due to 4-5 Gb files), you can use Tie::File. This will not load the entire file into memory. In the following code, I just keep an index of the elements in an array. I'm making a very rough guess (as I don't know the record size) that this will need ~100 MB.

    You give an example of 500 locations for a subset. Assuming that's a realistic number, the memory required for the other data structures is minimal.

    Here's my test script:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    use Tie::File;

    my $subset_size  = 4;
    my $min_distance = 20;
    my (%sample, @keep);

    tie my @locations, 'Tie::File', './pm_1084445_locations.txt';

    my $last_index = $#locations;
    my @indexes    = 0 .. $last_index;

    for (0 .. $last_index) {
        my $rand_index = int rand @indexes;
        my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];

        if (is_new_location(\%sample, $min_distance, $chr, $pos)) {
            push @keep, $indexes[$rand_index];
            add_location(\%sample, $chr, $pos);
        }

        splice @indexes, $rand_index, 1;
        last if @keep == $subset_size;
    }

    if (@keep < $subset_size) {
        warn 'WARNING! Subset size [', scalar @keep,
            "] is less than the required size [$subset_size].\n";
    }
    else {
        print "$_\n" for @locations[sort { $a <=> $b } @keep];
    }

    untie @locations;

    sub is_new_location {
        my ($sample, $min, $chr, $pos) = @_;

        return 1 unless $sample->{$chr};

        for (@{$sample->{$chr}}) {
            return 0 if abs($_ - $pos) < $min;
        }

        return 1;
    }

    sub add_location {
        my ($sample, $chr, $pos) = @_;

        my $index = 0;

        for (@{$sample->{$chr}}) {
            last if $_ > $pos;
            ++$index;
        }

        splice @{$sample->{$chr}}, $index, 0, $pos;

        return;
    }

    The code is fairly straightforward; however, if there's something you don't understand, feel free to ask.

    Obviously, adjust $subset_size, filename, etc. to suit. The test data I used and some sample runs are in the spoiler below.

    -- Ken

      To get around potential memory issues (due to 4-5 Gb files), you can use Tie::File. This will not load the entire file into memory.

      That code will be horribly slow.

      This single line:

      my $last_index = $#locations;

      Will cause the entire file to be read through a tiny buffer.

      And this line:

      my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];

      will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records.

      Many parts of the file will be read and re-read many times. Performance will be abysmal.

      This algorithm will fairly pick random lines using a single pass over that file.
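      For illustration only, a single-pass selection of $k fairly chosen lines looks roughly like this reservoir-sampling sketch ($file and $k are placeholders, and the OP's distance filter is deliberately left out):

      # Reservoir sampling sketch: keep $k lines chosen uniformly at random
      # from a file of unknown length, reading it exactly once.
      my ($file, $k) = ('input.vcf', 500);       # placeholders
      my (@reservoir, $seen);
      $seen = 0;

      open my $fh, '<', $file or die "open $file: $!";
      while (my $line = <$fh>) {
          ++$seen;
          if (@reservoir < $k) {
              push @reservoir, $line;            # fill the reservoir first
          }
          elsif (rand($seen) < $k) {             # then replace with probability k/seen
              $reservoir[int rand $k] = $line;
          }
      }
      close $fh;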



        Update: Your "This algorithm" link target has changed (presumably following Consideration). The content it now links to is completely different from the previous content linked to. Ignore any comments I've made which no longer make any sense (I think I've striken most, if not all, of them). An Update in your post would have been useful!

        "Will cause the entire file to be read through a tiny buffer."

        The default read cache size is 2MiB. Is this what you're referring to as "a tiny buffer"? It can be changed with the memory option: what size would you suggest?
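        For reference, the cache size is given in bytes when tying; the 20MB figure below is just an arbitrary example, not a recommendation:

        use Tie::File;

        # memory => read cache limit in bytes; setting it to 0 disables the
        # read cache entirely.
        tie my @locations, 'Tie::File', './pm_1084445_locations.txt',
            memory => 20_000_000
            or die "Cannot tie file: $!";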

        "... will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records."

        I suspect this is a reference to your "This algorithm" link: see the next point.

        "This algorithm [Link removed - see update above] will fairly pick random lines from using a single pass over that file."

        That links to "Perl Cookbook: Recipe 8.6. Picking a Random Line from a File". The OP wants to pick each line at random, not just one line at random.

        [By the way, the URL associated with that link (i.e. http://docstore.mik.ua/orelly/perl/cookbook/ch08_07.htm) is questionable: it's O'Reilly material provided by someone else. I've seen arguments for and against this specific one, so I'm just pointing it out.]

        I have a 10Gb test file (100,000,000 records of 100 bytes each) which I used to test my solution. Unsurprisingly, this data bore no resemblance to the OP's data, so I added some additional code to create dummy chromosome and position values:

        #my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];
        my ($chr, $pos) = (split '', $locations[$indexes[$rand_index]])[0, 1];
        $chr = 'chromosome' . ($_ % 4 + 1);
        $pos = $_ + 10;

        I also changed the subset size from my original test value of 4 to the OP's example of 500.

        With twice the expected file size and additional processing, this took almost exactly 40 minutes.

        I tried another solution (without Tie::File) using tell to create the index and seek to locate the wanted random records: beyond this, the rest of the code was unchanged. This solution also took almost exactly 40 minutes.
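        In outline, that second approach built an index of byte offsets with tell and then jumped to the wanted records with seek; a skeleton of it (not the exact code I ran, and the filename is a stand-in) looks like this:

        use strict;
        use warnings;

        my $file = './pm_1084445_locations.txt';

        # Pass 1: record the byte offset at which every line starts.
        open my $fh, '<', $file or die "open $file: $!";
        my @offsets;
        my $pos = 0;
        while (<$fh>) {
            push @offsets, $pos;
            $pos = tell $fh;                  # start of the next line
        }

        # Later: jump straight to any randomly chosen record.
        my $i = int rand @offsets;
        seek $fh, $offsets[$i], 0 or die "seek $file: $!";
        my $random_line = <$fh>;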

        While other solutions may be faster, I don't consider this to be "horribly slow" or exhibiting "abysmal" performance.

        -- Ken

Re: Building a new file by filtering a randomized old file on two fields
by Anonymous Monk on Apr 30, 2014 at 07:41 UTC

    So you only ever need at most N random lines?

    Here is what I'd do, in pseudocode:

    my @candidates = sort nRandomIntegers( $wantedN );
    my @chosen;
    while( "not done" ){
        $line = readline $file;
        "keep reading if current line is not first candidate"; #:)
        if( satisfyConditions( $lineish , \@chosen ) ){
            push @chosen, $lineish;
            shift @candidates;
        }
        "stop reading if chosen wantedN lines";
        "pick new candidates when run out of candidates";
        "rewind filehandle if reach the end but not wantedN";
        "give up, rewinded three times, never gonna happen";
    }

    Does that make sense? Any questions?
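    Roughly fleshed out, it might look like this untested sketch (the file name, subset size, minimum distance and the 3-pass give-up rule are placeholders/assumptions):

    use strict;
    use warnings;

    my ($file, $wanted_n, $min_dist) = ('input.vcf', 500, 20);   # placeholders

    open my $fh, '<', $file or die "open $file: $!";
    my $total = 0;
    $total++ while <$fh>;                        # one pass just to count lines
    die "file has only $total lines\n" if $total < $wanted_n;

    my (@chosen, %seen);
    my $passes = 0;

    while (@chosen < $wanted_n && $passes++ < 3) {
        # "nRandomIntegers": a batch of distinct candidate line numbers
        my %pick;
        $pick{ int rand $total } = 1 while keys %pick < $wanted_n;

        seek $fh, 0, 0 or die "seek: $!";        # rewind for another pass
        my $line_no = 0;
        while (my $line = <$fh>) {
            last if @chosen >= $wanted_n;
            next unless delete $pick{ $line_no++ };
            push @chosen, $line if satisfy_conditions($line, \%seen, $min_dist);
        }
    }

    warn "gave up: only ", scalar @chosen, " lines found\n" if @chosen < $wanted_n;
    print for @chosen;

    # Keep a line only if it is on a new chromosome, or at least $min_dist
    # base pairs from every kept position on the same chromosome.
    sub satisfy_conditions {
        my ($line, $seen, $min_dist) = @_;
        my ($chr, $pos) = (split ' ', $line)[0, 1];
        for my $kept (@{ $seen->{$chr} || [] }) {
            return 0 if abs($kept - $pos) < $min_dist;
        }
        push @{ $seen->{$chr} }, $pos;
        return 1;
    }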

Re: Building a new file by filtering a randomized old file on two fields
by mbp (Novice) on Apr 30, 2014 at 23:32 UTC
    Hi Perl Monks,

    Thanks so much for your quick input. kcott, it seems that your script is working quite well, and I greatly appreciate your work, especially going the extra distance to reformat your own data and test out your script(!). You are correct in that the operating time is not a huge issue here - if I can process a file in an hour or less I'll be plenty happy. I'm still digesting the script you wrote, but I do wonder if you or others have any thoughts on changing the read cache size. As you mentioned, and from this link:

    http://perldoc.perl.org/Tie/File.html

    (section 'memory')

    it looks like it is easy to adjust the cache size, and that perhaps decreasing the cache memory limit might be appropriate as I am dealing with files of many short records. Perhaps the best thing for me to do will be to benchmark with a few different memory settings.

    Thank you also to sundialsvc4, Anonymous Monk, BrowserUK and RonW for your input. Anonymous Monk and BrowserUK, thank you for the pseudocode; that is an interesting blog post and wiki page. Reservoir sampling definitely seems a good way of approaching the problem of sampling from a large file, although it is still a fair bit above my head. sundialsvc4, that is another interesting way to look at it. I am used to thinking in fields/columns rather than bytes. Unfortunately (sorry, I should have made this clearer in the original post), while the number of columns is consistent between lines, the number of characters/bytes is not. I really appreciate all of your help!

Re: Building a new file by filtering a randomized old file on two fields
by sundialsvc4 (Abbot) on Apr 30, 2014 at 11:24 UTC

    The fact that you have multi-gigabyte files in this case might work in your favor, allowing you to create a simpleton algorithm that just might nonetheless produce effective results.

    I would suggest treating the files as random-access files containing fixed-length records. (The end-of-line indicator could be one or two bytes depending on the system. Just read one line and note the file-position after you do so ... that’s your record-length in bytes. Wanna tie to a file? Feel free.™ Doesn’t matter.)

    Let’s get to work. Generate a list of random integers (modulo the size of the file in records) that is some-“small x” times as long as the number of records you need. Sort the list and drop any duplicates (sorting makes them adjacent, so they are easy to remove). Duplicates are unlikely anyway and can be ignored. (You will still have enough.) Now, seek and read each of those records, anticipating that this, by itself, will in fact contain an acceptable random sample. (Suspiciously apply a regular-expression to each line that you know will pass, and die() if it does not, because this would indicate that there is a bug ... and there’s always one more bug.) Here is your bucket of candidates.

    Now, walk through this bucket like a fisherman, throwing acceptable records into another bucket (the result ...), and skipping those which for any reason do not qualify to be in the result. (Since the list of candidates is randomly picked from the file, you do not have to again pick at random from that list. “Double random” is not “more random.”) You can simply walk the randomly-pulled list from head to tail. If you fill the results-bucket first, “you win.” If you reach the end of randomly-pulled records without filling the bucket, you lose ... just try again. If you try and fail several times, die() in disgrace. But you probably won’t. (If you do, bump-up your “small x.”)

    You have, apparently, millions of records from which to randomly draw, and you need a comparatively small sample (500) from that source. If yours is a truly random draw, where any record anywhere has an equal probability of being picked, then I would expect that within a draw of, say, 2000 records from such a file, you will nearly always find a set of records that will qualify, pretty much no matter how stringent your criteria may be. This is a simpleton algorithm that, I predict, will get the job done quite nicely because you have a large population to draw from. Even for large samples.
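    A rough sketch of that draw (the filename and the "small x" oversampling factor are placeholders, and it only works if the records really are fixed-length):

    use strict;
    use warnings;

    my ($file, $wanted_n, $oversample) = ('input.vcf', 500, 4);   # placeholders

    open my $fh, '<', $file or die "open $file: $!";
    my $first   = <$fh>;
    my $rec_len = length $first;              # bytes per record, newline included
    my $records = (-s $file) / $rec_len;      # exact only if records are fixed-length

    # Draw "small x" times as many random record numbers as needed,
    # de-duplicated (via the hash) and sorted.
    my %pick;
    $pick{ int rand $records } = 1 for 1 .. $wanted_n * $oversample;
    my @candidates = sort { $a <=> $b } keys %pick;

    # Seek to each candidate record and sanity-check it.
    my @bucket;
    for my $rec (@candidates) {
        seek $fh, $rec * $rec_len, 0 or die "seek: $!";
        defined(my $line = <$fh>) or die "short read at record $rec";
        $line =~ /^\S+\s+\d+\s+\d+\s/ or die "record $rec looks wrong: $line";
        push @bucket, $line;
    }
    # @bucket is the candidate bucket; walk it head to tail, applying the
    # chromosome/distance filter, and stop once the result holds $wanted_n lines.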

    Note that a sequential walk through the file, “flipping a coin each time,” is not the same, statistically. Any actual implementation would heavily favor the head of the file. It is important that the initial draw from the file should be perfectly pseudo-random, with each and every record having an equal probability of being selected for initial consideration. Hence, use random access.

      fixed-length records

      Just a note, I believe the OP didn't say that the lines are fixed-length.

      Note that a sequential walk through the file, "flipping a coin each time," is not the same, statistically. Any actual implementation would heavily favor the head of the file.

      I believe this would help: http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/

        That is a very important point, and it is one of the key reasons why I recommended testing each retrieved record against a regex ... if any assumption (such as this) upon which the logic depends is found not to hold, the program must die. Many data-files of this kind are fixed-length.

        Another trick, which actually works just about as well, is to randomly select byte positions within the file, then seek() to that position, read a line of data and throw it away, then read the next line and keep that. Verifying, of course, that the second line appears plausible. The seek() is presumed to have dropped us smack-dab into the middle of a record, and (for this trick to actually work ...) there must not be funky multi-character issues. In other words, the program must be able to somehow “land on its feet” when it reads the second record.
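        In code, that trick is roughly the following (the filename is a placeholder; the regex is the plausibility check mentioned above):

        use strict;
        use warnings;

        my $file = 'input.vcf';                           # placeholder
        open my $fh, '<', $file or die "open $file: $!";
        my $size = -s $file;

        seek $fh, int rand $size, 0 or die "seek: $!";    # land somewhere in the middle
        <$fh>;                                            # discard the (probably partial) line we fell into
        my $line = <$fh>;                                 # keep the next complete line
        if (!defined $line) {                             # landed inside the final line: wrap to the start
            seek $fh, 0, 0;
            $line = <$fh>;
        }
        die "implausible record: $line" unless $line =~ /^\S+\s+\d+\s+\d+\s/;

        One caveat: with variable-length lines, a line that follows a long line is more likely to be landed on, so this draw is only approximately uniform.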

        The algorithm reference looks very nice. Thanks for sharing.