No such thing as a small change
PerlMonks |
Update: Your "This algorithm" link target has changed (presumably following Consideration). The content it now links to is completely different from the content previously linked to. Ignore any comments I've made which no longer make sense (I think I've struck most, if not all, of them). An Update in your post would have been useful!

"Will cause the entire file to be read through a tiny buffer."

The default read cache size is 2MiB. Is this what you're referring to as "a tiny buffer"? It can be changed with Tie::File's memory option: what size would you suggest?

"... will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records."

That links to "Perl Cookbook: Recipe 8.6. Picking a Random Line from a File". The OP wants to pick each line at random, not just one line at random.
I have a 10GB test file (100,000,000 records of 100 bytes each) which I used to test my solution. Unsurprisingly, this data bore no resemblance to the OP's data, so I added some additional code to create dummy chromosome and position values:
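The original dummy-data code isn't reproduced here, so the following is only a minimal sketch of generating such a test file. The filename, field layout, and chromosome/position ranges are all assumptions for illustration, scaled down to 1,000 records.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch (not the author's code): write fixed-width
# 100-byte records, each carrying a dummy chromosome and position.
my $file    = 'dummy.txt';  # assumed filename
my $records = 1_000;        # scaled down from 100,000,000

open my $fh, '>', $file or die "open: $!";
for (1 .. $records) {
    my $chr = 'chr' . (1 + int rand 22);      # chr1 .. chr22
    my $pos = 1 + int rand 1_000_000;         # arbitrary position range
    # Pad to 99 characters; the newline makes each record 100 bytes.
    printf {$fh} "%-99s\n", "$chr\t$pos";
}
close $fh or die "close: $!";
```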
I also changed the subset size from my original test value of 4 to the OP's example of 500. With twice the expected file size and the additional processing, this took almost exactly 40 minutes.

I tried another solution (without Tie::File), using tell to create the index and seek to locate the wanted random records; beyond this, the rest of the code was unchanged. This solution also took almost exactly 40 minutes.

While other solutions may be faster, I don't consider this to be "horribly slow" or to exhibit "abysmal" performance.

-- Ken

In reply to Re^3: Building a new file by filtering a randomized old file on two fields
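A minimal, self-contained sketch of the tell/seek technique mentioned above: one pass records each line's byte offset with tell, then seek jumps straight to randomly chosen records. The filename, record layout, and subset size here are illustrative assumptions, not the code actually benchmarked.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Create a small illustrative input file (assumed layout, 100 records).
my $file = 'sample.txt';
open my $out, '>', $file or die "open: $!";
printf {$out} "chr%d\t%d\n", 1 + $_ % 22, 1_000 * $_ for 1 .. 100;
close $out or die "close: $!";

my $subset = 10;

open my $fh, '<', $file or die "open: $!";

# First pass: index the starting byte offset of every record.
my @offset = (tell $fh);
push @offset, tell $fh while <$fh>;
pop @offset;    # the final tell() is EOF, not the start of a record

# Choose $subset distinct random record numbers.
my %pick;
$pick{ int rand @offset } = 1 while keys %pick < $subset;

# Second pass: seek directly to each chosen record and read it alone.
my @chosen;
for my $i (sort { $a <=> $b } keys %pick) {
    seek $fh, $offset[$i], 0 or die "seek: $!";
    push @chosen, scalar <$fh>;
}
close $fh;

print @chosen;
```

Sorting the chosen offsets before seeking keeps the reads in file order, which reduces head movement on spinning disks.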
by kcott