PerlMonks
Best approach for large-scale data processing
by iangibson (Scribe) on Jul 13, 2012 at 16:18 UTC ([id://981656]=perlquestion)
iangibson has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to create a Perl program that accepts as input one file of genome loci (potentially tens of GB in size) and one region file (typically around 1 GB), and I'm trying to think of the best approach to filter the loci file based on the coordinates in the region file. The loci file looks like this (the lines are truncated here; the important columns are the first two, chromosome and position, although I'd keep the whole line for each match):
The region file looks like this:
The first column is the chromosome, the second column is the start coordinate, and the third column is the end coordinate. What I'd like to do is keep only the lines from the first (loci) file whose 'POS' falls between the start and end coordinates of some line in the second (region) file. A nested while loop would seem grossly inefficient, so I was thinking of building a hash of arrays, but before I do, I'd like to draw on the wisdom of the Monks as to whether there's a better approach. Execution speed and memory efficiency are paramount. Suggestions much appreciated.
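One way the hash-of-arrays idea could be sketched (this is an illustrative sketch, not from the original post): index the smaller region file as a hash mapping each chromosome to a sorted list of [start, end] pairs, then stream the huge loci file once, doing a binary search per line. The tab-separated column layout follows the post's description; the sub names, sample data, and the assumption that regions on one chromosome don't overlap are mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read region lines (chrom, start, end) from a filehandle into
# %regions = ( chrom => [ [start, end], ... ] ), sorted by start.
sub build_regions {
    my ($fh) = @_;
    my %regions;
    while (<$fh>) {
        chomp;
        my ($chrom, $start, $end) = split /\t/;
        push @{ $regions{$chrom} }, [ $start, $end ];
    }
    @$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %regions;
    return \%regions;
}

# Binary search: does $pos fall inside any interval for this chromosome?
# Assumes the intervals do not overlap (merge them first if they might).
sub in_region {
    my ($ivals, $pos) = @_;
    return 0 unless $ivals;
    my ($lo, $hi) = (0, $#$ivals);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($s, $e) = @{ $ivals->[$mid] };
        if    ($pos < $s) { $hi = $mid - 1 }
        elsif ($pos > $e) { $lo = $mid + 1 }
        else              { return 1 }
    }
    return 0;
}

# Demo on in-memory sample data; with real files you'd open them the
# same way and stream the loci file line by line, printing matches.
my $region_data = "chr1\t100\t200\nchr1\t500\t600\nchr2\t50\t80\n";
open my $rfh, '<', \$region_data or die $!;
my $regions = build_regions($rfh);

my $loci_data = "chr1\t150\tA\nchr1\t300\tB\nchr2\t60\tC\n";
open my $lfh, '<', \$loci_data or die $!;
while (my $line = <$lfh>) {
    my ($chrom, $pos) = split /\t/, $line;
    print $line if in_region($regions->{$chrom}, $pos);
}
```

Since the loci file is only ever streamed, memory use is bounded by the 1 GB region file (a few hundred MB of Perl structures after indexing), and each locus costs O(log n) in the number of regions on its chromosome rather than O(n) as a nested loop would.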