Re: sorting very large text files

by MadraghRua (Vicar)
on Dec 21, 2009 at 21:11 UTC (#813766)

in reply to sorting very large text files

hello rnaeye (nice pun)

Check out the SAMtools project (I think it's hosted over on SourceForge). Depending on the aligner you are using, either there are tools to munge its output into a SAM or BAM file, or the aligner already produces one. SAM is the generic text alignment format for short reads; BAM is its binary equivalent. Once your data is in SAM or BAM format, you can use samtools to sort the short reads and build an index. From there you can use the index to pull out chromosome-specific reads, etc. You'll then need something like a BAM->BED converter to feed downstream tools or to display the reads in the genome browser of your choice.

I typically find that converting a SAM or read file into BAM shrinks it to about 10% of the text size. Of course you then need to be able to read the BAM file, and that is where libraries like Picard (Java), Bio-SamTools (Perl), Pysam (Python) and cl-sam (Common Lisp) come in.
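To make the convert, sort, index and extract steps concrete, here is a rough Python sketch that only builds the command lines and hands them to subprocess. It assumes a samtools 1.x binary on your PATH (older releases used a different `sort` syntax) and a SAM file that carries its @SQ header lines. The file names and the helper names `samtools_commands` and `run_pipeline` are made up for illustration.

```python
import os
import subprocess


def samtools_commands(sam_path, region="chr1"):
    """Build samtools 1.x command lines: SAM -> BAM -> sorted BAM -> index -> region."""
    stem = os.path.splitext(sam_path)[0]
    bam = f"{stem}.bam"
    sorted_bam = f"{stem}.sorted.bam"
    region_bam = f"{stem}.{region}.bam"
    return [
        # Convert text SAM to binary BAM (assumes the SAM has a header).
        ["samtools", "view", "-b", "-o", bam, sam_path],
        # Coordinate-sort the BAM; this is required before indexing.
        ["samtools", "sort", "-o", sorted_bam, bam],
        # Write the index next to the sorted BAM.
        ["samtools", "index", sorted_bam],
        # Use the index to pull out the reads for one chromosome.
        ["samtools", "view", "-b", "-o", region_bam, sorted_bam, region],
    ]


def run_pipeline(sam_path, region="chr1"):
    """Run each step in order, stopping at the first failure."""
    for cmd in samtools_commands(sam_path, region):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("reads.sam", "chr2")` would leave `reads.sorted.bam`, its index and `reads.chr2.bam` next to the input, provided samtools is installed.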

If you want to roll your own, follow the good monks' advice: split the file by parsing the aligned reads into one file per chromosome, then sort each chromosome file individually. You can concatenate it all back together at the end.
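The roll-your-own route could look something like the sketch below. It is only an illustration under assumptions of mine: the input is tab-separated with the chromosome in the third column and the position in the fourth (as in SAM), header lines start with "@", and each per-chromosome file fits in memory once split. The function name `sort_by_chromosome` and its parameters are hypothetical.

```python
import os
import tempfile


def sort_by_chromosome(in_path, out_path, chrom_col=2, pos_col=3):
    """Split reads per chromosome, sort each chunk by position, then concatenate."""
    tmp_dir = tempfile.mkdtemp(prefix="chrom_split_")
    handles = {}  # chromosome name -> temp file handle

    # Pass 1: stream the big file once, appending each read to its chromosome file.
    with open(in_path) as src:
        for line in src:
            if line.startswith("@") or not line.strip():
                continue  # skip SAM-style header lines and blank lines
            chrom = line.split("\t")[chrom_col]
            if chrom not in handles:
                path = os.path.join(tmp_dir, f"{len(handles)}.txt")
                handles[chrom] = open(path, "w")
            handles[chrom].write(line if line.endswith("\n") else line + "\n")
    for fh in handles.values():
        fh.close()

    # Pass 2: sort each (assumed small enough) chromosome file in memory
    # and append it to the output, chromosomes in plain string order.
    with open(out_path, "w") as out:
        for chrom in sorted(handles):
            path = handles[chrom].name
            with open(path) as fh:
                reads = fh.readlines()
            reads.sort(key=lambda r: int(r.split("\t")[pos_col]))
            out.writelines(reads)
            os.remove(path)
    os.rmdir(tmp_dir)
```

Two caveats: the chromosomes come out in string order (chr1, chr10, chr2, ...), and the sketch keeps one file handle open per chromosome, which is fine for a normal genome but could hit the open-file limit on an assembly with thousands of contigs.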

Good luck!

yet another biologist hacking perl....
