PerlMonks  

Re: sorting very large text files

by MadraghRua (Vicar)
on Dec 21, 2009 at 21:11 UTC (#813766)


in reply to sorting very large text files

hello rnaeye (nice pun)

Check out the SAMtools project (I think it's over on SourceForge). Depending on the aligner you are using, either there are tools to munge its output into a SAM or BAM file, or the aligner may already produce one. SAM is the generic alignment format for short reads; BAM is its binary equivalent. Once you have the file in SAM or BAM format, you can use SAMtools to sort the short reads and build an index. From there you can use the index to pull out chromosome-specific reads, etc. You'll then need something like a BAM->BED converter to use downstream tools or to display the reads in the genome browser of your choice. I typically find that converting a SAM or read file into BAM reduces the file to about 10% of its text size. Of course you then need to be able to read the BAM file, and that is where downstream tools like Picard, Bio-SamTools (Perl), Pysam and cl-sam come in.
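To make the convert, sort, index, extract sequence concrete, here is a rough sketch that only builds the command lines and hands them to samtools. It assumes a recent samtools (1.x) on your PATH; the option syntax was different in the 2009-era releases, so check `samtools --help` for your version. The function names, file names and the region are made up for illustration.

```python
import subprocess


def samtools_commands(sam_path, region):
    """Build the samtools calls for SAM -> BAM -> sorted BAM -> index -> one region."""
    # Derive output names from the input name (naming scheme is an assumption).
    stem = sam_path[:-4] if sam_path.endswith(".sam") else sam_path
    bam = stem + ".bam"
    sorted_bam = stem + ".sorted.bam"
    region_bam = stem + "." + region + ".bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],                 # SAM text -> binary BAM
        ["samtools", "sort", "-o", sorted_bam, bam],                     # sort by coordinate
        ["samtools", "index", sorted_bam],                               # build the index
        ["samtools", "view", "-b", "-o", region_bam, sorted_bam, region],  # pull one chromosome
    ]


def run_pipeline(sam_path, region):
    """Run each step in order, stopping at the first failure."""
    for cmd in samtools_commands(sam_path, region):
        subprocess.run(cmd, check=True)
```

Something like `run_pipeline("reads.sam", "chr1")` would then leave `reads.sorted.bam`, its index, and `reads.chr1.bam` next to the input, provided the SAM file has a header and samtools is installed.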

If you want to roll your own, follow the good monks' advice: split the file by chromosome, writing the aligned reads for each chromosome to its own file, then sort each of those files individually. You can concatenate them all back together at the end.
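A minimal sketch of that split, sort and concatenate idea follows. I don't know your file layout, so it assumes tab-separated lines with the chromosome name in the first column and an integer position in the second; the function name and the column defaults are hypothetical, so adjust them to your aligner's output. It also assumes each single chromosome fits in memory and that there are few enough chromosomes to keep one file handle open per chromosome.

```python
import os
import tempfile


def sort_by_chromosome(in_path, out_path, chrom_col=0, pos_col=1, sep="\t"):
    """Split the input per chromosome, sort each part by position, then concatenate."""
    with tempfile.TemporaryDirectory() as tmpdir:
        parts = {}  # chromosome name -> open handle on its temporary file

        # Pass 1: stream the big file once, routing each line to its chromosome file.
        with open(in_path) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                if not line.endswith("\n"):
                    line += "\n"
                chrom = line.rstrip("\n").split(sep)[chrom_col]
                if chrom not in parts:
                    # Number the temp files so odd chromosome names never hit the file system.
                    part_path = os.path.join(tmpdir, "%d.txt" % len(parts))
                    parts[chrom] = open(part_path, "w")
                parts[chrom].write(line)
        for handle in parts.values():
            handle.close()

        # Pass 2: sort one chromosome at a time and append it to the output.
        with open(out_path, "w") as out:
            for chrom in sorted(parts):  # plain string order, so chr10 comes before chr2
                with open(parts[chrom].name) as part:
                    lines = part.readlines()
                lines.sort(key=lambda l: int(l.rstrip("\n").split(sep)[pos_col]))
                out.writelines(lines)
```

Only one chromosome's reads are held in memory at any time, which is the whole point of splitting first. If even a single chromosome is too big, the same trick can be applied again within it, for example by position range.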

Good luck!

MadraghRua
yet another biologist hacking perl....

