perlquestion
rnaeye
<p>
Hi!
I have very large text files (10GB to 15GB). Each one contains about 40 million lines and 7 fields (see example below).</p>
<code>
File looks like this:
99_999_852_F3 chr9 97768833 97768867 ATTTTCTTCAATTACATTTCCAATGCTATCCCAAA + 35
99_999_852_F3 chr9 97885645 97885679 ATTTTCTTCAaTTACATTTCCAATGCTATCCCAAA + 35
99_99_994_F3 chr10 47028821 47028855 AGACAAAAAGGCCATCAACAGATCAGTAAAGGATC + 35
...
</code>
<p>I need to sort the files on field 1 (ASCII order). I am using the Unix
<code> sort -k1 </code> command. Although it works fine, it takes a very long time: 30 minutes to 1 hour. I also tried the following Perl script:</p>
<code>
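#!/usr/bin/perl
use strict;
use warnings;

# Lexical filehandles and three-argument open, with the file name in each error.
open(my $in,  '<', 'inputfile.txt') or die "Cannot open inputfile.txt: $!";
open(my $out, '>', 'sorted.txt')    or die "Cannot open sorted.txt: $!";

# <$in> in list context reads every line into memory before sort() runs.
# Lines are compared as whole strings, so field 1 leads the comparison.
foreach my $line (sort <$in>) {
    print {$out} $line;
}

close($out) or die "Cannot close sorted.txt: $!";
close($in);
exit;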
</code>
<p>However, this script reads the entire file into memory, and the sort becomes too slow. Could someone suggest a Perl script that sorts faster than the Unix <code> sort -k1 </code> command and does not use too much memory? Thanks.</p>