Pre-sorting the data seems like good advice to me,
It is quite the opposite of good advice, i.e. very bad advice.
- Sorting is O(N log N). The OP's described processing is O(N).
Sorting does not help the OP's processing at all.
- FASTA files use a multi-line record format.
If you sorted a FASTA file with the system sort utility, which sorts line by line, it would screw the file up in a completely irrecoverable way.
E.g., this:
c:\test>type 845226.fasta
>uc002yje.1 chr21:13973492-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.1 chr21:13973492-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
>uc002yje.1 chr21:13973492-13976330
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
becomes this:
c:\test>sort 845226.fasta
>uc002yje.1 chr21:13973492-13976330
>uc002yje.1 chr21:13973492-13976330
>uc002yje.1 chr21:13973492-13976330
acgcgccctacactctggcatgggggaaaaaacccggccccgcagagccctgga
acgcgccctacactctggcatgggggaacccggccccgcagagccctgga
acgcgccctacactctggcatgggggaacccggccccgcagagggccctgga
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
cccctgccccaccgcaccctggattactgcacgccaagaccctcacctga
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
CTCTGACATTGGAGGACTCCTCGGCTACGTCCTGGACTCCTGCACAAGAG
And so is rendered entirely useless.
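For contrast, here is a minimal sketch of reading FASTA a record at a time in a single O(N) pass, with no sorting involved. It is written in Python purely for illustration; the function name `read_fasta` and the tiny `sample` data are my own assumptions, not anything from the OP's code, and it assumes each record is a `>` header line followed by one or more sequence lines.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA lines in one pass.

    Hypothetical helper for illustration: each line is visited once,
    so the work is linear in the size of the input.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif line and header is not None:
            # Sequence lines belong to the most recent header.
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Made-up sample data (not from the OP's file).
sample = [
    ">seq1 demo",
    "acgt",
    "ACGT",
    ">seq2 demo",
    "ttgg",
    "CCAA",
]

records = list(read_fasta(sample))

# A line-by-line sort separates the headers from their sequence lines,
# so the same reader can no longer reconstruct the original records.
scrambled = list(read_fasta(sorted(sample)))
```

Here `records` keeps each header paired with its own sequence, whereas `scrambled` does not: once the lines are sorted, nothing in the file says which sequence line belonged to which header.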
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.