If we're going for raw speed: don't use Perl. :)
If I were doing this in assembly and wanted raw speed, I'd:
- Generate all of the possible combinations and their sorted values, like so: AA => AA, AB => AB, BA => AB, AC => AC, CA => AC.
- Generate code (don't write it by hand!) that does something along the lines of the following pseudocode, which handles aa, ab, ba, and bb:
    if (substr($base,0,1) eq 'a') {
        if (substr($base,1,1) eq 'a') {
            return 'aa';
        }
        if (substr($base,1,1) eq 'b') {
            return 'ab';
        }
    }
    if (substr($base,0,1) eq 'b') {
        if (substr($base,1,1) eq 'a') {
            return 'ab';
        }
        if (substr($base,1,1) eq 'b') {
            return 'bb';
        }
    }
This means that for any code of length n over an alphabet of size q, at most n*q comparisons and jumps are needed in the worst case. (For example, assuming each level tests A, C, G, T in that order, AGCA would be translated to AACG using only 7 comparisons and jumps in total: 1 + 3 + 2 + 1.)
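For what it's worth, here is a rough sketch of the "generate, don't hand-write" step in Perl. The helper names (emit_branches, gen_sorter_source, count_comparisons) are made up for illustration, and the alphabet order A, C, G, T is an assumption. The generator emits the nested ifs as source text and compiles them with string eval; the counter just walks the same decisions to tally comparisons.

```perl
use strict;
use warnings;

# Hypothetical helper: emit the nested ifs for every string starting with
# $prefix. At depth $n the whole input is known, so the leaf simply returns
# the precomputed sorted string.
sub emit_branches {
    my ($n, $alphabet, $prefix) = @_;
    my $depth  = length $prefix;
    my $indent = '    ' x ($depth + 1);
    if ($depth == $n) {
        my $sorted = join '', sort split //, $prefix;
        return "${indent}return '${sorted}';\n";
    }
    my $src = '';
    for my $ch (@$alphabet) {
        $src .= "${indent}if (substr(\$base, ${depth}, 1) eq '${ch}') {\n"
              . emit_branches($n, $alphabet, $prefix . $ch)
              . "${indent}}\n";
    }
    return $src;
}

# Hypothetical helper: wrap the branches in an anonymous sub. The trailing
# bare return is reached only for input outside the alphabet.
sub gen_sorter_source {
    my ($n, @alphabet) = @_;
    return "sub {\n    my (\$base) = \@_;\n"
         . emit_branches($n, \@alphabet, '')
         . "    return;\n}\n";
}

# Tally the comparisons the generated tree makes for one input, assuming
# every character is in the alphabet and each level stops at the first match.
sub count_comparisons {
    my ($base, @alphabet) = @_;
    my $count = 0;
    for my $ch (split //, $base) {
        for my $candidate (@alphabet) {
            $count++;
            last if $candidate eq $ch;
        }
    }
    return $count;
}

my @alphabet   = qw(A C G T);
my $source     = gen_sorter_source(4, @alphabet);
my $sort_bases = eval $source
    or die "generated code failed to compile: $@";

print $sort_bases->('AGCA'), "\n";                   # AACG
print count_comparisons('AGCA', @alphabet), "\n";    # 1 + 3 + 2 + 1 = 7
print count_comparisons('TTTT', @alphabet), "\n";    # worst case: 4 * 4 = 16
```

Note that the generated source has q**n leaves (256 for four bases of length 4), which is exactly why you'd generate it rather than type it, and also why this only makes sense for short codes.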
I'm fairly confident that this would outperform any solution using a hash or a split/join/sort, at least in assembly. I'm just a little too harried to write the code to show whether it would also be faster in Perl.
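If someone wants to check the Perl side, a harness along these lines might be a starting point. It is only a sketch: the generator is a compressed, hypothetical version of the nested-if idea, the four-base alphabet and length 4 are assumptions, and I'm not claiming any particular outcome. It uses the core Benchmark module's cmpthese to compare the generated nested ifs, a precomputed hash, and split/sort/join over every possible input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @alphabet = qw(A C G T);   # assumed alphabet
my $n        = 4;             # assumed code length

# Hypothetical compact generator for the nested-if source, all on one line.
sub branches {
    my ($prefix) = @_;
    my $depth = length $prefix;
    if ($depth == $n) {
        my $sorted = join '', sort split //, $prefix;
        return "return '${sorted}';";
    }
    my $src = '';
    for my $ch (@alphabet) {
        $src .= "if (substr(\$base, ${depth}, 1) eq '${ch}') { "
              . branches($prefix . $ch) . " } ";
    }
    return $src;
}

my $source    = 'sub { my ($base) = @_; ' . branches('') . ' return; }';
my $nested_if = eval $source
    or die "generated code failed to compile: $@";

# Every possible input of length $n over the alphabet.
my @inputs = ('');
for (1 .. $n) {
    @inputs = map { my $p = $_; map { $p . $_ } @alphabet } @inputs;
}

# Precomputed hash from each input to its sorted form.
my %lookup = map { $_ => join('', sort split //, $_) } @inputs;

# Sanity check: all approaches must agree before timing anything.
for my $in (@inputs) {
    my $expected = join '', sort split //, $in;
    die "mismatch for $in\n"
        unless $nested_if->($in) eq $expected && $lookup{$in} eq $expected;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    nested_if  => sub { my $r; $r = $nested_if->($_) for @inputs },
    hash       => sub { my $r; $r = $lookup{$_} for @inputs },
    split_sort => sub { my $r; $r = join('', sort split //, $_) for @inputs },
});
```

Whatever the table says, it only speaks for Perl, where each substr and eq is an interpreted op. It tells you nothing about the assembly case.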