
Sorting characters within a string

by robsv (Curate)
on Aug 24, 2001 at 03:14 UTC (#107537=perlquestion)

robsv has asked for the wisdom of the Perl Monks concerning the following question:

I am calling an external routine which returns a string containing 2, 3, or 4 letters in the order in which they are read. I need to sort this string before outputting it. For example, if $bases = 'GCT', I need to change $bases to 'CGT' (the fine print: I'm playing with DNA, so the alphabet is 'ATCGN'). I'm currently doing this:
$bases = join '',sort split('',$bases);
...which seems like a bit of overkill if the string will always be 2-4 characters. Since There's More Than One Way To Do It, I was wondering what other ways there were to do it. (This isn't meant to be a Golf question, but golfers are welcome!)

- robsv

Replies are listed 'Best First'.
Re: Sorting characters within a string
by kjherron (Pilgrim) on Aug 24, 2001 at 04:13 UTC
    I can think of a couple other ways to do it, but they're both worse than yours unless you're having performance problems:

    1) Generate every possible string and its sorted version, storing them in a hash with the unsorted string as the key & the sorted string as the value. There's only, what, 45 possible strings? That's doable.

    2) Split the string into characters, count the number of each character, then output the characters in order based on the counts. This is O(n) so it'd be a win if your strings were really long, but it's just overkill for these short strings.
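Option 2 can be sketched roughly as follows. This is only an illustration: `count_sort` is a made-up name, and it assumes the input contains nothing but the letters A, C, G, N and T.

```perl
use strict;
use warnings;

# Hypothetical helper: sort a DNA string by counting each letter.
# Assumes the input only contains letters from the alphabet A, C, G, N, T.
sub count_sort {
    my ($bases) = @_;
    my %count;
    $count{$_}++ for split //, $bases;
    # Emit each letter, in alphabetical order, as many times as it was seen.
    return join '', map { $_ x ($count{$_} || 0) } qw(A C G N T);
}

print count_sort('GCT'), "\n";   # prints CGT
```

No comparison sort is involved: each character is visited once and the output is built by walking the fixed alphabet.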

    If performance is a problem, a fairly painless thing to do is cache the sorted strings as you calculate them:

    if (!exists $sort_cache{$bases}) {
        $sort_cache{$bases} = join('', sort split('', $bases));
    }
    return $sort_cache{$bases};
    This is of course just a lazy variant on #1 above.
      I think you're right that it would be more work initially for the "all possibilities" hash. I'm no mathematician/statistician but I think there are more like 5! = 120 possibilities (and I certainly wouldn't want to build that hash by hand). Tilly, you're a mathematician. What is the correct number of possibilities?

      Building the hash programmatically would be an interesting brain teaser.

      Update: This was assuming string lengths of up to 5.

      If the code and the comments disagree, then both are probably wrong. -- Norm Schryer

        Are duplicates allowed? If so then the correct number for 1 is 5, for 2 is 5*5=25, for 3 is 5*5*5=125, and for 4 is 5*5*5*5=625. For all strings of length 2-4 that comes out to a grand total of 775.
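If you would rather check that arithmetic by brute force, a small sketch like this one counts the strings per length. It assumes the five-letter alphabet from the question and that duplicates are allowed; the variable names are illustrative only.

```perl
use strict;
use warnings;

my @alphabet = qw(A C G N T);
my @strings  = @alphabet;                      # every string of length 1
my %count_by_length = (1 => scalar @strings);

for my $len (2 .. 4) {
    # Extend every string of the previous length by each letter.
    @strings = map { my $prefix = $_; map { $prefix . $_ } @alphabet } @strings;
    $count_by_length{$len} = scalar @strings;
}

my $total = $count_by_length{2} + $count_by_length{3} + $count_by_length{4};
print "length $_: $count_by_length{$_}\n" for 1 .. 4;
print "total for lengths 2-4: $total\n";       # 25 + 125 + 625 = 775
```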

        Were I autogenerating, my approach might be as follows (untested):

        {
            my @c = qw(A T C G N);
            my @strings = @c;
            foreach (1..5) {
                foreach (@strings) {
                    # $_ is the current unsorted string
                    $sorted_str{$_} = join '', sort split //;
                }
                # extend every string by each letter of the alphabet
                @strings = map { my $string = $_; map $string.$_, @c; } @strings;
            }
        }
        Note that the nested map will be much slower than you think if you are pre 5.6.1. Personally I would be inclined to use the Orcish (for "Or Cache") maneuver for this:
        $bases = $sorted{$bases} ||= join '', sort split //, $bases;
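To see what the Orcish maneuver buys you, here is a minimal sketch that wraps it in a sub and counts how often the sort really runs. The names `sorted_bases` and `$sort_calls` are invented for the illustration and are not part of the one-liner above.

```perl
use strict;
use warnings;

my %sorted;          # cache: unsorted string => sorted string
my $sort_calls = 0;  # illustration only: how many times we actually sorted

sub sorted_bases {
    my ($bases) = @_;
    # ||= evaluates the right-hand side only when the cache slot is empty.
    return $sorted{$bases} ||= do {
        $sort_calls++;
        join '', sort split //, $bases;
    };
}

print sorted_bases($_), "\n" for qw(GCT TCG GCT GCT);
print "sorted $sort_calls times\n";   # only the two distinct inputs get sorted
```

Because `||=` tests for truth rather than existence, this only works when the cached values are never false, which holds here since every sorted string is non-empty and never "0".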
        Building the hash programmatically would be an interesting brain teaser.

        Here is the worst way to do it:
        # Alphabet taken from the question (A, C, G, N, T), upper case to match $bases
        my @strings = (
            (grep { /^[ACGNT]{2}$/ } 'AA'   .. 'TT'),
            (grep { /^[ACGNT]{3}$/ } 'AAA'  .. 'TTT'),
            (grep { /^[ACGNT]{4}$/ } 'AAAA' .. 'TTTT'),
        );
        my %sort_cache;
        for my $key (@strings) {
            $sort_cache{$key} = join '', sort split('', $key);
        }

        Hey, don't take this seriously ;-) it does the job but it's so inefficient it's scary.
        Boy, I really suck at this. One more try assuming strings of length 2-4:

        length of 2: 5 . 4 = 20
        length of 3: 5 . 4 . 3 = 60
        length of 4: 5 . 4 . 3 . 2 = 120
        total of 20 + 60 + 120 = 200 possibilities.

        If the code and the comments disagree, then both are probably wrong. -- Norm Schryer

      Nice Idea to precompute the values.

      I got 3901 which represents the entire set of 2-4 letter long unsorted inputs in this alphabet. This of course folds to a very small number of sorted outcomes.

      Here is the code.

      #!/usr/bin/perl
      use strict;
      use warnings;

      my(%pp);
      # digit 0 maps to a blank, which is stripped to give shorter strings
      my(@acgnt)=( ' ', 'A', 'C', 'G', 'N', 'T' );
      my($i);
      for($i=11;$i<100000;$i++) {
          my($s, $o, @s);
          # skip any number containing a digit outside 1-5
          while($i =~ /6/) {
              $o=index(reverse($i),'6');
              $i+=5*10**$o;
          }
          $s=sprintf "%04d", $i;
          @s=split('',$s);
          @s = map { $acgnt[$_] } @s;
          $s=join('', @s);
          $s =~ y/ //d;
          # drop the padding blanks before sorting so they do not end up in the value
          $pp{$s}=join('', sort grep { $_ ne ' ' } @s);
      }

      #print out the lookup table (not really part of the initializer)
      my($k, $v);
      while(($k,$v)=each %pp) {
          print "$k = $v\n";
      }

      This creates a complete list of inputs you could obtain and builds a hash with the outputs you want to display. It does this fairly quickly and would only have to be done at startup time, and then your print statement would basically be print "$pp{$_}\n";

      This could be made into an initializer function or the values could be computed and saved out and then read in for execution of the real program.
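To put a number on how far the inputs fold down, here is a rough sketch that counts the distinct sorted outcomes. It is separate from the script above and assumes strings of length 2 to 4 over A, C, G, N and T, with duplicates allowed.

```perl
use strict;
use warnings;

my @alphabet = qw(A C G N T);
my @strings  = @alphabet;   # start with every string of length 1
my %outcomes;               # distinct sorted results for lengths 2-4

for my $len (2 .. 4) {
    # Extend every string of the previous length by each letter.
    @strings = map { my $prefix = $_; map { $prefix . $_ } @alphabet } @strings;
    $outcomes{ join '', sort split //, $_ } = 1 for @strings;
}

my $distinct = keys %outcomes;
print "distinct sorted outcomes: $distinct\n";
```

Under those assumptions it should report 120 outcomes (15 of length 2, 35 of length 3 and 70 of length 4).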

Re: Sorting characters within a string
by clintp (Curate) on Aug 24, 2001 at 05:06 UTC
    If we're going for raw speed: don't use perl. :)

    If I were doing this in assembly, and I wanted raw speed I'd:

    • Generate all of the possible combinations and their sorted values, like so: AA => AA, AB => AB, BA => AB, AC => AC, CA => AC.
    • Generate code (don't write it by hand!) that does something along the lines of pseudocode which works for aa, ab, ba, and bb:
      if (substr($base,0,1) eq 'a') {
          if (substr($base,1,1) eq 'a') { return 'aa'; }
          if (substr($base,1,1) eq 'b') { return 'ab' }
      }
      if (substr($base,0,1) eq 'b') {
          if (substr($base,1,1) eq 'a') { return 'ab' }
          if (substr($base,1,1) eq 'b') { return 'bb' }
      }
    Which means that for any possible code of length n and an alphabet of size q there are at most n*q comparisons/jumps in the worst case. (AGCA would be translated to AACG using only 7 comparisons and jumps total, for example.)

    I'm fairly confident that this would outperform any solution using a hash or a split/join/sort. At least, in assembler. I'm just a little too harried to write code to prove that it might be faster in Perl.
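For the curious, the "generate the code, don't write it by hand" step could look something like this in Perl. It is only a sketch under stated assumptions: `gen` and `$sort2` are hypothetical names, it handles a single fixed length (2 here), and it makes no claim about being faster than the hash or split/sort/join versions.

```perl
use strict;
use warnings;

# Hypothetical generator: emit nested if-blocks that test one character
# per level and return the pre-sorted string at the leaves.
sub gen {
    my ($prefix, $len, $alphabet, $indent) = @_;
    if (length($prefix) == $len) {
        my $sorted = join '', sort split //, $prefix;
        return "${indent}return '$sorted';\n";
    }
    my $pos  = length $prefix;
    my $code = '';
    for my $c (@$alphabet) {
        $code .= "${indent}if (substr(\$_[0], $pos, 1) eq '$c') {\n"
               . gen($prefix . $c, $len, $alphabet, "$indent    ")
               . "${indent}}\n";
    }
    return $code;
}

# Build and compile a sorter for 2-letter strings over the DNA alphabet.
my $src   = "sub {\n" . gen('', 2, [qw(A C G N T)], '    ') . "    return;\n}\n";
my $sort2 = eval $src or die $@;

print $sort2->('GC'), "\n";   # prints CG
```

Printing `$src` shows the generated source, which has the same shape as the hand-written aa/ab/ba/bb pseudocode above. Input outside the alphabet falls through and returns nothing.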
