Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Dictionary-style sort a la Tcl?

by argathin (Acolyte)
on Apr 18, 2002 at 09:39 UTC ( #160157=perlquestion: print w/ replies, xml ) Need Help??
argathin has asked for the wisdom of the Perl Monks concerning the following question:

Hi all!
I've been brooding about this one for some time now: I've got an array of strings with mixed data in it (alphanumerical, punctuation, whitespace). The strings are of variable length. I need to sort that data, but any numbers that show up in the data have to be taken into account. Tcl seems to have such a sort function, and the "lsort -dictionary" man page describes what I'm looking for better than I could...
Use dictionary-style comparison. This is the same as -ascii except (a) case is ignored except as a tie-breaker and (b) if two strings contain embedded numbers, the numbers compare as integers, not characters. For example, in -dictionary mode, bigBoy sorts between bigbang and bigboy, and x10y sorts between x9y and x11y.
I've been rummaging through the archives of "Seekers of Perl Wisdom" as well as Q&A and Google, but all I was able to find were alphabetical sorts (i.e. ignoring case and whitespace).
From what I've seen, the whole problem seems to boil down to finding the right string comparison (which could then be used in sort) - or is that already the wrong approach?

If anyone out there has a solution for this or could at least point me in the right direction, I'd be very grateful...

argathin

Comment on Dictionary-style sort a la Tcl?
Re: Dictionary-style sort a la Tcl?
by ariels (Curate) on Apr 18, 2002 at 10:33 UTC
    You need to write a sub (dictionary?) to pass to sort. Here's an attempt, based on your short explanation...
    #!/usr/local/bin/perl -w use strict; sub dictionary { dictionary_case(lc $a,lc $b) || dictionary_case($a,$b) } sub dictionary_case { my @a = split /(\d+)/, $_[0]; my @b = split /(\d+)/, $_[1]; my $lex = 1; for(;;) { my $x = shift @a; my $y = shift @b; if (! defined $x) { return -(defined $y); } elsif (! defined $y) { return +1; } my $c = $lex ? $x cmp $y : $x <=> $y; $lex = !$lex; return $c if $c; } return 0; } print join "\n", sort dictionary (qw(X10y x9y x bigboy y x11y bigbang bigBoy)), "\n";

    It probably doesn't do what you'd expect for comparing "bigboy" and "big9boys", but your specification seems quiet on the subject...
Re: Dictionary-style sort a la Tcl?
by moodster (Hermit) on Apr 18, 2002 at 10:35 UTC
    As an exercise, I wrote a new comparison routine for use with sort. It splits each string to be compared into parts and if two parts are both numbers, they will be compared numerically. Whitespace is ignored. A shorter string is considered smaller than a longer (that is, "abc" is smaller than "abcde" ).
    sub dictcmp { my $ac = $a; my $bc = $b; while( 1 ) { my @a = $ac =~ /^\s*(([A-Za-z]+)|(\d+))(.*)/; my @b = $bc =~ /^\s*(([A-Za-z]+)|(\d+))(.*)/; return 0 if !defined $a[0] & !defined $b[0]; return -1 if !defined $a[0]; return 1 if !defined $b[0]; my $res; if( $a[0] =~ /\d+/ && $b[0] =~ /\d+/) { $res = $a[0] <=> $b[0]; } else { $res = $a[0] cmp $b[0]; } return $res if $res; $ac = $a[3]; $bc = $b[3]; } }
    Calling this routine optimized would be a blatant lie, and I'm sure a lot of things can be improved. It does work, however, at least for simple input data:
    my @list = ( "x10 y", "1abc", "a10y", "x9y", " b1" ); print join "\n", sort dictcmp @list;
    yields
    1abc a10y b1 x9y x10 y

    Cheers,
    --Moodster

Re: Dictionary-style sort a la Tcl?
by blakem (Monsignor) on Apr 18, 2002 at 10:38 UTC
    Here is a quick late-night attempt.... It assumes the words only consist of [A-Za-z] and relatively small integers. Not perfect, but it does work on your sample dataset. Oh, it also sorts 'aa' before 'a1' which is probably not correct, but you didn't mention it in the requirements.
    #!/usr/bin/perl -wT use strict; { # Create an interleaved 'AaBbCc' mapping my %map; $map{$_} = 2*(ord($_)-ord('A')) for 'A'..'Z'; $map{$_} = 2*(ord($_)-ord('a'))+1 for 'a'..'z'; # use interleaved map to munge word into a string that will sort "co +rrectly" sub dictize { my $word = shift; $word =~ s/(\d+|[A-Za-z])/chr($map{$1}||60+$1)/ge; return $word; } } my @words = qw(bigBoy bigbang bigboy x10y x9y x11y); # Sort based on lowercased munged words, # use the non-lowercased version as a tiebreaker my @sorted = map {$_->[0]} sort {$a->[1] cmp $b->[1] or $a->[2] cmp $b->[2]} map {[$_,dictize(lc),dictize($_)]} @words; print "$_\n" for @sorted; __END__ bigbang bigBoy bigboy x9y x10y x11y

    -Blake

Re: Dictionary-style sort a la Tcl?
by Juerd (Abbot) on Apr 18, 2002 at 11:43 UTC

    my @sorted = map $_->[0], sort { my $i = 0; { my $A = $a->[1][$i]; my $B = $b->[1][$i]; defined($A) || defined($B) # Stop if both undef and ( defined($A) <=> defined($B) # Defined wins over undef or ( $A !~ /\d/ || $B !~ /\d/ # $A or $B is non-integer ? (lc $A cmp lc $B) # ?? Stringy lowercase || ( $A cmp $B) # -> Tie breaker : $A <=> $B # :: $A and $B are intege +rs ) or ++$i && redo # tie => next part ); } } map [ $_, [ split /(\d+)/ ] ], @unsorted;
    That's my try, and it seems to work perfectly with the given example, and some of my own tries.

    @unsorted
    qw( bigbang x10y x9y bigboy bigBoy x11y )
    @sorted
    qw( bigbang bigBoy bigboy x9y x10y x11y )
    Update - integer-detection altered per argarthin's reply.

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

      First of all: A big THANKS to all who responded! The solutions offered by ariels and Juerd seem to come the closest to what I need (BTW: Thanks for pointing out that I need to rethink my requirements!). blakem's solution probably won't work, as I can't make the assumption mentioned by him about the data I get and moodster's solution doesn't seem to work for all cases (though I didn't quite find out why).
      In any case, I learned a few more things, which is good... :-)

      As for Juerd's code:
      That's a Schwartzian Transform, right? (Didn't know it existed until today... :-})
      However, there seems to be one problem: I get quite a few of Argument "" isn't numeric in numeric comparison (<=>) ... if any of the strings starts with a number. If I understand your code correctly, the problem lies in the : $A <=> $B part. When the string starts with a number, the split generates a first field that's defined, but empty. After replacing $A =~ /\D/ || $B =~ /\D/ with $A !~ /\d+/ || $B !~ /\d+/ (is there a better way?), it worked.

      There's also a second problem, but that's partly due to my apparently incomplete requirements - I wasn't aware of that until I experimented a bit with the solutions offered here. Take the following data set:

      Unsorted:x10y 1abc a10y x9y b1 abc DEF 123DEF bigboys big9boys 123DEFG 123def

      Using your code or ariels' code, the result is:

      Sorted:1abc 123DEF 123def 123DEFG a10y abc b1 big9boys bigboys DEF x9y x10y

      Whether "big9boys" should appear before "bigboys" is probably debatable - I'm not 100% sure myself wich way round would be better in my case...
      I am, however, wondering about "1abc" coming before "123DEF" (and the other "123..."), though I can't see a way of changing that without breaking the other requirements. I'll have to think about it.

      Thanks again,
      argathin

        Actually, I think that would be correct. 1 would come before 123 in numeric sequence (according to your definition of the -dictionary sort). Also, I think big9boys would appear before bigboys, but just because in ascii order the digits come before letters.

        hat's a Schwartzian Transform, right?

        It is. In short: by creating a datastructure that you store together with the original string, you don't have to re-build the original string and don't have to take apart the string at every iteration.

        After replacing $A =~ /\D/ || $B =~ /\D/ with $A !~ /\d+/ || $B !~ /\d+/ (is there a better way?), it worked.

        Your correction was correct. You can actually even drop the plusses. I'll update my code right away.

        I am, however, wondering about "1abc" coming before "123DEF" (and the other "123...")

        As I see it, that is correct. Integers should be used as such, according to your definition. That was the challenge :) First, the string is split to integer and non-integer parts, and then they are compared to eachother.

        out of (1, abc) and (123, DEF), the first wins because 1 < 123.

        - Yes, I reinvent wheels.
        - Spam: Visit eurotraQ.
        

      Wow, I feel like the ultimate necroposter here.

      I stumbled across this solution, and found it fell over when it met the likes of the following:

      qw{ 1.01 1.3 1.02 1.2 }
      $A <=> $B does a numerical comparison, but strips any leading zeroes from the number, so how 1.02 and 1.2 are sorted will depend on their original order in the list - Tcl's dictionary sort doesn't seem to do this, and will place 1.2 above 1.02 in the sort.

      So I extended your example to check for the string length of the compared numbers if they match numerically - If they match numerically, but have differing string lengths, then one must have leading zeroes. I also implemented the code into a subroutine:

      sub dict_sort { my @unsorted = @_; my @sorted = map $_->[0], sort { my $i = 0; { my $A = $a->[1][$i]; my $B = $b->[1][$i]; defined($A) || defined($B) # Stop if both undef and ( defined($A) <=> defined($B) # Defined wins over undef or ( $A !~ /\d/ || $B !~ /\d/ # $A or $B is non-integer ? (lc $A cmp lc $B) # ?? Stringy lowercase || ( $A cmp $B) # -> Tie breaker : $A <=> $B # :: $A and $B are integers or ( length($A) <=> length($B) # If numeric comparison ret +urns the same, check length to sort by leading zeroes ) ) or ++$i && redo # tie => next part ); } } map [ $_, [ split /(\d+)/ ] ], @unsorted; return @sorted; }

      I'm not sure that this thread will ever get any more posts, but hopefully this helps somebody out!

        I stumbled across this solution, and found it fell over when it met the likes of the following: qw{ 1.01 1.3 1.02 1.2 };

        $A <=> $B does a numerical comparison, but strips any leading zeroes from the number, so how 1.02 and 1.2 are sorted will depend on their original order in the list

        No, the spaceship operator  <=> does no stripping of any kind, it does a numerical comparison, and numbers don't have leading zeros , 0002 is 02 is 2

        The problem is with the code/regex, which turns decimals into an array of integers , and then does a numerical comparison on each portion -- not gonna work

        I'm not too familiar with Tcl's sort order, but what works for me (esp for mp3s) is to lowercase the string, and pad all digits with zeroes (sometimes 6, sometimes 20)

        #!/usr/bin/perl -- use strict; use warnings; my @list = ( qw{ bigBoy bigbang bigboy x10y x9y x11y 1.01 1.3 1.02 1.2 x1.1y x1.01y x2.3y x2.1y } ); print join "\n", map { $$_[0] } sort { $$a[1] cmp $$b[1] } map { my $f = $_; $f =~ s/(\d+)/sprintf '%06d',$1/ge; [ $_, lc $f ] } @list; __END__ 1.01 1.02 1.2 1.3 bigbang bigBoy bigboy x1.1y x1.01y x2.1y x2.3y x9y x10y x11y

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://160157]
Approved by blakem
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (11)
As of 2014-12-27 19:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (177 votes), past polls