Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Changing ASCII collating sequence for sort

by ibm1620 (Hermit)
on Apr 17, 2022 at 18:34 UTC ( [id://11143026]=perlquestion: print w/replies, xml ) Need Help??

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

I need to change the ASCII collating sequence so that all non-alphanumerics sort together before numerics and alphas, and upper and lower cases are kept together. I thought that I could have the sort function's comparator transliterate @a and @b for the purpose of comparison, but it's failing and I don't see why.
#!/usr/bin/env perl use 5.010; use warnings; use strict; say <<'EOF'; .,-;:!?"'`_#$%&*+/|=@\^~()<>[]{}0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmN +nOoPpQqRrSsTtUuVvWwXxYyZz intended sequence !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcde +fghijklmnopqrstuvwxyz{|}~ natural ASCII sequence EOF my @list = qw{ "Hello" Abel (hello) {adieu} @adieu [goodbye] Charlie ^Charlie ~Adieu zebra 21708 baker . - ; : ! ? " ' ` _ }; my @sorted_list = sort { # for each $a:$b comparison, transliterate $a and $b into temp + vars $x and $y my ($x,$y) = map { my $z = $_ =~ tr / .,-;:!?"'`_#$%&*+\/|=@\\^~()<>[]{}0123456789AaBbCcDd +EeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz / !"#$%&'()*+,-.\/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRS +TUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ /r; # don't smash the original $a and $b, instead assi +gn to $z #printf "%-10s -> %-10s\n", $_, $z; $z; } ($a,$b); # transliterate $a and $b into $x and $y for sortin +g show_compare($a, $b, $x, $y); $x cmp $y } @list; say "\nSorted list"; say for @sorted_list; sub show_compare { my ($a, $b, $x, $y) = @_; printf "%-10s %-10s %s %-10s %-10s\n", $a, $x, (qw/< = >/)[1+($x c +mp $y)], $b, $y; }
Output:
.,-;:!?"'`_#$%&*+/|=@\^~()<>[]{}0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmN +nOoPpQqRrSsTtUuVvWwXxYyZz intended sequence !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcde +fghijklmnopqrstuvwxyz{|}~ natural ASCII sequence [Show @a and $b and their transliterated equivalents] "Hello" 5faoou5 < Abel X[ao (hello) FgaoouG < {adieu} LY_ia M @adieu BY_ia < [goodbye] Jeuu_[ aK Charlie \gY{oia > ^Charlie D\gY{oia ~Adieu EX_ia > zebra a[{Y 21708 ('-&. < baker [Yma{ . ! < - # ; 1 > : 0 ! 3 < ? 4 " 5 < ' 6 ` 7 < _ 8 "Hello" 5faoou5 < (hello) FgaoouG (hello) FgaoouG < Abel X[ao Abel X[ao > {adieu} LY_ia M "Hello" 5faoou5 < @adieu BY_ia @adieu BY_ia < (hello) FgaoouG (hello) FgaoouG < [goodbye] Jeuu_[ aK [goodbye] Jeuu_[ aK < {adieu} LY_ia M ^Charlie D\gY{oia > zebra a[{Y ^Charlie D\gY{oia < ~Adieu EX_ia Charlie \gY{oia > ~Adieu EX_ia zebra a[{Y < 21708 ('-&. 21708 ('-&. < ^Charlie D\gY{oia ^Charlie D\gY{oia < baker [Yma{ baker [Yma{ > ~Adieu EX_ia baker [Yma{ < Charlie \gY{oia "Hello" 5faoou5 > zebra a[{Y "Hello" 5faoou5 > 21708 ('-&. "Hello" 5faoou5 < ^Charlie D\gY{oia @adieu BY_ia < ^Charlie D\gY{oia ^Charlie D\gY{oia < (hello) FgaoouG (hello) FgaoouG > ~Adieu EX_ia (hello) FgaoouG < baker [Yma{ [goodbye] Jeuu_[ aK < baker [Yma{ baker [Yma{ > {adieu} LY_ia M baker [Yma{ > Abel X[ao . ! < : 0 : 0 > - # . ! < ! 3 ! 3 > - # ! 3 > : 0 ! 3 > ; 1 " 5 < ` 7 ` 7 > ' 6 . ! < " 5 " 5 > - # " 5 > : 0 " 5 > ; 1 " 5 > ! 3 " 5 > ? 4 zebra a[{Y < . ! . ! < 21708 ('-&. 21708 ('-&. > - # 21708 ('-&. < : 0 "Hello" 5faoou5 > : 0 "Hello" 5faoou5 > ; 1 "Hello" 5faoou5 > ! 3 "Hello" 5faoou5 > ? 4 "Hello" 5faoou5 > " 5 "Hello" 5faoou5 < ' 6 @adieu BY_ia > ' 6 @adieu BY_ia > ` 7 @adieu BY_ia > _ 8 Sorted list zebra . - 21708 : ; ! ? " "Hello" ' ` _ @adieu ^Charlie ~Adieu (hello) [goodbye] {adieu} Abel baker Charlie
It appears that tr is not transliterating in the way I expected it to. Can someone spot the error? Is there a better way to do this?

Replies are listed 'Best First'.
Re: Changing ASCII collating sequence for sort
by graff (Chancellor) on Apr 17, 2022 at 21:22 UTC
    I think your problem might be that in the tr/.../.../ operator inside your sort block, you forgot to put a backslash-escape in front of the hyphen characters.

    When used within the left or right side of the tr/// operator, hyphen is "magic": it interpolates to the set of characters between the preceding and following character, so ,-; (in the left-hand character sequence) becomes ,-.\/0123456789:;

    (and in the right-hand sequence, ,-. "becomes" ,-. because ASCII arranges those three characters as adjacent in exactly that order, so putting the backslash escape in front of that hyphen just happens to make no difference)

      This neatened it up a bit ...
      my @sorted_list = sort { tl($a) cmp tl($b) } @list; sub tl { return shift =~ tr / .,\-;:!?"'`_#$%&*+\/|=@\\^~()<>[]{}0123456789AaBbCcDdEeF +fGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz / -~/r; }
      Precisely. Thanks!
Re: Changing ASCII collating sequence for sort
by tybalt89 (Monsignor) on Apr 17, 2022 at 18:52 UTC

    Like this?

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11143026 use warnings; use List::AllUtils qw( sort_by ); print <<'EOF'; .,-;:!?"'`_#$%&*+/|=@\^~()<>[]{}0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmN +nOoPpQqRrSsTtUuVvWwXxYyZz intended sequence !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcde +fghijklmnopqrstuvwxyz{|}~ natural ASCII sequence EOF my @list = qw{ "Hello" Abel (hello) {adieu} @adieu [goodbye] Charlie ^ +Charlie ~Adieu zebra 21708 baker . - ; : ! ? " ' ` _ }; my @normalsorted = sort @list; print "@normalsorted\n\n"; my @newsorted = sort_by { tr/ .,\-;:!?"'`_#$%&*+\/|=@\\^~()<>[]{}01234 +56789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz/ -~/r } @li +st; print "@newsorted\n\n";

    Outputs:

    .,-;:!?"'`_#$%&*+/|=@\^~()<>[]{}0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmN +nOoPpQqRrSsTtUuVvWwXxYyZz intended sequence !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcde +fghijklmnopqrstuvwxyz{|}~ natural ASCII sequence ! " "Hello" ' (hello) - . 21708 : ; ? @adieu Abel Charlie [goodbye] ^C +harlie _ ` baker zebra {adieu} ~Adieu . - ; : ! ? " "Hello" ' ` _ @adieu ^Charlie ~Adieu (hello) [goodbye] { +adieu} 21708 Abel baker Charlie zebra

      The same approach could be used with the fabulous Sort::Key collection of modules.

      use Sort::Key qw( keysort ); my @newsorted = keysort { tr/.../.../r } @list;

        I did, and don't call me Luke.</Airplane>

        (how do you think "sort_by" works internally?)

Re: Changing ASCII collating sequence for sort
by johngg (Canon) on Apr 18, 2022 at 10:14 UTC

    Rather than using transliteration you could employ look-ups and a GRT sort.

    use strict; use warnings; use feature qw{ say }; my $intended = <<'EOF'; .,-;:!?"'`_#$%&*+/|=@\^~()<>[]{}0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmN +nOoPpQqRrSsTtUuVvWwXxYyZz EOF chomp $intended; my $seq = 0; my %sortLU = map { $_ => sprintf q{%02d}, $seq ++ } split m{}, $intend +ed; my %revLU = reverse %sortLU; my @list = qw{ "Hello" Abel (hello) {adieu} @adieu [goodbye] Charlie ^Charlie ~Adieu zebra 21708 baker . - ; : ! ? " ' ` _ }; push @list, q{ first}; say for map { join q{}, map { $revLU{ $_ } } unpack q{(a2)*}, $_ } sort map { join q{}, map { $sortLU{ $_ } } split m{}, $_ } @list;

    The output

    first . - ; : ! ? " "Hello" ' ` _ @adieu ^Charlie ~Adieu (hello) [goodbye] {adieu} 21708 Abel baker Charlie zebra

    I hope this is useful.

    Update: I mistakenly omitted the space character from the start of the sort order string. Now corrected with an additional item " first" to check that spaces are sorted correctly.

    Cheers,

    JohnGG

Re: Changing ASCII collating sequence for sort
by vr (Curate) on Apr 19, 2022 at 07:53 UTC

    With Unicode::Collate (core module), only non-alphanumeric ordering has to be overridden. For task at hand, both full character names and quaternary weights could be simply omitted. Same about using asterisk instead of dot in element description; with dot, "variable" parameter wouldn't be necessary. However, seemingly "extra" work will, hopefully, help with vaguely looming further adjustments/customizations in the future.

    use strict; use warnings; use feature 'say'; use Unicode::Collate; use charnames ''; my $punctuation_order = ' .,-;:!?"\'`_#$%&*+/|=@\^~()<>[]{}'; my $weight_base = 0x101; my $collator = Unicode::Collate-> new( upper_before_lower => 1, variable => 'non-ignorable', entry => join '', map { sprintf "%04X ; [*%04X.0020.0002.%04X] # %s\n", ord, $weight_base ++, ord, charnames::viacode( ord ) } split '', $punctuation_order ); my @list = qw{ }; # skipped say for $collator-> sort( @list );

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11143026]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (6)
As of 2024-03-28 14:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found