PerlMonks  

Language::MySort

by japhy (Canon)
on May 27, 2003 at 06:34 UTC ( [id://260942]=CUFP )

This is a rather small but useful module I hacked together a little while ago. It creates a transformation and sorting routine for any alphabet you give it. If you've got an alphabet in which vowels are sorted before consonants, you can use this module to create a sorting function that takes that into account.

You have to deal with lowercase and uppercase yourself, since (as in Klingon) they needn't sort to the same location. Alphabets are limited to 256 characters, because each character is mapped onto a single byte.

I just updated it, changing the function syntax a bit, and adding a feature.

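    package Language::MySort;

    require Exporter;
    @ISA    = qw( Exporter );
    @EXPORT = qw( lang_sort );

    # transformed key => original word, one hash per alphabet
    %words = ();

    sub lang_sort {
      my ($ignore, $same, $chars, $tr, $sorter) = ("", "");

      # an optional trailing hash reference carries the options
      if (ref $_[-1]) {
        my $opt = pop;
        $ignore = $opt->{ignore} || "";
        # the sample runs pass this map as "identical";
        # "translate" is the key the code originally read
        $same   = $opt->{identical} || $opt->{translate} || "";

        # code fragment that deletes the ignored characters
        $ignore = "\$s =~ tr/\Q$ignore\E//d;";

        if ($same) {
          # split each "AXYZ" entry into its target "A" and sources "XYZ"
          my @f = map substr($_, 0, 1, ""), @$same;
          $same = " =~ tr/" .
            quotemeta(join "", @$same) . "/" .
            quotemeta(join "", map $f[$_] x length($same->[$_]), 0 .. $#$same) . "/";
        }
      }

      $chars = @_ == 1 ? shift : join "", @_;

      # build the transformation: alphabet position => byte value
      $tr = eval qq{
        sub {
          (my \$s = shift) $same;
          $ignore
          \$s =~ tr/\Q$chars\E/\000-\377/;
          \$s;
        }
      };

      $sorter = sub {
        my @used = map $tr->($_), @_;
        @{ $words{$chars} }{ @used } = @_;
        @{ $words{$chars} }{ sort @used };
      };

      return wantarray() ? ($sorter, $tr) : $sorter;
    }

    1;
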
Here's a sample run to create a sorter for (lowercase) French text (I don't think I left out any accented characters, but I could be wrong).

    use Language::MySort;

    *french_sort = lang_sort(
      # *the character list*
      # only includes the characters remaining after
      # the identical-character map has been applied
      'a' .. 'z',
      {
        # *the identical-character map*
        # maps characters to the character
        # they should sort identically as
        # "AXYZ" means that X, Y, and Z are translated as A
        identical => ["a\340", "c\347", "e\350\351\352\353", "o\364"],
      }
    );

    {
      local $, = " ";
      print french_sort(
        "\351tude", "\352tre", "tr\350s", "entrer", "\351t\351",
      );
    }

And here's a sample run for a small language of 10 characters in which the vowels "a", "e", and "i" sort before every other letter, and in which the language's mid-word punctuation, "-" and ".", is ignored:

    use Language::MySort;

    *weird_sort = lang_sort(
      # place vowels ahead of consonants
      qw( a e i b c d f g h j ),
      {
        # map uppercase characters to lowercase
        identical => [qw( aA bB cC dD eE fF gG hH iI jJ )],
        # ignore - and .
        ignore => "-.",
      }
    );

Because of the way the generator function works (using the tr/// operator), you can also write the above function call as:

    use Language::MySort;

    *weird_sort = lang_sort(
      # place vowels ahead of consonants
      qw( a e i ), 'a' .. 'j',
      {
        # map uppercase characters to lowercase
        identical => [qw( aA bB cC dD eE fF gG hH iI jJ )],
        # ignore - and .
        ignore => "-.",
      }
    );

Even though the vowels are duplicated in the character list, the transliteration operator will only recognize the first occurrence of them. It's a bit of Perl magic that the module takes advantage of to make your life a bit easier.
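
To see the first-occurrence rule on its own, here is a minimal standalone sketch that does not use the module at all. The name weird_key is made up for this illustration, and the replacement range is written out by hand to match the 13 characters in the search list.

```perl
use strict;
use warnings;

# The search list is "aei" followed by "abcdefghij", so a, e and i each
# appear twice. tr/// uses only the first mapping for a repeated character,
# which leaves the effective order a e i b c d f g h j.
sub weird_key {
    my ($word) = @_;
    (my $key = $word) =~ tr/aeiabcdefghij/\000-\014/;
    return $key;
}

my @sorted = sort { weird_key($a) cmp weird_key($b) } qw( bad fig ice ace dab );
print "@sorted\n";    # ace ice bad dab fig
```

The later duplicates still occupy a slot in the replacement list (bytes 3, 7 and 11 are never produced), which does not matter for sorting because only the relative order of the bytes is used.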

Finally, here's a simpler sorter for English alphabetical order that puts capital letters before their lowercase counterparts, but intersperses uppercase and lowercase words (so you get Axxx axxx Bxxx bxxx, not Axxx Bxxx axxx bxxx).

    use Language::MySort;

    *sorter = lang_sort(
      # nifty way to make (A, a, B, b, C, c, ... Z, z)
      (map +($_, lc), 'A' .. 'Z'),
      { ignore => q{-} }
    );

Replies are listed 'Best First'.
Re: Language::MySort
by Zaxo (Archbishop) on May 27, 2003 at 12:38 UTC

    Does this improve on the POSIX locale system? I can see that it would for Klingon, but for most purposes it seems simpler to (adapted from perllocale):

        use POSIX qw( locale_h );
        my $oldloc = setlocale( LC_COLLATE, 'fr_CA.ISO8859-1' )
          or warn 'Locale missing: ', $!;
        use locale;
        # go on to sort things

    That approach makes available all the locale knowledge that the C libs have.

    After Compline,
    Zaxo

      Whilst he described it in terms of sorting an arbitrary alphabet, it's not limited to alphabets. What if you want to combine the normal English sort order with a custom sort-order for punctuation characters? Such as ignoring the $ and @ signs, so that $_ and @_ get sorted together rather than having a bunch of crap in between.
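
A rough sketch of that sigil-ignoring idea, written without the module so it stands alone. It assumes plain ASCII order is close enough to "normal English sort order" for the example, and sigil_key is a name invented here.

```perl
use strict;
use warnings;

# Build a comparison key by deleting the sigils from a copy of the name.
sub sigil_key {
    my ($name) = @_;
    (my $key = $name) =~ tr/$@//d;
    return $key;
}

my @names = ('@_', 'alpha', '$_', '$zeta', '@beta');

# Pair each name with its key, sort on the key, and fall back to the
# original name so that '$_' and '@_' come out in a stable order.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
             map  { [ sigil_key($_), $_ ] } @names;

print "@sorted\n";    # $_ @_ alpha @beta $zeta
```

The key/name pairs are kept in an array rather than a hash on purpose: '$_' and '@_' produce the same key, and a hash keyed on the transformed string (which is how the posted module appears to store its words) would keep only one of them.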
Re: Language::MySort
by theorbtwo (Prior) on May 28, 2003 at 00:12 UTC

    I've got some feature requests, which I think would require changing how you do it in some cases... but I've got some ideas there.

    First off, and I don't have an implementation idea here: Could you make identical sequences longer than one char? For example, in German, ue should sort the same place as ü often, and likewise ae=>ä, oe=>ö (ë, ï, and ÿ aren't in German). (Also, s-set/sharp-s should sort the same as ss, but I'm too lazy to type that properly.) (This isn't quite the traditional sort-order, BTW.)

    Also, the possibility of having alphabets longer than 255 chars would be nice. You could do this by having the RHS of your tr be based on the count of chars in the alphabet rather than a static 0-255.
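
One way the larger-alphabet idea might look, as an untested-in-the-module sketch: it swaps tr/// for a rank lookup and packs each rank as a 16-bit big-endian integer, which raises the limit to 65,536 characters while keeping plain string comparison valid. The name make_wide_key is hypothetical, and unlike tr/// this version drops characters that are not in the alphabet.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a key-building sub for the given alphabet.
sub make_wide_key {
    my @alphabet = @_;
    my %rank;
    for my $i (0 .. $#alphabet) {
        # as with tr///, the first occurrence of a character wins
        $rank{ $alphabet[$i] } = $i unless exists $rank{ $alphabet[$i] };
    }
    return sub {
        my ($word) = @_;
        # 'n' packs an unsigned 16-bit big-endian value, so comparing the
        # packed strings bytewise compares the ranks in order
        return pack 'n*',
               map  { $rank{$_} }
               grep { exists $rank{$_} }    # assumption: skip unknown chars
               split //, $word;
    };
}

my $key = make_wide_key(qw( a e i b c d f g h j ));

my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ $key->($_), $_ ] } qw( bad fig ice ace dab );

print "@sorted\n";    # ace ice bad dab fig
```

The demonstration reuses the 10-letter alphabet from the original post only to keep the output easy to check; the same sub accepts a list of several thousand characters.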


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      ue should sort the same place as ü often, and likewise ae=>ä, oe=>ö

      Same for Esperanto.

      The Esperanto alphabet is
      a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z.

      But in ASCII this is written as
      a b c cx d e f g gx h hx i j jx k l m n o p r s sx t u ux v z.
      This is safe, because there is no real "x".

      It'd be nice if we could sort ASCII German or Esperanto without using Lingua::DE::Ascii or Lingua::EO::Supersignoj.
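
A sketch of how multi-character letters could be handled for the ASCII x-convention above, independent of Language::MySort. It tokenizes each word into letters (trying the two-character forms first) and packs each letter's position in the alphabet into one byte. The name eo_key is invented here, and the sample strings were picked to show the ordering, not for how common they are.

```perl
use strict;
use warnings;

# The 28 letters in alphabetical order, using the x-convention.
my @alphabet = qw( a b c cx d e f g gx h hx i j jx k l m n o p r s sx t u ux v z );
my %rank;
@rank{@alphabet} = (0 .. $#alphabet);

sub eo_key {
    my ($word) = @_;
    # match a digraph such as "cx" before falling back to a single character
    my @letters = $word =~ /[cghjsu]x|./gs;
    # assumption: anything outside the alphabet is ignored
    return pack 'C*', map { $rank{$_} } grep { exists $rank{$_} } @letters;
}

my @words = qw( uxato cxar uzi cirklo celo );

my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ eo_key($_), $_ ] } @words;

print "@sorted\n";    # celo cirklo cxar uzi uxato
```

A plain sort would put "uxato" before "uzi", because "x" precedes "z" in ASCII; with the tokenized key, "ux" counts as one letter that follows "u", so "uzi" comes first.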

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

Re: Language::MySort
by Willard B. Trophy (Hermit) on Feb 13, 2004 at 18:21 UTC
    Does this give you more than Sort::ArbBiLex does?

    --
    bowling trophy thieves, die!

      No, less. It doesn't do single-to-multiple character transformations as asked for in other replies here. I found the module you speak of after I wrote the node.
      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
