http://www.perlmonks.org?node_id=998507


in reply to Re: getting Unicode character names from string
in thread getting Unicode character names from string

If the goal is just to compare the base characters, you should probably be using Unicode::Collate at level 1: "alphabetic ordering". Level 2 adds "diacritic ordering", level 3 adds "case ordering", and level 4 adds "tie-breaking"; each level also applies the ones below it.

Are there clear examples of how to do this in the documentation? Or elsewhere on the interwebs? Nothing's jumping out at me when I look/search. Thanks!

Re^3: getting Unicode character names from string
by davido (Cardinal) on Oct 11, 2012 at 20:51 UTC

    The synopsis for Unicode::Collate does a reasonable job of setting the stage, but there is a nice discussion in chapter 6 of Programming Perl (the camel book), 4th edition as well. You might also look at the Unicode Technical Standard #10: Unicode Collation Algorithm.

    Here's a brief example that compares the same three strings twice using Unicode::Collate: first at the default level, then at a lower (more relaxed) level.

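    use strict;
    use warnings FATAL => 'utf8';   # treat utf8-category warnings as fatal errors
    use utf8;                       # this source file contains UTF-8 literals
    
    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';
    
    # alpha: plain, with an acute accent, with a grave accent
    my( $x, $y, $z ) = qw( α ά ὰ );
    
    # Default collator: compares at all four levels
    my $c = Unicode::Collate->new;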
    
    print "\nStrict collation rules: Level 4 (default)\n";
    print "\t cmp('α','ά'): ", $c->cmp( $x, $y ), "\n";
    print "\t cmp('ά','ὰ'): ", $c->cmp( $y, $z ), "\n";
    print "\t cmp('α','ὰ'): ", $c->cmp( $x, $z ), "\n";
    
    # Relaxed collator: level 1 looks at base letters only, ignoring accents and case
    my $rc = Unicode::Collate->new( level => 1 );
    
    print "\nRelaxed collation rules: Level 1\n";
    print "\t cmp('α','ά'): ", $rc->cmp( $x, $y ), "\n";
    print "\t cmp('ά','ὰ'): ", $rc->cmp( $y, $z ), "\n";
    print "\t cmp('α','ὰ'): ", $rc->cmp( $x, $z ), "\n\n";
    

    And the output...

    
    Strict collation rules: Level 4 (default)
    	 cmp('α','ά'): -1
    	 cmp('ά','ὰ'): -1
    	 cmp('α','ὰ'): -1
    
    Relaxed collation rules: Level 1
    	 cmp('α','ά'): 0
    	 cmp('ά','ὰ'): 0
    	 cmp('α','ὰ'): 0
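
    To see where each level starts to matter, here is a rough sketch that tests three pairs for equality at levels 1 through 3. The helper same_at_level and the choice of pairs are mine, made up for illustration; the eq method and the level option come from Unicode::Collate.

    ```perl
    use strict;
    use warnings;
    use utf8;

    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';

    # One collator per level, built on first use and then reused.
    my %collator_for;

    # Hypothetical helper (not part of Unicode::Collate):
    # returns 1 if the two strings are equal at the given level, else 0.
    sub same_at_level {
        my( $level, $l, $r ) = @_;
        $collator_for{$level} //= Unicode::Collate->new( level => $level );
        return $collator_for{$level}->eq( $l, $r ) ? 1 : 0;
    }

    # Each pair differs in exactly one respect.
    my @pairs = (
        [ 'α', 'β' ],    # different base letters
        [ 'α', 'ά' ],    # same base letter, different diacritic
        [ 'α', 'Α' ],    # same base letter, different case
    );

    for my $level ( 1 .. 3 ) {
        print "Level $level\n";
        for my $pair (@pairs) {
            my( $l, $r ) = @$pair;
            print "\t eq('$l','$r'): ", same_at_level( $level, $l, $r ), "\n";
        }
    }
    ```

    With the default collation table I'd expect α and β never to compare equal, α and ά to be equal only at level 1, and α and Α to be equal at levels 1 and 2 but not at level 3.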
    
    

    And if the reason for doing comparisons is to handle sorting, Unicode::Collate does that too: its sort method returns the list in collation order, so you don't need to call Perl's core sort with a custom comparison block.
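
    A minimal sketch of that, assuming the default collator is what you want. The three strings are made up for illustration and aren't real Greek words.

    ```perl
    use strict;
    use warnings;
    use utf8;

    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';

    # Made-up test strings: each starts with some form of alpha.
    my @words = qw( ὰβ αγ άα );

    my $c = Unicode::Collate->new;

    # The collator's own sort method: base letters are compared first,
    # and accents only break ties between otherwise identical strings.
    my @by_collation = $c->sort(@words);

    # Perl's core sort, for contrast: plain code point order.
    my @by_codepoint = sort @words;

    print "collated:   @by_collation\n";
    print "code point: @by_codepoint\n";
    ```

    The collated line should come out as άα ὰβ αγ, because the second letters (α, β, γ) decide the order once the accents on the first letter are set aside. The code point line depends on where each accented character happens to sit in Unicode, which is the reason not to rely on it.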


    Dave