
Re^3: RFC: Is this the correct use of Unicode::Collate?

by moritz (Cardinal)
on Jan 17, 2012 at 15:39 UTC (#948339)


in reply to Re^2: RFC: Is this the correct use of Unicode::Collate?
in thread RFC: Is this the correct use of Unicode::Collate?

The implication in the article was that you could replace 'sort' with 'Unicode::Collate'.

And that seems to be the real problem. sort isn't broken (that's just link baiting), and neither is Unicode::Collate. They just do different things.

The article does say

Fortunately, you don't have to come up with your own algorithm for dictionary sorting, because Perl provides a standard class to do this for you: Unicode::Collate

So despite its title, it doesn't present Unicode::Collate as a universal replacement for sort, only as the tool for one application: dictionary sorting.
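To make "they just do different things" concrete, here is a minimal sketch that runs both over the same list. The word list is made up for illustration and does not come from the article; the collator is a default Unicode::Collate object with no options.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical sample data, not taken from the article.
my @words = qw(banana Apple cherry apple Banana);

# The built-in sort compares code points, so uppercase ASCII
# letters come before all lowercase ones.
my @by_codepoint = sort @words;

# Unicode::Collate applies the Unicode Collation Algorithm,
# which orders by letter first and by case only as a tie-breaker.
my @by_uca = Unicode::Collate->new->sort(@words);

print "sort:             @by_codepoint\n";
print "Unicode::Collate: @by_uca\n";
```

Neither result is wrong; which one you want depends on whether the list is meant for a human reader or for a machine comparing raw values.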


Re^4: RFC: Is this the correct use of Unicode::Collate?
by flexvault (Parson) on Jan 17, 2012 at 16:10 UTC

    moritz,

    But all the references in the article relate to data in databases. I googled ASCII and UTF-8 and repeatedly found statements like "...UTF-8 uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding...", so why are the characters 0 - 127 being redefined? I understand the complexity of the subject, but the designers of UTF-8 knew better than to mess with ASCII, and that is why UTF-8 enhances ASCII.
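The quoted statement about UTF-8 can be checked directly. This is a small sketch using the core Encode module; the variable names are illustrative only. It tests the encoding of the characters and says nothing about the order a collation puts them in.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Count the ASCII characters whose UTF-8 encoding is NOT
# a single byte with the same value as the code point.
my $mismatches = grep {
    my $bytes = encode('UTF-8', chr $_);
    !(length($bytes) == 1 && ord($bytes) == $_);
} 0 .. 127;

# For contrast: U+00E9 (e with acute) lies outside ASCII
# and takes two bytes in UTF-8.
my $e_acute_len = length encode('UTF-8', "\x{E9}");

print "ASCII mismatches: $mismatches\n";
print "bytes for U+00E9: $e_acute_len\n";
```

The code values of 0 - 127 are the same either way; what a collation module changes is the comparison order of strings, not their encoding.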

    'Unicode::Collate' is core, so it could be used a lot in the future, as it should be. But a lot of production environments will be affected if they don't know in advance that the code points of ASCII have been redefined.

    My hope was that someone would say that an option such as 'ASCII => 1' would make it work like Perl's 'sort' for ASCII characters and apply the Unicode ordering only to anything above 127.

    Thank you

    "Well done is better than well said." - Benjamin Franklin

      'Unicode::Collate' is core, so it could be used a lot in the future, as it should be. But a lot of production environments will be affected if they don't know in advance that the code points of ASCII have been redefined.
      A text sort looks nothing at all like a code point sort. You seem to think that 7-bit code points should not sort as text. That completely defeats the whole purpose.

      Watch here to see what really happens:

      $ perl -MUnicode::Collate -E 'for (Unicode::Collate->new->sort(map { chr } 0..127)) { say "chr ", ord, "\t", /\p{graph}/ ? $_ : "(unprintable)" }'
      chr 0   (unprintable)
      chr 1   (unprintable)
      chr 2   (unprintable)
      chr 3   (unprintable)
      chr 4   (unprintable)
      chr 5   (unprintable)
      chr 6   (unprintable)
      chr 7   (unprintable)
      chr 8   (unprintable)
      chr 14  (unprintable)
      chr 15  (unprintable)
      chr 16  (unprintable)
      chr 17  (unprintable)
      chr 18  (unprintable)
      chr 19  (unprintable)
      chr 20  (unprintable)
      chr 21  (unprintable)
      chr 22  (unprintable)
      chr 23  (unprintable)
      chr 24  (unprintable)
      chr 25  (unprintable)
      chr 26  (unprintable)
      chr 27  (unprintable)
      chr 28  (unprintable)
      chr 29  (unprintable)
      chr 30  (unprintable)
      chr 31  (unprintable)
      chr 127 (unprintable)
      chr 9   (unprintable)
      chr 10  (unprintable)
      chr 11  (unprintable)
      chr 12  (unprintable)
      chr 13  (unprintable)
      chr 32  (unprintable)
      chr 96  `
      chr 94  ^
      chr 95  _
      chr 45  -
      chr 44  ,
      chr 59  ;
      chr 58  :
      chr 33  !
      chr 63  ?
      chr 46  .
      chr 39  '
      chr 34  "
      chr 40  (
      chr 41  )
      chr 91  [
      chr 93  ]
      chr 123 {
      chr 125 }
      chr 64  @
      chr 42  *
      chr 47  /
      chr 92  \
      chr 38  &
      chr 35  #
      chr 37  %
      chr 43  +
      chr 60  <
      chr 61  =
      chr 62  >
      chr 124 |
      chr 126 ~
      chr 36  $
      chr 48  0
      chr 49  1
      chr 50  2
      chr 51  3
      chr 52  4
      chr 53  5
      chr 54  6
      chr 55  7
      chr 56  8
      chr 57  9
      chr 97  a
      chr 65  A
      chr 98  b
      chr 66  B
      chr 99  c
      chr 67  C
      chr 100 d
      chr 68  D
      chr 101 e
      chr 69  E
      chr 102 f
      chr 70  F
      chr 103 g
      chr 71  G
      chr 104 h
      chr 72  H
      chr 105 i
      chr 73  I
      chr 106 j
      chr 74  J
      chr 107 k
      chr 75  K
      chr 108 l
      chr 76  L
      chr 109 m
      chr 77  M
      chr 110 n
      chr 78  N
      chr 111 o
      chr 79  O
      chr 112 p
      chr 80  P
      chr 113 q
      chr 81  Q
      chr 114 r
      chr 82  R
      chr 115 s
      chr 83  S
      chr 116 t
      chr 84  T
      chr 117 u
      chr 85  U
      chr 118 v
      chr 86  V
      chr 119 w
      chr 87  W
      chr 120 x
      chr 88  X
      chr 121 y
      chr 89  Y
      chr 122 z
      chr 90  Z
      See? A text sort looks nothing whatsoever like a code point sort. If you expect the UCA to do a code-point sort on 7-bit code points but a text sort on everything else, I fear that you have gravely misunderstood its purpose and consequences.

      So what may I do to help you understand this better? I would seriously like to know.

        tchrist,

        Thank you for your explanation/demonstration of how the UCA sort works.

        I have already answered your previous post, and have apologized for mis-quoting the article.

        What matters to the database engine is not whether something prints, but how it compares: 'lt, eq, gt'. The keys must be ordered so that every key before a given key is less than it and every key after it is greater. So looking at your example, it seems that only chr(0) to chr(31) would be a problem.

        I have written 3 database engines in my life: in the 70's in assembler, in the 80's in C, and recently in Perl. Unfortunately, staring at a lot of hex dumps is required (even in Perl). The one thing all of these had in common is that any data passed in by the user must be insertable into the database. So when a database is created, the start key is the "" value (length of 0). This is because the user could put in:

        $key="\0"; $data = "\0";
        which are valid characters. Now, that could be fixed by documenting this behavior. But chr(0) to chr(31) are used for many internal things in the DB engine, and changing their order in sort would be a show stopper.
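A minimal sketch of that concern, assuming a default Unicode::Collate object; the three keys are made up for illustration. The built-in cmp orders them strictly. The sorted listing earlier in this thread leaves chr 0 to chr 8 and chr 14 to chr 31 in their input order, which suggests the collator treats those control characters as ignorable, so distinct keys may compare as equal.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new;

# Hypothetical keys: the empty start key plus two control-character keys.
my ($empty, $nul, $soh) = ("", "\0", "\x01");

# Built-in cmp compares code points, so the keys are strictly ordered.
my $cmp_empty_vs_nul = $empty cmp $nul;
my $cmp_nul_vs_soh   = $nul   cmp $soh;

# The collator's cmp method returns -1, 0 or 1 like the operator,
# but works on collation weights rather than code points.
my $uca_empty_vs_nul = $collator->cmp($empty, $nul);
my $uca_nul_vs_soh   = $collator->cmp($nul, $soh);

print "cmp: '' vs NUL = $cmp_empty_vs_nul, NUL vs SOH = $cmp_nul_vs_soh\n";
print "UCA: '' vs NUL = $uca_empty_vs_nul, NUL vs SOH = $uca_nul_vs_soh\n";
```

If the collator reports 0 for keys that differ byte for byte, it cannot serve as the sole ordering for an index that must keep every distinct key separate. That fits the point made earlier in the thread that the two sorts serve different purposes.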

        Thank you

        "Well done is better than well said." - Benjamin Franklin
