Re^4: best sort

Re^5: best sort by tchrist (Pilgrim) on Aug 16, 2011 at 13:27 UTC
Then ignore the matter of letters with diacritics for now, since you seem either not to use them or else not to care if they change something from being the same letter and send it scurrying off to a completely different segment of your output. Instead, just look at a classic “dictionary sort”, where you fold/ignore case and ignore everything but alphanumerics. That is the way phonebooks and card catalogues have historically been ordered in English, since way back before computers even existed. It is useful. You can kind of get that running the shell `sort -dfu` command, but that program is so obsessively oriented toward whitespace separated fields that you need to trick it into using a nonexistent separator, like here using Control-C: `$ perl -nle 's/^\s*#\s*define\s+// and print' perl/*.h | sort -t^C -df` (output excerpts inside the `<readmore>`) See how useful that is? That’s why we’ve done it that way for hundreds of years, because it helps. You certainly don’t need Unicode to demonstrate this principle, as it applies to any text no matter the character repertoire. I should probably have been clearer about this, because it is an important point: In these days of bizarre branding using eyecatching typography to get your attention like “CamelCase” and “StUdLyCaPs”, free variation in spacing and hyphenation like “Post-it® Notes” and “postit notes”, and trademarks with non‐ASCII in them like “Häagen‐Dazs®” with its meaningless diacritic or “Encyclopædia Britannica” with its old‐school ligature, it is perhaps more important today than ever before that we have easy access to collation algorithms able to treat “`FOOBAR`”, “`__foobar__`”, “`Foo Bar`”, “`foo-bar`”, and “`ƒöøbɐƦ`” as minor variations of the same underlying sequence of six basic letters. It is ultra‐useful to be able to think of things that ignore things like diacritics, casing, and non‐alphanumerics.
There is nothing new here, since this has historically been one ordering option frequently used by lexicographers, and it remains completely useful today. With more characters in our repertoires than ever before, it’s much harder to group them the way manual sorters have always done it in the past. But we still want to do so. Those are some of the appeals of the `Unicode::Collate` module for sorting text. I’ll have to think about how to get this major point across more effectively, because I don’t seem to have done so yet.
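For anyone who wants to try the dictionary ordering described in the post above, here is a minimal Perl sketch, not the poster's own code. The helper names `dictionary_key` and `dictionary_sort` are hypothetical, and the choice of `level => 1` for `Unicode::Collate` is an assumption made for illustration (it asks the collator to compare primary weights only, which should disregard case and accents). The hand-rolled key decomposes with NFKD, lowercases, and then keeps only letters and digits; it is a rough approximation, not a replacement for a real collation algorithm.

```perl
use strict;
use warnings;
use utf8;                          # the string literals below are UTF-8
use feature 'unicode_strings';     # Unicode rules for lc() on every string
use Unicode::Normalize qw(NFKD);   # core module
use Unicode::Collate;              # core module

# Hypothetical helper: reduce a string to a "dictionary" key.
sub dictionary_key {
    my ($text) = @_;
    my $key = lc NFKD($text);         # decompose accented letters, fold case
    $key =~ s/[^\p{L}\p{N}]+//g;      # drop marks, spaces, punctuation
    return $key;
}

# Hypothetical helper: sort by dictionary key, with the original string
# as a tie-breaker so the result is deterministic.
sub dictionary_sort {
    my @items = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
           map  { [ dictionary_key($_), $_ ] } @items;
}

binmode(STDOUT, ':encoding(UTF-8)');

my @words = ('Zebra', '_banana', 'apple', 'Häagen-Dazs', 'haagen dazs',
             'Foo Bar', 'foo-bar', 'FOOBAR');

print "code point order:  ", join(', ', sort @words), "\n";
print "dictionary key:    ", join(', ', dictionary_sort(@words)), "\n";

# The same idea using Unicode::Collate; level => 1 is an assumed setting.
my $collator = Unicode::Collate->new(level => 1);
print "Unicode::Collate:  ", join(', ', $collator->sort(@words)), "\n";
```

Running it prints the three orderings side by side, which makes it easy to see where plain code point order scatters entries that a dictionary-style ordering keeps together.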
Re^6: best sort by Tanktalus (Canon) on Aug 16, 2011 at 13:53 UTC
“I’ll have to think about how to get this major point across more effectively, because I don’t seem to have done so yet.” I'm not sure you'll be able to. And this isn't a slam against BrowserUk: for the most part, I'm in agreement with him. What you're successfully doing is showing where the naive (notice only one dot above the i here - I doubt anyone is confused as to what word that is) sort is insufficient. What you're unsuccessfully doing is showing where, for 95%+ (I suspect BrowserUk to be close to accurate with his 1-2% estimate where your details become important) of the time, I should still care. In my ~16 years of paid programming work, I have not yet encountered a time where a naive sort is insufficient for the work at hand. Now, granted, the first ~2 years was as a student, and those companies produced English-only output, and no names were involved, so 7-bit ASCII was more than sufficient. For the last 14 years, I've worked in I18N/L10N-enabled software, though, and it still doesn't come up. Sorting doesn't come up often, but, thus far, the locale-sensitive order has been overkill. (Another team on the same product has integrated ICU for doing sorting where it matters, but that's something like 3 or 4 developers out of a team of 300, right in that 1-2% pocket that BrowserUk mentioned.) Obviously, locale-sensitive ordering is a passion of yours. Where it seems to me that you're failing to come across (and I can't speak for whether this would get BrowserUk or anyone else to your side or not, I can only speak for myself) is that you're speaking from your perspective, not mine. That can be a Hard Thing™ to do. When is it, in the other 90%+, that I should care about locale-based collation?
I mean, I'm glad there's a module that handles 90%+ of the cases where I do care about locale sensitive collation, but if I only need that 2% of the time, I'm not going to incur the overhead of figuring out what, if any, parameters need to be passed in to Unicode::Collate's constructor, or the runtime overhead, for the other 98% (I assume that if the naive sort is sufficient that U::C's constructor doesn't need any parameters, though I don't know that yet). Hope that helps. Update: I suppose it didn't help. Ah well, people who don't respond in good faith don't seem to want responses, so I'll just leave it at this.
Re^7: best sort by tchrist (Pilgrim) on Aug 16, 2011 at 16:15 UTC
“the naive (notice only one dot above the i here - I doubt anyone is confused as to what word that is)” Why certainly: it’s clearly the one that rimes with waive and glaive, of course. I happen to be familiar with the rools of English orθograφy, you know. I doubt anyone is confused as to what the words I wrote are, either; that’s hardly the point.
Re^6: best sort by BrowserUk (Patriarch) on Aug 16, 2011 at 14:56 UTC
“I’ll have to think about how to get this major point across more effectively, because I don’t seem to have done so yet.” For my part, you have got across the message that when (if) I need to sort text for human lookup, your module is the way to go. The salient part for me is the italicised part of that sentence. My criticism of your advertising blurb is a matter of emphasis. It is that you (way) over-emphasise the frequency that dictionary ordering is an important part of the use of sort. It pretty much completely ignores the many algorithmic uses of sorting. As for my (lack of) use of Unicode. For the most part, I do not have any need of it. Further, I find that the emphasis to embrace Unicode is misbegotten. The lingua-franca of computing (and science in general) is English. If you are a biologist, then you need to have a working knowledge of Latin in order to be able to understand and interpret (pronounce and remember:) the biological classification system. If you are a musician, you pretty much need to be able to read music to be able to communicate with other musicians. And if you are a computer scientist, you need to be able to read and write in English in order to be able to use resources like the IETF RFCs. They manage to express a whole heap of very complex ideas using nothing more than 7-bit ASCII. Even if you translated them into all the world's languages, the task of verifying that they were all in technical accordance would be impossible. Whilst with the advent of the consumerist WWW and global markets, programs need to be able to deal with the full range of the world's writing systems, programs should, for the most part, treat non-ASCII text -- names, addresses etc -- as opaque binary packets to be received from the user and presented back without analysis or translation. In contrast to your assertions above, IMO, Unicode is not text. (How could you possibly sort Chinese, Japanese, Thai, Russian, Arabic and Farsi names into your schema?).
It is a set of (incompatible) binary standards. And very bad ones at that. The absence of any mechanism to determine if a file of data is Unicode; and if it is, which of the many forms of Unicode it might be; is frankly ludicrous. It is like taking all the image file formats and stripping out their type headers. Unicode is a mess. A (set of) kludged together, interim solutions that have been promoted to a (set of) standards that should never have been. Far worse (IMO) than the code-page mechanism. Whilst I am in awe of your efforts to make sense of the whole mess and to render it vaguely usable from Perl, in the long term I think that such efforts (across the industry rather than yours specifically), may be counterproductive. The problem is, that with usability -- even as limited as it is -- comes longevity. Which means that better solutions will not be sought, much less adopted. Do a search in your favourite search engine for `unicode wrong` to see the mess that is Unicode. Anything that is that easy to get wrong should have been allowed to die a natural death. As with many previous bad fads -- pet rocks, glassless glasses, bell-bottom trousers and Y2K hysteria -- my apparently lone voice will be seen as swimming against the flow, but don't forget which direction the survivors in The Poseidon Adventure went in :) History will tell. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
Re^7: best sort by tchrist (Pilgrim) on Aug 16, 2011 at 16:05 UTC
Which human languages are you fluent in, and which ones are you merely competent in? Also, how much higher mathematics have you done? That’s not a flip question, nor a prying one. It is directly relevant to the discussion at hand, because its answer makes a difference. You see, I’m trying to understand your biases. I know what it looks like, but I’m hoping I’m wrong about it. That’s of course why I’ve asked. If you never have cause to use anything that God didn’t put into ASCII because you’re an English monoglot who refuses to spell imported words or even people’s names the way they wish them spelled... if you plan to stay in your humble hamlet the rest of your life... if you have no cause for such specialist characters as dashes, curly quotes, degree symbols, or the extraordinarily rich set of symbols needed by scientists and mathematicians... and if you don’t care to interact with people who do... ...then it makes perfect sense that you will have a very different set of biases compared with people who actually do any of those things. All I keep hearing is the same old grumpy-grampa story about walking to school in the rain uphill every day both ways — that is, that ASCII was good enough for you when you were a tad, so by gawd awlmighty it should be good enough for these selfdeluding young whippersnappers. You are also brazen in your profoundly disturbing advocacy of the offensive position that everybody else in the world should learn your bloody language instead of ever once admitting that just maybe you ought to learn theirs. As for Unicode, just because you don’t understand it or don’t like it doesn’t mean there is something “wrong” with it. Most of your statements about it are either flat-out wrong or so misleading and misrepresentative as to make any rational person wonder why you would be intentionally deceptive. Unicode is not going away.
You’ll be dead long before it is, something pretty much guaranteed by your “over my dead body” attitude about condescending to learning anything new. Unicode is here, and it’s here to stay, and no amount of old FUDdy bellyaching from you is going to change that. That means you jolly well ought to get used to it. Either that, or retire and crawl back into your tiny little hole and die. Your choice. Some of us prefer to engage the world, not fight against it. And that’s our choice. I haven’t seen you doing anything to try to make Perl better, whether in its Unicode handling, its text processing, or indeed anything else. I haven’t seen any bug reports, patches, or even questions. And I certainly haven’t seen any feedback from you during the review period for the various public issues that come up. That gives the appearance that all you want to do is complain, and that makes you part of the problem set, not the solution set. If you won’t work with the rest of us to make the world a better place, then at least have the human decency to stop trying to make it a worse one — just let us go about our own business unhindered and unharangued. Perhaps I’m wrong. If so, it’s perfectly easy for you to show me that in a way that is publicly credible. Merely publish your full legal name like the rest of us responsible internet citizens do, and I’ll search the relevant discussion archives for your constructive participation in these matters. As soon as I find it, I’ll gladly reconsider my position. Frankly, I’m looking forward to it, because the alternative is pretty sickening. Otherwise you’re just another greyheaded internet loudmouth who sickly enjoys bitching and attacking just to wind people up and waste their time, and who can’t be bothered to do one damn thing toward bettering the situation he won’t stop ranting about. In other words, put up or shut up.
One or another of those two little spirits who sits upon our shoulders feeding us counsel is whispering that the smart money says you’ll do neither, but we shall see what we shall see, shan’t we now? My cards are on the table for everyone to look at; time for you to show yours.
Re^8: best sort by BrowserUk (Patriarch) on Aug 16, 2011 at 20:20 UTC
Re^9: best sort by tchrist (Pilgrim) on Aug 18, 2011 at 00:08 UTC
Some notes below your chosen depth have not been shown here
Re^5: best sort by jdporter (Paladin) on Aug 19, 2011 at 04:18 UTC
“You use emotive terms.... [you] pour scorn...” Well if that ain't the pot calling the lily black...
Re^6: best sort by BrowserUk (Patriarch) on Aug 19, 2011 at 05:04 UTC
The distinction: I'm not writing for "the draft manuscript of Programming Perl, 4ᵗʰ edition". I'm not about to publish the notion that "sort, ... does is something surprisingly limited usefulness.", when at least 90% of algorithmic uses of sort have no need of a limited international dictionary ordering.

