http://www.perlmonks.org?node_id=1068197


in reply to Re^5: Sorting Vietnamese text
in thread Sorting Vietnamese text

Thanks again, but what I'm looking for is that every word starting with ỳ comes before any word starting with ỷ, so even that sort order isn't quite right. Also, why aren't all the entries with ỷ together? Instead of
sorted
ỳ :
ỷ :
ỳ ạch :
ỷ eo :
yêu nhau :
yêu quí :
ỷ lại :
ỷ thế :

it should be

sorted
ỳ :
ỳ ạch :
ỷ :
ỷ eo :
ỷ lại :
ỷ thế :
yêu nhau :
yêu quí :
This is how all paper dictionaries do it, regardless of which order they use for the tone marks. I'm beginning to wonder if I'm the only person who's ever cared about this before :)
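Put another way, the order above amounts to comparing entries syllable by syllable: within each syllable the base letters are compared first and the tone mark breaks ties, and only then does the next syllable matter. Here is a minimal Python sketch of that idea (the thread itself is about Perl, but the same sort key should carry over). The names `syllable_key` and `entry_key` are hypothetical, and the tone order used (huyền, hỏi, ngã, sắc, nặng) is an assumption, since dictionaries differ on it.

```python
import unicodedata

# Assumed tone order: huyền, hỏi, ngã, sắc, nặng (no tone sorts first).
# Dictionaries differ on this; reorder the list to match yours.
TONE_MARKS = ["\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]
TONE_RANK = {mark: i + 1 for i, mark in enumerate(TONE_MARKS)}

# Vietnamese alphabet, NFC-normalized so each letter is a single code point.
ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")
LETTER_RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def syllable_key(syllable):
    """Return (letter ranks, tone ranks) for one syllable."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tones = tuple(TONE_RANK[c] for c in decomposed if c in TONE_RANK)
    # Drop tone marks, then recompose so ê, ă, ơ etc. are single letters again.
    base = unicodedata.normalize(
        "NFC", "".join(c for c in decomposed if c not in TONE_RANK))
    # Characters outside the alphabet sort after it, by code point.
    letters = tuple(LETTER_RANK.get(c, len(ALPHABET) + ord(c)) for c in base)
    return (letters, tones)


def entry_key(entry):
    """Compare entries syllable by syllable, not character by character."""
    return tuple(syllable_key(s) for s in entry.split())


words = ["ỳ", "ỷ", "ỳ ạch", "ỷ eo", "yêu nhau", "yêu quí", "ỷ lại", "ỷ thế"]
for word in sorted(words, key=entry_key):
    print(word)
```

With the eight sample entries this prints them in the "should be" order listed above. Because the key is a tuple per syllable, a bare "ỳ" sorts before "ỳ ạch", and both sort before anything whose first syllable is "ỷ".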

By the way, the reason I'm doing this is that I'm planning to release a large (>50,000 words) Vietnamese-English dictionary (as a single UTF-8 file) under a CC license (essentially free to use for any purpose) and I'd like to make it available in "properly" sorted order. I've done similar projects for Chinese, Esperanto, and Interlingua already (see www.denisowski.org), but those are a lot easier to sort :)

Any other ideas? Thanks again for the help!

Re^7: Sorting Vietnamese text
by Atacama (Sexton) on Dec 25, 2013 at 04:28 UTC
    Any other ideas?
    A quick search shows that Vietnamese words are sorted by letters first, then by tone marks; sometimes the tone marks are ignored altogether. So this order is probably correct:
    ỳ :
    ỷ :
    ỳ ạch :
    http://vietunicode.sourceforge.net/charset/quytacABC_en.html
    It looks like Unicode::Collate does it right, but an additional first-character ordering is required to get dictionary order, which in Vietnamese happens to differ from a simple sorted order.
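To make the "letters first, then tone marks" rule concrete, here is a rough Python sketch of a key that compares all base letters of an entry before looking at any tone mark. It is only an approximation for illustration, not a description of how Unicode::Collate works internally. The name `letters_first_key` is hypothetical, the tone order is an assumption, and ignoring spaces is also an assumption, chosen because it reproduces the sorted output posted earlier in the thread.

```python
import unicodedata

# Assumed tone order: huyền, hỏi, ngã, sắc, nặng (no tone sorts first).
TONES = "\u0300\u0309\u0303\u0301\u0323"
ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")


def letters_first_key(entry):
    """Compare every base letter first; tone marks only break ties."""
    decomposed = unicodedata.normalize("NFD", entry.lower())
    tones = tuple(TONES.index(c) + 1 for c in decomposed if c in TONES)
    # Strip tone marks and spaces, then recompose ê, ă, ơ etc.
    base = unicodedata.normalize(
        "NFC",
        "".join(c for c in decomposed if c not in TONES and not c.isspace()))
    letters = tuple(
        ALPHABET.index(c) if c in ALPHABET else len(ALPHABET) + ord(c)
        for c in base)
    return (letters, tones)


words = ["ỷ thế", "yêu quí", "ỳ ạch", "ỷ", "ỷ lại", "ỳ", "yêu nhau", "ỷ eo"]
for word in sorted(words, key=letters_first_key):
    print(word)
```

Under these assumptions "ỷ lại" and "ỷ thế" land after the "yêu" entries, because with spaces ignored "yl" and "yt" compare after "yê". That is why the ỷ entries are split up in a whole-string sort, and why a per-syllable comparison is needed to get the paper-dictionary order.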