http://www.perlmonks.org?node_id=1068133


in reply to Re^3: Sorting Vietnamese text
in thread Sorting Vietnamese text

Sorry, still doesn't work -- see below. For example, I would expect "ỳ " (note space) to come before "ỳ ạch".

This is what I've been struggling with for a LONG time :)

unsorted
 ỷ : (1) to be fat (said of a pig); (2) to depend on
 ỳ : inertia, state of inactivity, stay out, inert, sluggish
 ỳ ạch : to toil, labor with difficulty
 ỷ eo : reproach someone with something
 ỷ lại : to depend, rely on others
 ỷ thế : count on one’s power, one’s position, one’s influence
 yêu nhau : to love each other, be in love
 yêu quí : precious, valuable

sorted
 ỷ : (1) to be fat (said of a pig); (2) to depend on
 ỳ ạch : to toil, labor with difficulty
 ỷ eo : reproach someone with something
 yêu nhau : to love each other, be in love
 yêu quí : precious, valuable
 ỳ : inertia, state of inactivity, stay out, inert, sluggish
 ỷ lại : to depend, rely on others
 ỷ thế : count on one’s power, one’s position, one’s influence

Re^5: Sorting Vietnamese text
by farang (Chaplain) on Dec 23, 2013 at 04:28 UTC

    sorted
     ỷ : (1) to be fat (said of a pig); (2) to depend on
     ỳ ạch : to toil, labor with difficulty
     ỷ eo : reproach someone with something
     yêu nhau : to love each other, be in love
     yêu quí : precious, valuable
     ỳ : inertia, state of inactivity, stay out, inert, sluggish
     ỷ lại : to depend, rely on others
     ỷ thế : count on one’s power, one’s position, one’s influence
    
    Okay, I also get that output when using the entire lines as written. However, cutting those lines short at or before the colon ':' gives this.
    sorted
    ỳ :
    ỷ :
    ỳ ạch :
    ỷ eo :
    yêu nhau :
    yêu quí :
    ỷ lại :
    ỷ thế :
    
    What seems to be going on is that due to the complicated rules for ordering in Vietnamese based on syllables, having the English translation after the Vietnamese is messing up the sorting.

    I'd suggest trying to separate them into a hash if possible (split on the colon, maybe) so the sort can be based only on the Vietnamese.
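A minimal sketch of why splitting on the colon changes the result, written in Python purely for illustration (the thread itself uses Perl). It assumes the collator compares letters and digits first, skips spaces and punctuation, and lets diacritics break ties only. `collation_key` is a hypothetical helper that approximates this behaviour; it is not the real Unicode Collation Algorithm, and the tie-break by code point is an assumption.

```python
import unicodedata

def collation_key(text):
    """Rough two-level key: letters and digits first, diacritics second.

    Assumption: spaces and punctuation are dropped entirely, and combining
    marks only break ties, ordered by code point.
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    letters = "".join(ch for ch in decomposed if ch.isalnum())
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    return (letters, marks)

entries = [
    "ỷ : (1) to be fat (said of a pig); (2) to depend on",
    "ỳ : inertia, state of inactivity, stay out, inert, sluggish",
    "ỳ ạch : to toil, labor with difficulty",
    "ỷ eo : reproach someone with something",
    "ỷ lại : to depend, rely on others",
    "ỷ thế : count on one’s power, one’s position, one’s influence",
    "yêu nhau : to love each other, be in love",
    "yêu quí : precious, valuable",
]

# Sort on the whole line: the English gloss leaks into the comparison.
by_line = [e.partition(" : ")[0] for e in sorted(entries, key=collation_key)]

# Sort on the Vietnamese headword only (the text before the colon).
by_headword = sorted((e.partition(" : ")[0] for e in entries), key=collation_key)

print(by_line)
print(by_headword)
```

Under these assumptions the whole-line sort sees "ỳ : inertia" as roughly "yinertia", which lands after "yêu quí", while the headword-only sort keeps the gloss out of the comparison. The same key idea should carry over to a Perl sort on the part before the colon.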

      Thanks again, but what I'm looking for is that every word starting with ỳ comes before any word starting with ỷ, so even that sort order isn't quite right. Also, why are all the entries with ỷ not together? Instead of
      sorted
      ỳ :
      ỷ :
      ỳ ạch :
      ỷ eo :
      yêu nhau :
      yêu quí :
      ỷ lại :
      ỷ thế :
      

      should be

      sorted
      ỳ :
      ỳ ạch :
      ỷ :
      ỷ eo :
      ỷ lại :
      ỷ thế :
      yêu nhau :
      yêu quí :
      
      This is how all paper dictionaries do it, regardless of which order they use for the tone marks. I'm beginning to wonder if I'm the only person who's ever cared about this before :)
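The paper-dictionary order described above can be sketched as a syllable-by-syllable comparison: within each syllable, compare the letters first and the tone second, and only then move on to the next syllable. The Python below is an illustrative sketch, not a tested collation library. `ALPHABET`, `TONE_MARKS`, `syllable_key`, and `headword_key` are hypothetical names, and the tone order (level, huyền, hỏi, ngã, sắc, nặng) is an assumption that can be swapped for whichever order a given dictionary uses.

```python
import unicodedata

# Assumed letter order of the Vietnamese alphabet (f, j, w, z omitted).
ALPHABET = "aăâbcdđeêghiklmnoôơpqrstuưvxy"
LETTER_RANK = {letter: rank for rank, letter in enumerate(ALPHABET)}

# Assumed tone order: level (no mark), huyền, hỏi, ngã, sắc, nặng.
TONE_MARKS = "\u0300\u0309\u0303\u0301\u0323"
TONE_RANK = {mark: rank for rank, mark in enumerate(TONE_MARKS, start=1)}

def syllable_key(syllable):
    """Return (letter ranks, tone rank) for one syllable."""
    decomposed = unicodedata.normalize("NFD", syllable.lower())
    tone = 0
    kept = []
    for ch in decomposed:
        if ch in TONE_RANK:
            tone = TONE_RANK[ch]   # tone mark: remember it, drop it from the letters
        else:
            kept.append(ch)        # base letter or letter-forming mark (ê, ơ, ă ...)
    letters = unicodedata.normalize("NFC", "".join(kept))
    # Characters outside the assumed alphabet sort after it, by code point.
    ranks = tuple(LETTER_RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in letters)
    return (ranks, tone)

def headword_key(headword):
    """Compare headwords one syllable at a time."""
    return tuple(syllable_key(s) for s in headword.split())

headwords = ["ỷ", "ỳ", "ỳ ạch", "ỷ eo", "ỷ lại", "ỷ thế", "yêu nhau", "yêu quí"]
dictionary_order = sorted(headwords, key=headword_key)
print(dictionary_order)
```

Because the key is a tuple of per-syllable keys, a one-syllable headword sorts before any longer headword that begins with the same syllable, and the tone of the first syllable is settled before the second syllable is ever looked at.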

      By the way, the reason I'm doing this is that I'm planning to release a large (>50,000 words) Vietnamese-English dictionary (as a single UTF-8 file) under the CC license (essentially free to use for any purpose) and I'd like to make it available in "properly" sorted order. I've done similar projects for Chinese, Esperanto, and Interlingua already (see www.denisowski.org), but those are a lot easier to sort :)

      Any other ideas? Thanks again for the help!

        Any other ideas?
        A quick search shows that Vietnamese words are sorted by letters first, then tone marks. Sometimes tone marks are even ignored. So that's probably correct:
        ỳ :
        ỷ :
        ỳ ạch :
        See http://vietunicode.sourceforge.net/charset/quytacABC_en.html. It looks like Unicode::Collate does it right, but additional first-character ordering is required to get dictionary order (which happens to differ from a simple sorted order in Vietnamese).
Re^5: Sorting Vietnamese text
by Jim (Curate) on Dec 23, 2013 at 03:35 UTC

    Are you absolutely certain your text is Unicode (UTF-8)? It's not TCVN (CP1258, ISO-2022-VN or EUC-VN), is it?

    I'm sorry if this is an "Is the power cord plugged in?" kind of question, but it just doesn't make sense that you're getting different output than farang got.
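One quick way to check the encoding question is to try a strict UTF-8 decode of the raw bytes. The sketch below is in Python for illustration, and `looks_like_utf8` is a hypothetical helper name. A successful decode is only a necessary condition: pure ASCII and some legacy byte sequences also decode cleanly, so a pass is strong evidence rather than proof.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# Sample bytes; for a real file, read it in binary mode and pass the bytes in.
utf8_sample = "ỳ ạch".encode("utf-8")
legacy_sample = b"y\xcc \xe0ch"   # single-byte legacy encoding, not valid UTF-8

print(looks_like_utf8(utf8_sample))
print(looks_like_utf8(legacy_sample))
```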

    Jim