PerlMonks  

Re^5: Sorting Vietnamese text

by farang (Hermit)
on Dec 23, 2013 at 04:28 UTC (#1068136)


in reply to Re^4: Sorting Vietnamese text
in thread Sorting Vietnamese text

sorted
 ỷ : (1) to be fat (said of a pig); (2) to depend on
 ỳ ạch : to toil, labor with difficulty
 ỷ eo : reproach someone with something
 yu nhau : to love each other, be in love
 yu qu : precious, valuable
 ỳ : inertia, state of inactivity, stay out, inert, sluggish
 ỷ lại : to depend, rely on others
 ỷ thế : count on one's power, one's position, one's influence
Okay, I also get that output when sorting the entire lines as written. However, cutting those lines short at or before the colon ':' gives this:
sorted
ỳ :
ỷ :
ỳ ạch :
ỷ eo :
yu nhau :
yu qu :
ỷ lại :
ỷ thế :
What seems to be going on is that the sort compares whole lines, so the English translation after the Vietnamese takes part in the comparison. Given the complicated syllable-based rules for ordering in Vietnamese, that is what's messing up the sorting.

I'd suggest trying to separate them into a hash if possible (split on the colon, maybe) so the sort can be based only on the Vietnamese.
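As a rough illustration of the split-on-the-colon idea, here is a minimal sketch in Python rather than Perl. The names `split_entry` and `letters_then_marks` are made up for this example, and the sort key is only a crude stand-in for what Unicode::Collate does (base letters first, diacritics as a tie-breaker), so the exact order may differ from the collator's. The point is just that the English gloss never reaches the comparison.

```python
import unicodedata

# Sample lines copied from the post above.
lines = [
    "ỷ : (1) to be fat (said of a pig); (2) to depend on",
    "ỳ ạch : to toil, labor with difficulty",
    "ỷ eo : reproach someone with something",
    "yu nhau : to love each other, be in love",
    "yu qu : precious, valuable",
    "ỳ : inertia, state of inactivity, stay out, inert, sluggish",
    "ỷ lại : to depend, rely on others",
    "ỷ thế : count on ones power, ones position, ones influence",
]


def split_entry(line):
    """Split 'headword : gloss' at the first colon only."""
    headword, _, gloss = line.partition(":")
    return headword.strip(), gloss.strip()


def letters_then_marks(text):
    """Hypothetical key: base letters first, combining marks as tie-breaker.

    This is an assumption-laden approximation, not the Unicode Collation
    Algorithm: it treats every diacritic (tone or vowel quality) alike.
    """
    decomposed = unicodedata.normalize("NFD", text.casefold())
    letters = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    return (letters, marks)


# Map each Vietnamese headword to its English gloss, then sort the keys only.
entries = dict(split_entry(line) for line in lines)
ordered = sorted(entries, key=letters_then_marks)

for headword in ordered:
    print(f"{headword} : {entries[headword]}")
```

A real dictionary file may well contain the same headword more than once, in which case a list of (headword, gloss) pairs would be safer than a hash keyed on the headword.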


Re^6: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 23, 2013 at 15:03 UTC
    Thanks again, but what I'm looking for is that every word starting with ỳ comes before any word starting with ỷ, so even that sort order isn't quite right. Also, why are all the entries with ỷ not together? Instead of
    sorted
    ỳ :
    ỷ :
    ỳ ạch :
    ỷ eo :
    yu nhau :
    yu qu :
    ỷ lại :
    ỷ thế :
    

    it should be

    sorted
    ỳ :
    ỳ ạch :
    ỷ :
    ỷ eo :
    ỷ lại :
    ỷ thế :
    yu nhau :
    yu qu :
    
    This is how all paper dictionaries do it, regardless of which order they use for the tone marks. I'm beginning to wonder if I'm the only person who's ever cared about this before :)

    By the way, the reason I'm doing this is that I'm planning to release a large (>50,000 words) Vietnamese-English dictionary (as a single UTF-8 file) under the CC license (essentially free to use for any purpose) and I'd like to make it available in "properly" sorted order. I've done similar projects for Chinese, Esperanto, and Interlingua already (see www.denisowski.org), but those are a lot easier to sort :)

    Any other ideas? Thanks again for the help!

      Any other ideas?
      A quick search shows that Vietnamese words are sorted by letters first, then by tone marks. Sometimes tone marks are even ignored. So this is probably correct:
      ỳ :
      ỷ :
      ỳ ạch :
      See http://vietunicode.sourceforge.net/charset/quytacABC_en.html. It looks like Unicode::Collate does it right, but an additional ordering on the first character is required to get dictionary order (which happens to differ from a simple sorted order in Vietnamese).
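One way to get the paper-dictionary order asked for above might be to compare headwords syllable by syllable, so that the tone of the first syllable is settled before the second syllable is looked at. The following is a minimal sketch in Python, not Perl and not Unicode::Collate. `syllable_key` and `dictionary_key` are hypothetical names, and the tone order in `TONE_RANK` (no mark, grave, hook above, tilde, acute, dot below) is an assumption that can be swapped for whichever order a given dictionary uses.

```python
import unicodedata

# Vietnamese alphabet order for base letters; vowel-quality marks
# (ă, â, ê, ô, ơ, ư) and đ count as separate letters here.
ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêghiklmnoôơpqrstuưvxy")
LETTER_RANK = {ch: i for i, ch in enumerate(ALPHABET)}

# ASSUMED tone order: none (0), grave, hook above, tilde, acute, dot below.
TONE_RANK = {"\u0300": 1, "\u0309": 2, "\u0303": 3, "\u0301": 4, "\u0323": 5}


def syllable_key(syllable):
    """Return (letter ranks, tone rank) for one syllable."""
    tone = 0
    stripped = []
    # Decompose so the tone mark becomes a separate combining character.
    for ch in unicodedata.normalize("NFD", syllable.casefold()):
        if ch in TONE_RANK:
            tone = TONE_RANK[ch]
        else:
            stripped.append(ch)
    # Recompose so e + circumflex is ê again, minus the tone mark.
    letters = unicodedata.normalize("NFC", "".join(stripped))
    # Characters outside the alphabet (f, j, w, z, punctuation) sort last.
    ranks = tuple(LETTER_RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in letters)
    return (ranks, tone)


def dictionary_key(headword):
    """Compare headwords one whitespace-separated syllable at a time."""
    return tuple(syllable_key(s) for s in headword.split())


headwords = ["ỳ", "ỷ", "ỳ ạch", "ỷ eo", "yu nhau", "yu qu", "ỷ lại", "ỷ thế"]
ordered = sorted(headwords, key=dictionary_key)
print("\n".join(ordered))
```

With the sample headwords from this thread, every entry starting with ỳ sorts before any entry starting with ỷ, and the ỷ entries stay together. Whether this matches a particular printed dictionary in all cases is untested.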
