PerlMonks  

Re^5: Sorting Vietnamese text

by farang (Hermit)
on Dec 23, 2013 at 04:28 UTC (#1068136)


in reply to Re^4: Sorting Vietnamese text
in thread Sorting Vietnamese text

sorted
 ỷ : (1) to be fat (said of a pig); (2) to depend on
 ỳ ạch : to toil, labor with difficulty
 ỷ eo : reproach someone with something
 yêu nhau : to love each other, be in love
 yêu quí : precious, valuable
 ỳ : inertia, state of inactivity, stay out, inert, sluggish
 ỷ lại : to depend, rely on others
 ỷ thế : count on one’s power, one’s position, one’s influence
Okay, I also get that output when sorting the entire lines as written. However, cutting each line short at or before the colon ':' gives this:
sorted
ỳ :
ỷ :
ỳ ạch :
ỷ eo :
yêu nhau :
yêu quí :
ỷ lại :
ỷ thế :
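What seems to be going on is this: Vietnamese ordering is worked out syllable by syllable, but the sort compares each line as a whole, looking at the base letters across the entire line first and using the tone marks only to break ties. The English translation after the Vietnamese therefore gets compared as if it were part of the headword, and that is messing up the sorting.

I'd suggest separating them into a hash if possible (splitting on the colon, maybe) so that the sort is based only on the Vietnamese.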
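
A minimal sketch of that suggestion, written in Python rather than Perl. The hand-rolled `flat_key` is only a stand-in for a real collator, and it rests on an assumption: spaces and punctuation are ignored, base letters are compared first, and accents only break ties. It is not Unicode::Collate, though on these eight lines it reproduces both orders shown above. `LINES`, `split_entry` and `flat_key` are illustrative names, not anything from the thread.

```python
import unicodedata

LINES = [
    "ỷ thế : count on one’s power, one’s position, one’s influence",
    "ỳ : inertia, state of inactivity, stay out, inert, sluggish",
    "yêu quí : precious, valuable",
    "ỷ : (1) to be fat (said of a pig); (2) to depend on",
    "ỷ lại : to depend, rely on others",
    "ỳ ạch : to toil, labor with difficulty",
    "yêu nhau : to love each other, be in love",
    "ỷ eo : reproach someone with something",
]

def split_entry(line):
    """Split 'headword : translation' at the first colon."""
    headword, _, translation = line.partition(":")
    return headword.strip(), translation.strip()

def flat_key(text):
    """Assumed collator stand-in: base letters first, accents as tie-break."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Keep letters and digits only; spaces and punctuation are dropped.
    letters = "".join(c for c in decomposed if c.isalnum())
    # Combining marks (tones, circumflex, ...) only matter on a tie.
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return letters, marks

# Hash keyed by the Vietnamese headword, translation as the value.
entries = dict(split_entry(line) for line in LINES)

# Sorting the hash keys looks at the Vietnamese only.
by_headword = sorted(entries, key=flat_key)

# Sorting whole lines lets the English take part in the comparison.
by_whole_line = [split_entry(line)[0] for line in sorted(LINES, key=flat_key)]

for headword in by_headword:
    print(f"{headword} : {entries[headword]}")
```

Here `by_whole_line` comes out in the order of the first listing and `by_headword` in the order of the second. A plain dict keeps one translation per headword, so a file with repeated headwords would need a list of translations as the value instead.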


Re^6: Sorting Vietnamese text
by pdenisowski (Acolyte) on Dec 23, 2013 at 15:03 UTC
    Thanks again, but what I'm looking for is that every word starting with ỳ comes before any word starting with ỷ, so even that sort order isn't quite right. Also, why aren't all the entries with ỷ together? Instead of
    sorted
    ỳ :
    ỷ :
    ỳ ạch :
    ỷ eo :
    yêu nhau :
    yêu quí :
    ỷ lại :
    ỷ thế :
    

    the result should be

    sorted
    ỳ :
    ỳ ạch :
    ỷ :
    ỷ eo :
    ỷ lại :
    ỷ thế :
    yêu nhau :
    yêu quí :
    
    This is how all paper dictionaries do it, regardless of which order they use for the tone marks. I'm beginning to wonder if I'm the only person who's ever cared about this before :)

    By the way, the reason I'm doing this is that I'm planning to release a large (>50,000 words) Vietnamese-English dictionary (as a single UTF-8 file) under the CC license (essentially free to use for any purpose) and I'd like to make it available in "properly" sorted order. I've done similar projects for Chinese, Esperanto, and Interlingua already (see www.denisowski.org), but those are a lot easier to sort :)

    Any other ideas? Thanks again for the help!

      Any other ideas?
      A quick search shows that Vietnamese words are sorted by letters first, then by tone marks. Sometimes the tone marks are even ignored. So this is probably correct:
      ỳ :
      ỷ :
      ỳ ạch :
      See http://vietunicode.sourceforge.net/charset/quytacABC_en.html. It looks like Unicode::Collate does it right, but an additional first-character ordering is required to get dictionary order, which in Vietnamese happens to differ from a simple sorted order.
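
One way to get that extra ordering is to compare entries syllable by syllable, and within each syllable to compare the letters before the tone. A rough Python sketch follows, under stated assumptions: the tone order is level, huyền, hỏi, ngã, sắc, nặng (dictionaries differ, so `TONE_MARKS` is meant to be reordered); ă, â, đ, ê, ô, ơ and ư count as separate letters; and f, j, w and z are slotted in at their Latin positions for loanwords. All names here are illustrative, and this is one reading of the paper-dictionary rule rather than a reference implementation.

```python
import unicodedata

# Assumed tone order: level (no mark), huyền, hỏi, ngã, sắc, nặng.
TONE_MARKS = "\u0300\u0309\u0303\u0301\u0323"

# Assumed alphabet: the Vietnamese letters plus f, j, w, z for loanwords.
ALPHABET = unicodedata.normalize("NFC", "aăâbcdđeêfghijklmnoôơpqrstuưvwxyz")
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def syllable_key(syllable):
    """Return (letter ranks, tone rank) for a single syllable."""
    tone = 0
    rest = []
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS.index(ch) + 1
        else:
            rest.append(ch)
    # Recompose so that e + circumflex becomes the single letter ê again.
    letters = unicodedata.normalize("NFC", "".join(rest))
    # Characters outside the alphabet sort after all known letters.
    ranks = tuple(RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in letters)
    return ranks, tone

def dictionary_key(headword):
    """Compare syllable by syllable; letters outrank tone within a syllable."""
    return [syllable_key(s) for s in headword.split()]

words = ["ỷ thế", "yêu quí", "ỳ", "ỷ eo", "yêu nhau", "ỷ", "ỷ lại", "ỳ ạch"]
in_order = sorted(words, key=dictionary_key)
print("\n".join(in_order))
```

On the eight headwords from this thread, `in_order` is ỳ, ỳ ạch, ỷ, ỷ eo, ỷ lại, ỷ thế, yêu nhau, yêu quí: every ỳ entry before any ỷ entry, and both before yêu, because the bare syllable "y" is a prefix of "yêu".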
