Re: Idiom guessing script

by Albannach (Prior)
on Nov 21, 2005 at 04:07 UTC

in reply to Idiom guessing script

This does not sound like a simple problem, because a title gives you very little data on which to base a decision. It strikes me that you might choose just a few hundred words from each candidate language, words that are both commonly used and largely distinctive of that language. Even this may not work for something like book titles, which do not necessarily reflect common usage (in English at least). If you could get large word lists for different languages (perhaps by sampling some major newspapers?) you could build your own list of such 'indicator words'. I would not keep the languages separate; instead, tag each word in a single list with the language(s) it suggests. Then you can take a poll of the words in your title to get a guess at the language used.
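To make the polling idea concrete, here is a minimal sketch in Python. The tiny INDICATORS table and the function name guess_language are made up for illustration; a real list would be built from large word samples as described above, and would need far more entries to be useful.

```python
from collections import Counter

# Hypothetical indicator list: each word is tagged with the language(s)
# it suggests. A real list would come from large word samples.
INDICATORS = {
    "the": {"en"},
    "and": {"en"},
    "of": {"en"},
    "der": {"de"},
    "und": {"de"},
    "das": {"de"},
    "le": {"fr"},
    "et": {"fr"},
    "les": {"fr"},
    "la": {"fr", "es", "it"},
    "el": {"es"},
    "y": {"es"},
}


def guess_language(title):
    """Poll the words of a title; return (language, votes) pairs, best first."""
    votes = Counter()
    for word in title.lower().split():
        word = word.strip(".,;:!?\"'()")
        # Each indicator word casts one vote for every language it suggests.
        for lang in INDICATORS.get(word, ()):
            votes[lang] += 1
    return votes.most_common()


if __name__ == "__main__":
    print(guess_language("Der Herr der Ringe und das Silmarillion"))
    print(guess_language("The Lord of the Rings"))
```

An empty result means no indicator word matched, which will be common for short titles. Words shared between languages (like "la" here) vote for all of them, so ties are possible and should probably be treated as "unknown".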

On the chance that you are actually talking about book titles, it may help to know that the ISBN issued to a published book starts with a code called the Group Identifier, which denotes a country, region, or language area. While this is not necessarily a reliable indicator of the language, it may be of some use, perhaps to verify a language-based determination, or to help you select which language(s) to test against.
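As a rough sketch of how the Group Identifier could be used as a hint, the Python below looks only at the single-digit groups, which are the ones I am sure of; the many multi-digit groups are deliberately left out and return None. The function name isbn_group_hint is made up, and stripping a leading 978 from a 13-digit ISBN is my own addition for the newer format.

```python
# Single-digit group identifiers only; many multi-digit groups exist
# (and are not listed here), so an unknown group returns None.
SINGLE_DIGIT_GROUPS = {
    "0": "English-speaking area",
    "1": "English-speaking area",
    "2": "French-speaking area",
    "3": "German-speaking area",
    "4": "Japan",
    "5": "former USSR",
    "7": "China",
}


def isbn_group_hint(isbn):
    """Return a rough area hint from an ISBN's group identifier, or None."""
    # Drop hyphens and spaces; keep digits and a possible X check character.
    chars = "".join(ch for ch in isbn if ch.isdigit() or ch in "xX")
    if len(chars) == 13 and chars.startswith("978"):
        # Assumption: a 978-prefixed ISBN-13 carries the same group codes.
        chars = chars[3:]
    elif len(chars) != 10:
        return None
    return SINGLE_DIGIT_GROUPS.get(chars[0])


if __name__ == "__main__":
    print(isbn_group_hint("3-16-148410-0"))
    print(isbn_group_hint("80-85983-44-3"))
```

Since the group reflects where the publisher registered rather than the language of the text, this is best used to narrow the candidate languages or to sanity-check the word poll, not as an answer on its own.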

