Re: detecting the language of a word?
by december (Pilgrim) on Dec 07, 2002 at 08:50 UTC
This is hell. I don't really see how you can automate this. Maybe you should run some kind of wiki for the first weeks/months/years, so people can change the language themselves when reading it. That way, every time someone reads the document, they can also check and improve it.
For the initial job, I would seriously think about making a short dictionary of common 'foreign' words. Words like 'email' must occur frequently and are easy to catch; rarely used foreign words are basically impossible to catch when they look too much like an existing word.
What I would do is try to find out the 'overall language' of a document. Words that don't comply with your language rules (e.g. a German database) first get checked against this 'common foreign words' database. If they don't match there (you have probably filtered out the larger part by now), add them to a list with a reference to the documents they are found in.
This list will only contain very rarely used foreign words, and you can do them manually; thanks to the references, you only have to assign a language once to cover every occurrence.
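To make the triage above a bit more concrete, here is a rough sketch in Python. The word lists, the function name `triage` and the document ids are all made up for illustration; a real setup would load a proper dictionary and a much larger 'common foreign words' table.

```python
from collections import defaultdict

# Hypothetical stand-ins for real word lists (assumptions, not real data).
MAIN_WORDS = {"das", "ist", "ein", "und", "bitte", "senden", "sie", "mir", "eine"}
COMMON_FOREIGN = {"email": "en", "computer": "en", "test": "en"}

def triage(documents, main_words=MAIN_WORDS, common_foreign=COMMON_FOREIGN):
    """Split unknown words into auto-tagged ones and ones left for manual review.

    documents: dict mapping a document id to its text.
    Returns (tagged, review): tagged maps word -> language code,
    review maps word -> set of document ids where the word was seen.
    """
    tagged = {}
    review = defaultdict(set)
    for doc_id, text in documents.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if not word or word in main_words:
                continue  # fits the overall language, nothing to do
            if word in common_foreign:
                tagged[word] = common_foreign[word]  # frequent foreign word
            else:
                review[word].add(doc_id)  # rare word: decide by hand, once
    return tagged, dict(review)

docs = {
    "a.txt": "Bitte senden Sie mir eine Email.",
    "b.txt": "Das ist ein Rendezvous und ein Test.",
}
tagged, review = triage(docs)
print(tagged)
print(review)
```

Because the review list is keyed by word, one manual decision per word can then be applied to every document that references it.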
overall language of the document --> e.g. 70% chance French
frequent words db: email:en, computer:en, test:en
try to spellcheck sentences in other languages
uncommon words and expressions db
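For the 'overall language' step, a very crude guess can be made by counting how many words of the text show up in each language's word list. This is only a sketch under that assumption: `guess_language` and the tiny word lists are hypothetical, and the scores are shares of recognised words, not real probabilities.

```python
def guess_language(text, wordlists):
    """Return a dict of language -> share of recognised words.

    wordlists: dict mapping a language code to a set of known words.
    The shares sum to 1.0 unless no word is recognised at all.
    """
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    # Count, per language, how many words of the text are in its list.
    hits = {lang: sum(w in vocab for w in words)
            for lang, vocab in wordlists.items()}
    total = sum(hits.values())
    if total == 0:
        return {lang: 0.0 for lang in wordlists}
    return {lang: n / total for lang, n in hits.items()}

# Tiny illustrative word lists (assumptions, far too small for real use).
WORDLISTS = {
    "fr": {"le", "la", "est", "et", "un", "chat"},
    "en": {"the", "is", "and", "a", "cat", "email"},
}
scores = guess_language("Le chat est un email", WORDLISTS)
print(max(scores, key=scores.get), scores)
```

Words that score for a language other than the winner are exactly the candidates to run through the frequent words db next.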
Hmm. I hope this makes sense. :)
This job is one of the hardest possible to automate, because it requires AI, basically. It's not about 100% matching, but rather fuzzy matching, (human) logic and context. The computer actually has to make sense out of the documents. Good luck with that...
I really suggest some kind of wiki-thing too, though. It will make it so much easier if people who read the document can change the language on-the-fly in case of errors.