Re: detecting the language of a word?

This is hell. I don't really see how you can fully automate this. Maybe you should run some kind of wiki for the first weeks/months/years, so people can change the language themselves while reading. That way, every time someone reads a document, they can also check and improve it.

For the initial job, I would seriously think about building a short dictionary of common 'foreign' words. Words like 'email' must occur frequently and are easy to catch; rarely used foreign words are basically impossible to catch when they look too much like an existing word.

What I would do is try to find out the 'overall language' of a document. Words that don't comply with your language rules (e.g. a German database) first get checked against this 'common foreign words' database. If they don't match there (you have probably filtered out the larger part by now), add them to a list with references to the documents they are found in.

This list will only contain very rarely used foreign words, and you can handle them manually; thanks to the references, you only have to assign a language once per word, and it applies to every occurrence.
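
To make the triage a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the tiny word sets stand in for a real base-language dictionary and a real 'common foreign words' database, and the names (BASE_WORDS, COMMON_FOREIGN, triage) are made up.

```python
import re
from collections import defaultdict

# Hypothetical toy data; a real setup would load full word lists.
BASE_WORDS = {"dans", "un", "récent", "mon", "frère", "a", "écrit"}
COMMON_FOREIGN = {"email": "en", "computer": "en", "test": "en"}

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware


def triage(doc_name, text, review):
    """Tag common foreign words; queue everything else unknown for review."""
    tagged = {}
    for word in WORD_RE.findall(text.lower()):
        if word in BASE_WORDS:
            continue  # fits the base language, nothing to do
        if word in COMMON_FOREIGN:
            tagged[word] = COMMON_FOREIGN[word]
        else:
            # remember which documents the unknown word was seen in
            review[word].add(doc_name)
    return tagged


review = defaultdict(set)
tagged = triage("blah.html", "Dans un email récent, mon frère a écrit. Gnarf.", review)
```

After this call, tagged holds the words that matched the common foreign words table, and review maps each leftover word to the set of documents it appeared in, which is the list you would work through by hand.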

e.g.

text blah.html:
Dans un email récent, mon frère a écrit "Was die Augen sehen, glaubt das Herz". Il doit l'avoir entendu quelque part. Gnarf.

analyze language --> 70% chance French
--> set base language to 'French'
--> 'email', 'gnarf' and the German sentence don't match

frequent words db email:en computer:en test:en
--> 'email' matches and has its language set to 'en'
--> 'gnarf' and the German sentence don't match, the search continues

try to spellcheck sentences in other languages
--> if you find two spelling mistakes in one sentence, you could try to match that whole sentence against other languages; if the spell check returns (near) zero errors for a certain language, the sentence is most likely in that language.
--> 0 spelling errors for the German sentence with spellGerman, set that sentence to German.
--> 'gnarf' still unmatched

uncommon words and expressions db
--> everything that really doesn't make sense ends up here, with references to where that word is found:
"gnarf" blah.html woof.html foobar.txt

manual intervention
--> 'gnarf' set to German
--> all referenced documents with the word/expression are updated to the chosen language
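
The last two steps of the example could look roughly like this in Python. Again, this is only a sketch under assumptions: the toy LEXICONS stand in for real per-language spell checkers, the 0.2 error threshold is an arbitrary pick, and the function names are invented for this post.

```python
import re

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, Unicode-aware

# Hypothetical toy lexicons standing in for real spell checkers.
LEXICONS = {
    "de": {"was", "die", "augen", "sehen", "glaubt", "das", "herz"},
    "en": {"what", "the", "eyes", "see", "heart", "believes"},
}


def guess_sentence_language(sentence, lexicons, max_error_rate=0.2):
    """Return the language with the fewest unknown words, or None if none fits."""
    words = WORD_RE.findall(sentence.lower())
    if not words:
        return None
    best_lang, best_rate = None, 1.0
    for lang, lexicon in lexicons.items():
        errors = sum(1 for w in words if w not in lexicon)
        rate = errors / len(words)
        if rate < best_rate:
            best_lang, best_rate = lang, rate
    return best_lang if best_rate <= max_error_rate else None


def resolve(word, lang, review, doc_tags):
    """Apply one manual decision to every document that references the word."""
    for doc in review.pop(word, ()):
        doc_tags.setdefault(doc, {})[word] = lang


print(guess_sentence_language("Was die Augen sehen, glaubt das Herz", LEXICONS))  # prints: de

review = {"gnarf": {"blah.html", "woof.html", "foobar.txt"}}
doc_tags = {}
resolve("gnarf", "de", review, doc_tags)  # one manual choice, three documents updated
```

A word like 'gnarf' on its own gets None back from guess_sentence_language, so it stays in the review list until a human decides.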

Hmm. I hope this makes sense. :)

This job is one of the hardest possible to automate, because it requires AI, basically. It's not about 100% matching, but rather fuzzy matching, (human) logic and context. The computer actually has to make sense out of the documents. Good luck with that...

I really suggest some kind of wiki-thing too, though. It will make it so much easier if people who read the document can change the language on-the-fly in case of errors.
