Re: detecting the language of a word?
by PhiRatE (Monk) on Dec 07, 2002 at 14:02 UTC
Ok. A couple of points. Anyone who has read a few of my posts knows that I'm a complete nut for SQL, however in this case I don't really think it'll help you. The search set for words is primarily static, and would do pretty well in a standard hash.
To your main problem however, I make a few observations. Firstly, I'm not sure you're ever going to get a 100% accurate method, so if you require that, stop thinking about 100% automatic solutions right now. If on the other hand 95-99% accuracy is ok, then we can proceed.
Secondly, I believe the first step always has to be to deduce the primary language in use. This can either be pre-set, if you're sure all the documents are basically German, or you can do a fairly easy determination based on word percentages against language dictionaries.
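As a rough illustration of the word-percentage idea, here is a minimal sketch in Python. The tiny word sets, and the names `DICTIONARIES`, `tokenize`, `language_scores` and `primary_language`, are all made up for the example; real dictionaries would be loaded from word-list files.

```python
import re

# Toy word lists standing in for real dictionaries (assumption: in practice
# these would be loaded from large word-list files).
DICTIONARIES = {
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "haus"},
    "english": {"the", "and", "is", "not", "a", "house", "this"},
}

def tokenize(text):
    """Lowercase the text and return its runs of letters (umlauts included)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def language_scores(text, dictionaries=DICTIONARIES):
    """Return, per language, the fraction of words found in its dictionary."""
    words = tokenize(text)
    if not words:
        return {lang: 0.0 for lang in dictionaries}
    return {
        lang: sum(w in vocab for w in words) / len(words)
        for lang, vocab in dictionaries.items()
    }

def primary_language(text, dictionaries=DICTIONARIES):
    """Pick the language whose dictionary covers the largest share of words."""
    scores = language_scores(text, dictionaries)
    return max(scores, key=scores.get)
```

With real dictionaries the scores for related languages will overlap, so it is probably worth checking that the winner leads by a clear margin before trusting it.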
The final part, determining which words are foreign, submits nicely to what I call the "shotgun" method. In this method, we take a bunch of good ideas and just apply them all. Some of those below have been suggested by others here, some are available in various language texts on the web:
1. Dictionary scan. Locate all words that are not in a German dictionary, and see if they are in another language dictionary. If they're in one other, flag as that language and move on. If they're in several, note this and continue with the rest of the tests.
2. Digram/trigram scan. A trick borrowed from crypto: take an equivalent set of pure German documentation and generate a table of two- and three-word combinations, with their probability of appearing in pure German text. Applying this table in brute-force fashion across the entirety of the target text should reveal valid German words that are nonetheless out of position, and which may therefore be identically spelt words from another language. A check of the German dictionary against the other dictionaries could then confirm this.
3. If you find that the two above aren't getting you enough accuracy, further crypto/language analysis tricks can be employed: sentence position statistics, form statistics (the word has a capital first letter, suggesting it may be a name rather than a regular word), basic sentence structure stats (the word comes after a known verb with a suggestion of plural, yet uses an English-like plural suffix), foreign language text analysis (this phrase that has turned up in my document has also turned up in a lot of English trade documents), etc.
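To make test 1 concrete, here is a small sketch of the dictionary scan in Python. The function name `dictionary_scan` and the three toy dictionaries are invented for the example. Each word missing from the primary dictionary gets the list of other languages that know it: one entry means "flag and move on", several mean "keep testing", none means the word is unknown everywhere.

```python
def dictionary_scan(words, primary, dictionaries):
    """Map each word missing from the primary dictionary to its candidate languages.

    One candidate: flag as that language. Several candidates: ambiguous, so the
    word goes on to the remaining tests. Empty list: in no dictionary at all.
    """
    primary_vocab = dictionaries[primary]
    results = {}
    for word in words:
        w = word.lower()
        if w in primary_vocab:
            continue  # ordinary word of the primary language
        results[w] = sorted(
            lang
            for lang, vocab in dictionaries.items()
            if lang != primary and w in vocab
        )
    return results


# Toy dictionaries, purely illustrative.
dictionaries = {
    "german": {"das", "ist", "ein"},
    "english": {"meeting", "restaurant"},
    "french": {"restaurant"},
}
text = ["Das", "Meeting", "ist", "ein", "Restaurant", "xyzzy"]
flags = dictionary_scan(text, "german", dictionaries)
```

Here "meeting" comes back with a single candidate, "restaurant" with two, and "xyzzy" with none.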
As you can see, you can get as complex as you like with these tests to boost the % correctness. I imagine however that the first two tricks in concert, combined with the biggest dictionaries you can find (preferably including common names, places etc) and a good source of modern pure German text for the digram/trigram generation, will provide you with all the accuracy you're likely to need. It should fit in RAM in a regular associative array on a decent machine, and even if it doesn't, the paging penalty shouldn't be too great in general. All the above methods should abstract to a database if some kind of dynamic nature is thought to be necessary.
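A minimal sketch of the digram half of test 2, again in Python, keeping the table in an ordinary in-memory hash. The names `bigram_table` and `out_of_position` are made up, the one-line "corpus" is a toy, and raw pair counts stand in for proper probabilities. A word is flagged only when neither the pair with its left neighbour nor the pair with its right neighbour was seen often enough in the reference text.

```python
from collections import Counter

def bigram_table(reference_words):
    """Count adjacent word pairs in a reference corpus of pure German text."""
    return Counter(zip(reference_words, reference_words[1:]))

def out_of_position(words, table, min_count=1):
    """Return indices of words with no sufficiently common pair on either side.

    Assumption: raw counts are used instead of probabilities, and min_count
    is a threshold you would tune against a real corpus.
    """
    flagged = []
    for i, w in enumerate(words):
        left_ok = i > 0 and table[(words[i - 1], w)] >= min_count
        right_ok = i + 1 < len(words) and table[(w, words[i + 1])] >= min_count
        if not left_ok and not right_ok:
            flagged.append(i)
    return flagged


# Toy reference corpus; a real table would be built from a lot of German text.
reference = "das kind war im haus und das kind war müde".split()
table = bigram_table(reference)

target = "das kind war im war room".split()
suspects = out_of_position(target, table)
```

In this toy run the first "war" sits in a familiar position and passes, while the second "war" and "room" are flagged. The second "war" is exactly the case described above: a valid German word that is out of position, and so a candidate for the dictionary cross-check.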
Do not be afraid to think outside the box with the shotgun method. Any and all tests you can add, no matter how weird, may help (if I do a search on this word/phrase in Google, do I get a high percentage of German language pages in response?).