in reply to
detecting the language of a word?
Many of these technical approaches have merit. However, as one very astute monk pointed out, if the problem were easy (or even automatable), Babelfish wouldn't be so hilariously inadequate.
In terms of approach, I think that you can consider several different lines of attack that would allow you to automate most of the markup, if not all of it:
- Identify key foreign words for automatic flagging (can you always assume that the word email indicates an email address?). In doing your research, you've probably identified the basic words that should get caught in order to avoid howlers. Work on that list to make sure you're not missing anything obvious.
- Look for patterns in foreign word usage. This will require more intuition than anything else, but I would guess, again, that you are beginning to develop a feel for where foreign words are likely to occur. Use automated tools to look for and flag those pages/sections for manual follow-up.
- In my very limited experience, I would guess that these types of words will tend to occur in 1) headers, 2) footers, and 3) business and IT terminology. Headers and footers are where you are likely to find contact information, and business and IT terminology tends to be dominated by English (despite the ongoing French crusade to use the word ordinateur).
- If you think that you need to mount what is essentially a dictionary attack, then, to my mind, you need to look at ways to streamline the attack. Could you start off by making the (admittedly arbitrary) decision that words of fewer than five characters are either 1) not in a foreign language, or 2) not significant enough to be worth looking up in a foreign language? This could rapidly reduce the number of lookups that you need to do on any given page.
- Or, you could again take a contextual approach and mount a dictionary attack based on words of, say, ten characters or more, working from the assumption that foreign words will occur in clumps and that at least one of those words will be more than nine characters in length. Then you are looking to do manual follow-up only for sections flagged as containing a foreign word of ten or more characters. Over time, you could streamline your parser to ignore sections already flagged as containing a foreign language and gradually reduce the length of the words that you examine for foreign content.
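
To make the last two ideas concrete, here is a rough sketch of the length-filtered dictionary attack, written in Python purely for illustration. Everything in it is my own assumption rather than anything from the thread: the tiny FOREIGN_WORDS set stands in for a real dictionary, and the names candidate_words and flag_sections are hypothetical. It only looks up words at or above a minimum length, flags any section with at least one hit, and skips sections that were flagged on an earlier pass.

```python
import re

# Hypothetical stand-in for a real foreign-language dictionary lookup.
FOREIGN_WORDS = {"ordinateur", "renseignements", "courriel"}

# Runs of letters only (Unicode-aware): no digits, no underscores.
WORD_RE = re.compile(r"[^\W\d_]+")


def candidate_words(text, min_len=10):
    """Return the lowercased words that are at least min_len characters long."""
    return [w.lower() for w in WORD_RE.findall(text) if len(w) >= min_len]


def flag_sections(sections, dictionary=FOREIGN_WORDS, min_len=10, already_flagged=()):
    """Return sorted indexes of sections that need manual follow-up.

    Sections listed in already_flagged are kept in the result but are not
    scanned again, so a later pass with a smaller min_len does less work.
    """
    flagged = set(already_flagged)
    for i, section in enumerate(sections):
        if i in flagged:
            continue  # already queued for manual review
        if any(w in dictionary for w in candidate_words(section, min_len)):
            flagged.add(i)
    return sorted(flagged)


sections = [
    "Please send email to the office.",
    "Pour tous renseignements, utilisez votre ordinateur.",
    "Envoyez un courriel.",
]

first_pass = flag_sections(sections, min_len=10)
second_pass = flag_sections(sections, min_len=8, already_flagged=first_pass)
```

The first pass only catches the section containing the long words. The second pass lowers the threshold to eight characters and picks up the shorter "courriel" without rescanning what is already flagged. The thresholds are as arbitrary here as they are in the bullets above, so expect to tune them against real pages.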
This is a really hard problem. Good luck.