
I think what you need is not a regular word list but a pronunciation dictionary - that is, one that lists the pronunciation for each word form. If you look each word up in such a dictionary, you are basically left with two cases:

• if the listed pronunciation follows the general German pronunciation rules, then the word is either German, or at least the text-to-speech converter will pronounce it correctly, so you don't need to mark it.

• if the listed pronunciation violates the German pronunciation rules, the word is probably foreign - and then you can check it against a dictionary of the corresponding language (see below).

Pronunciation lexicons have the additional advantage that they list word forms, not just base words, which eliminates the need for stemming. Of course, this works only because German spelling and its mapping to pronunciation are fairly regular.
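To make the two cases concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy `LEXICON`, the rough ASCII transcriptions, and especially `RULE_PREDICTED`, which is only a hand-written stand-in for a real German letter-to-sound rule set. A real system would take both from an actual pronunciation lexicon and rule engine.

```python
from typing import Callable, Dict, List

# Toy lexicon: word form -> listed pronunciations.
# The transcriptions are rough, hand-written ASCII, for illustration only.
LEXICON: Dict[str, List[str]] = {
    "haus": ["haUs"],
    "jeans": ["dZi:ns"],
    "email": ["e'mail", "'i:meIl"],
}

# Hypothetical stand-in for a German letter-to-sound rule set:
# the pronunciation the rules would predict from the spelling alone.
RULE_PREDICTED: Dict[str, str] = {
    "haus": "haUs",
    "jeans": "'je:ans",
    "email": "e'mail",
}


def follows_german_rules(word: str, pron: str) -> bool:
    """True if the listed pronunciation matches the rule-predicted one."""
    return RULE_PREDICTED.get(word) == pron


def classify(
    word: str,
    lexicon: Dict[str, List[str]] = LEXICON,
    is_regular: Callable[[str, str], bool] = follows_german_rules,
) -> str:
    """Sort a word form into one of the cases described above."""
    key = word.lower()
    prons = lexicon.get(key)
    if not prons:
        return "unknown"      # not in the lexicon: needs another criterion
    regular = [is_regular(key, p) for p in prons]
    if all(regular):
        return "regular"      # German, or at least pronounced correctly
    if not any(regular):
        return "irregular"    # probably foreign: check other dictionaries
    return "ambiguous"        # both kinds of pronunciation listed


for w in ["Haus", "Jeans", "Email", "Xylophon"]:
    print(w, classify(w))
```

The lookup uses the word form as written (only lowercased), so no stemming step is involved. Passing the rule check in as `is_regular` keeps the sketch independent of whichever rule set you actually use.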

For the words you don't find in your pronunciation dictionary, you can look at the transition probabilities of the letters: the probability that letter "x" is followed by letter "y" is very language-specific. If you calculate these probabilities from a large list of words for each of the languages in question, they provide a good criterion. This has the advantage that you will also be able to classify names - which normally don't appear in dictionaries.
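A rough sketch of the letter-transition idea, again in Python. The word lists here are tiny toy samples and the add-one smoothing with `alphabet_size=40` is my own assumption; in practice you would train on large word lists per language and tune the smoothing.

```python
import math
from collections import defaultdict


def train(words):
    """Count letter-to-letter transitions, with ^ and $ as word boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        s = "^" + w.lower() + "$"
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return counts


def log_prob(word, counts, alphabet_size=40):
    """Log-probability of a word under one model (add-one smoothing)."""
    s = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(s, s[1:]):
        row = counts.get(a, {})
        seen = row.get(b, 0)
        total += math.log((seen + 1) / (sum(row.values()) + alphabet_size))
    return total


def classify(word, models):
    """Return the language whose transition model scores the word highest."""
    return max(models, key=lambda lang: log_prob(word, models[lang]))


# Toy training data; far too small for real use.
models = {
    "german": train(["schule", "zwischen", "nicht", "machen",
                     "schwarz", "deutsch", "wasser", "ich"]),
    "english": train(["the", "which", "with", "through",
                      "would", "thought", "what", "where"]),
}

for w in ["schnell", "whether"]:
    print(w, classify(w, models))
```

Because the score depends only on letter sequences, it also gives an answer for names and other words that no dictionary lists. For short words the two scores can be close, so it may be worth treating a small difference as "undecided" rather than trusting the maximum.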

This leaves you only with the words that can be both German and foreign, e.g. "email". But my guess is that there will be only a few of them, and you can treat them manually (BTW, the pronunciation dictionary should give you two pronunciations of "email" - one that conforms to and one that violates the German pronunciation rules - so you should be warned).

You won't get around proofreading (at least samples) anyway. But I hope this will help you to minimize the amount of manual corrections.

pike

In reply to Re: detecting the language of a word? by pike
in thread detecting the language of a word? by domm
