I think what you need is not a regular word list but a pronunciation dictionary, that is, one that lists the pronunciation for each word form. If you check each word against it, you are basically left with two cases:

• if the pronunciation follows the general German pronunciation rules, then either the word is German or the text-to-speech converter will pronounce it correctly anyway, so you don't need to mark it.

• if the pronunciation violates the German pronunciation rules, the word is probably foreign, and you can then check it against a dictionary of the corresponding language (see below).

Pronunciation lexicons have the additional advantage that they list word forms, not base words, which eliminates the need for stemming. Of course, this works only because the mapping from German spelling to pronunciation is fairly regular.

For the words you don't find in your pronunciation dictionary, you can look at the transition probabilities of the letters: the probability that letter "x" is followed by letter "y" is very language specific. If you estimate these probabilities from a large word list for each language in question, you can score an unknown word under each language and pick the one that makes it most likely. This has the advantage that you will also be able to classify names, which normally don't appear in dictionaries.
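To make the transition-probability idea concrete, here is a minimal sketch in Python. Everything in it is my own illustration: the function names, the add-one smoothing, the `ALPHABET_SIZE` constant and the tiny training lists are assumptions, and real use would need large word lists per language.

```python
import math
from collections import Counter

# Assumed smoothing denominator: a rough upper bound on distinct symbols.
ALPHABET_SIZE = 40

def pad(word):
    # "^" and "$" mark word start and end so boundary letters count too.
    return "^" + word.lower() + "$"

def train(words):
    """Count letter-to-letter transitions in a word list."""
    pair_counts, first_counts = Counter(), Counter()
    for word in words:
        padded = pad(word)
        for a, b in zip(padded, padded[1:]):
            pair_counts[a, b] += 1
            first_counts[a] += 1
    return pair_counts, first_counts

def log_prob(word, model):
    """Log-probability of a word under one model, with add-one smoothing."""
    pair_counts, first_counts = model
    padded = pad(word)
    return sum(
        math.log((pair_counts[a, b] + 1) / (first_counts[a] + ALPHABET_SIZE))
        for a, b in zip(padded, padded[1:])
    )

def classify(word, models):
    """Return the language whose model gives the word the highest score."""
    return max(models, key=lambda lang: log_prob(word, models[lang]))

# Toy training data, far too small for real use.
models = {
    "german": train(["schule", "schnell", "zwischen", "strasse",
                     "deutsch", "machen", "nicht", "schreiben"]),
    "english": train(["the", "which", "through", "they",
                      "weather", "should", "thing", "three"]),
}

print(classify("schnee", models))
print(classify("thought", models))
```

Log-probabilities are summed instead of multiplying raw probabilities so that long words don't underflow. Since longer words always score lower, compare scores for the same word across languages, not across words.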

This leaves you only with the words that can be both German and foreign, e.g. "email". But my guess is that there will be only a few of them and you can treat them manually (BTW, the pronunciation dictionary should give you two pronunciations of "email", one that conforms to the German pronunciation rules and one that violates them, so you should be warned).
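The dictionary check, including the "email" warning, might look roughly like the following sketch. It assumes you have a lexicon mapping each word form to a list of pronunciations and some rule-based function that predicts the regular German pronunciation from the spelling. Both are hypothetical here: `toy_lexicon` and `toy_rules` are stand-ins with made-up placeholder transcriptions, not real phonetic data.

```python
def check_word(word, lexicon, predict_german):
    """Sort a word into regular / foreign / ambiguous / unknown.

    lexicon: dict mapping a word form to a list of pronunciations.
    predict_german: hypothetical function returning the pronunciation
    that the German spelling rules would give for this spelling.
    """
    form = word.lower()
    pronunciations = lexicon.get(form)
    if not pronunciations:
        return "unknown"      # not in the lexicon: try another method
    expected = predict_german(form)
    conforming = [p for p in pronunciations if p == expected]
    if len(conforming) == len(pronunciations):
        return "regular"      # no need to mark it
    if conforming:
        return "ambiguous"    # one regular and one irregular reading
    return "foreign"          # look it up in the other language

# Placeholder data for illustration only.
toy_lexicon = {
    "haus": ["haus"],
    "email": ["emai", "i:meil"],
    "computer": ["kompju:ter"],
}
toy_rules = {"haus": "haus", "email": "emai", "computer": "komputer"}

for w in ["Haus", "email", "computer", "Schmidt"]:
    print(w, check_word(w, toy_lexicon, toy_rules.get))
```

Words that come back as "unknown" are the ones to pass on to the letter-transition approach, and "ambiguous" ones go on the list for manual treatment.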

You won't get around proofreading (at least samples) anyway. But I hope this will help you to minimize the amount of manual corrections.


In reply to Re: detecting the language of a word? by pike
in thread detecting the language of a word? by domm
