PerlMonks  

I think what you need is not a regular word list but a pronunciation dictionary - that is, one that lists the pronunciation of each word form. If you look each word up there, you are basically left with two cases:

• if the pronunciation follows general German pronunciation rules, then the word is either German or the text-to-speech converter will pronounce it correctly anyway, so you don't need to mark it.

• if the pronunciation violates German pronunciation rules, the word is probably foreign - and then you can check with a dictionary of the corresponding language (see below).
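To make the two cases concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny rule table `TOY_RULES`, the ASCII phoneme notation, and the two-entry `LEXICON` are made up, and a real system would need a complete German letter-to-sound rule set and a real pronunciation lexicon. The idea is only to compare the lexicon's pronunciation with what the rules would predict.

```python
# Toy letter-to-sound rules (hypothetical, far from complete).
TOY_RULES = [("sch", "S"), ("ch", "x"), ("ei", "aI"), ("ie", "i:"),
             ("w", "v"), ("v", "f"), ("z", "ts")]
# Try longer letter groups first so "sch" wins over "ch".
TOY_RULES.sort(key=lambda rule: -len(rule[0]))

# Hypothetical lexicon: word form -> pronunciation in the same toy notation.
LEXICON = {"wein": "vaIn", "team": "ti:m"}


def rule_pronunciation(word):
    """Pronunciation predicted by the toy rules, scanning left to right."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        for letters, phones in TOY_RULES:
            if word.startswith(letters, i):
                out.append(phones)
                i += len(letters)
                break
        else:
            # No rule applies: assume the letter is pronounced as written.
            out.append(word[i])
            i += 1
    return "".join(out)


def classify(word, lexicon):
    """Return 'unknown', 'regular' (no markup needed) or 'foreign'."""
    listed = lexicon.get(word.lower())
    if listed is None:
        return "unknown"
    return "regular" if listed == rule_pronunciation(word) else "foreign"
```

With these toy tables, `classify("wein", LEXICON)` gives `'regular'`, `classify("team", LEXICON)` gives `'foreign'`, and a word missing from the lexicon comes back as `'unknown'`.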

Pronunciation lexicons have the additional advantage that they list word forms, not just base forms, which eliminates the need for stemming. Of course, this works only because German spelling and its mapping to pronunciation are fairly regular.

For the words you don't find in your pronunciation dictionary, you can look at the transition probabilities of the letters: the probability that letter "x" is followed by letter "y" is highly language-specific. If you estimate these probabilities from a large word list for each of the languages in question, they provide a good criterion: assign the word to the language whose probabilities make its letter sequence most likely. This has the advantage that you will also be able to classify names - which normally don't appear in dictionaries.
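A rough sketch of the transition-probability idea, again in Python. The function names, the `^`/`$` word-boundary markers, and the add-one smoothing are my own choices for the example, not something prescribed above, and the eight-word training lists are far too small for real use; you would train on a large word list per language.

```python
import math
from collections import Counter

# Tiny illustrative training lists (assumption: real lists would be large).
GERMAN_WORDS = ["schule", "schön", "zwischen", "straße",
                "wasser", "zeit", "schnell", "deutsch"]
ENGLISH_WORDS = ["the", "through", "with", "weather",
                 "thought", "which", "three", "that"]


def train_bigram_model(words):
    """Count letter-to-letter transitions, with ^ and $ as word boundaries."""
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = set()
    for word in words:
        padded = "^" + word.lower() + "$"
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts, alphabet


def log_score(word, model, vocab_size):
    """Average log probability per transition, with add-one smoothing."""
    pair_counts, context_counts, _ = model
    padded = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        prob = (pair_counts[(a, b)] + 1) / (context_counts[a] + vocab_size)
        total += math.log(prob)
    return total / (len(padded) - 1)


def guess_language(word, models):
    """Pick the language whose model makes the letter sequence most likely."""
    # Use one shared alphabet size so scores are comparable across languages.
    vocab_size = len(set().union(*(m[2] for m in models.values())))
    return max(models, key=lambda lang: log_score(word, models[lang], vocab_size))
```

Usage would look like `models = {"german": train_bigram_model(GERMAN_WORDS), "english": train_bigram_model(ENGLISH_WORDS)}` followed by `guess_language("schnitt", models)`. Smoothing matters here because a name or rare word will usually contain at least one transition that never occurred in the training list.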

This leaves you only with the words that can be both German and foreign - e.g. "email". But my guess is that there will be only a few of them, so you can treat them manually (BTW, the pronunciation dictionary should give you two pronunciations of "email" - one that conforms to German pronunciation rules and one that violates them - so you should be warned).
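Collecting those ambiguous entries for manual treatment could be sketched like this (Python). The lexicon layout (word form mapped to a list of pronunciations), the `conforms` predicate, and the transcriptions are all hypothetical placeholders; plug in whatever rule check you actually use.

```python
def words_for_manual_review(lexicon, conforms):
    """Return words that have both a rule-conforming and a rule-violating
    pronunciation. `lexicon` maps word -> list of pronunciations and
    `conforms(word, pronunciation)` returns True or False."""
    review = []
    for word, pronunciations in sorted(lexicon.items()):
        flags = [conforms(word, p) for p in pronunciations]
        if any(flags) and not all(flags):
            review.append(word)
    return review


# Hypothetical data: placeholder transcriptions, not from a real lexicon.
TOY_LEXICON = {
    "email": ["e:maIl", "i:meIl"],
    "haus": ["haUs"],
    "team": ["ti:m"],
}
TOY_CONFORMING = {("email", "e:maIl"), ("haus", "haUs")}


def toy_conforms(word, pronunciation):
    """Stand-in for a real German pronunciation rule check."""
    return (word, pronunciation) in TOY_CONFORMING
```

On the toy data, `words_for_manual_review(TOY_LEXICON, toy_conforms)` returns `['email']`: "haus" is purely regular, "team" is purely foreign, and only "email" needs a human decision.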

You won't get around proofreading (at least of samples) anyway. But I hope this will help you to minimize the amount of manual correction.

pike


In reply to Re: detecting the language of a word? by pike
in thread detecting the language of a word? by domm
