
This is hell. I don't really see how you can automate this. Maybe you should run some kind of wiki for the first weeks/months/years, so people can change the language themselves while reading. That way, every time someone reads a document, they can also check and improve it.

For the initial job, I would seriously think about making a short dictionary of common 'foreign' words. Words like 'email' must occur frequently and are easy to catch; rarely used foreign words are basically impossible to catch when they look too much like an existing word.

What I would do is try to find out the 'overall language' of a document. Words that don't comply with your language rules (e.g. a German database) are first checked against this 'common foreign words' database. If they don't match there (you have probably filtered out the larger part by now), add them to a list with a reference to the documents they are found in.

This list will only contain very rarely used foreign words, and you can handle them manually; thanks to the references, you only have to assign a language once per word instead of once for every occurrence.
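To make that a bit more concrete, here is a rough Python sketch of the triage step. Everything in it is made up for illustration: `COMMON_FOREIGN`, `triage`, `resolve` and the tiny French word set are hypothetical stand-ins, and a real setup would load proper word lists instead.

```python
import re
from collections import defaultdict

# Hypothetical 'common foreign words' db; a real one would be much larger.
COMMON_FOREIGN = {"email": "en", "computer": "en", "test": "en"}

# Runs of letters only (no digits, no underscore); accented letters included.
WORD_RE = re.compile(r"[^\W\d_]+")


def triage(doc_name, text, base_lexicon, unknown_index):
    """Tag words found in the common-foreign db; index the rest by document."""
    tagged = {}
    for word in WORD_RE.findall(text.lower()):
        if word in base_lexicon:
            continue  # fits the document's base language, nothing to do
        if word in COMMON_FOREIGN:
            tagged[word] = COMMON_FOREIGN[word]
        else:
            unknown_index[word].add(doc_name)  # keep the reference
    return tagged


def resolve(word, lang, unknown_index, decisions):
    """Apply one manual decision to every document that references the word."""
    for doc_name in unknown_index.pop(word, set()):
        decisions[(doc_name, word)] = lang


# Toy base lexicon (assumption): just enough French for the example.
french = {"dans", "un", "récent", "mon", "frère", "a", "écrit"}

unknown_index = defaultdict(set)
tagged = triage("blah.html", "Dans un email récent, mon frère a écrit. Gnarf.",
                french, unknown_index)
triage("woof.html", "Gnarf", french, unknown_index)

decisions = {}
resolve("gnarf", "de", unknown_index, decisions)
```

The point of keeping a set of document names per word is that a single `resolve` call covers every document where the word turned up.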


text blah.html:
Dans un email récent, mon frère a écrit "Was die Augen sehen, glaubt das Herz". Il doit l'avoir entendu quelque part. Gnarf.

analyze language --> 70% chance French
--> set base language to 'French'
--> 'email', 'gnarf' and the German sentence don't match

frequent words db email:en computer:en test:en
--> 'email' matches and has its language set to 'en'
--> 'gnarf' and the German sentence don't match, the search continues

try to spellcheck sentences in other languages
--> if you find two spelling mistakes in one sentence, you could try to match that whole sentence against another language; if the spell check returns (near) zero errors for a certain language, it's most likely that language.
--> 0 spelling errors for the German sentence with spellGerman, set that sentence to German.
--> 'gnarf' still unmatched

uncommon words and expressions db
--> everything that really doesn't make sense ends up here, with references to where that word is found:
"gnarf" blah.html woof.html foobar.txt

manual intervention
--> 'gnarf' set to German
--> all referenced documents with the word/expression are updated to the chosen language
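
The language-guessing steps of the walkthrough could look roughly like this in Python. It is only a sketch under heavy assumptions: the `LEXICONS` word sets stand in for real spell-checker dictionaries, the function names are invented, and "share of recognised words" is just one crude way to arrive at something like the "70% chance French" above.

```python
import re

WORD_RE = re.compile(r"[^\W\d_]+")

# Toy lexicons standing in for real spell-checker dictionaries (assumption).
LEXICONS = {
    "fr": {"dans", "un", "récent", "mon", "frère", "a", "écrit",
           "il", "doit", "l", "avoir", "entendu", "quelque", "part"},
    "de": {"was", "die", "augen", "sehen", "glaubt", "das", "herz"},
}


def words(text):
    return WORD_RE.findall(text.lower())


def misspellings(text, lang):
    """Count words that the lexicon for `lang` does not know."""
    return sum(1 for w in words(text) if w not in LEXICONS[lang])


def guess_base_language(text):
    """Return (lang, share of recognised words) for the best-matching lexicon."""
    total = len(words(text)) or 1
    shares = {lang: 1 - misspellings(text, lang) / total for lang in LEXICONS}
    best = max(shares, key=shares.get)
    return best, shares[best]


def guess_sentence_language(sentence, base_lang, max_errors=0):
    """Keep the base language unless the sentence has two or more errors in it
    and some other language checks (nearly) clean; None means 'ask a human'."""
    if misspellings(sentence, base_lang) < 2:
        return base_lang
    scores = {lang: misspellings(sentence, lang)
              for lang in LEXICONS if lang != base_lang}
    if not scores:
        return None
    best = min(scores, key=scores.get)
    return best if scores[best] <= max_errors else None
```

With these toy lexicons the German proverb comes back as `"de"`, the last French sentence stays `"fr"`, and a sentence that checks clean nowhere comes back as `None`, which is where the uncommon-words list and the manual step take over.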

Hmm. I hope this makes sense. :)

This job is one of the hardest possible to automate, because it requires AI, basically. It's not about 100% matching, but rather fuzzy matching, (human) logic and context. The computer actually has to make sense out of the documents. Good luck with that...

I really suggest some kind of wiki-thing too, though. It will make it so much easier if people who read the document can change the language on-the-fly in case of errors.

In reply to Re: detecting the language of a word? by december
in thread detecting the language of a word? by domm
