
Re: detecting the language of a word?

by jjdraco (Scribe)
on Dec 06, 2002 at 17:32 UTC (#218105)

in reply to detecting the language of a word?

My suggestion, and I could be wrong, is that you would have to look the word up in each of the language dictionaries you have; any word that comes up in more than one language would require user intervention to decide which language it is. Possibly, at the end of parsing all the files, you get a summary of all the words that are undetermined, with the sentence each word was found in and the languages it was found in; then the user just has to decide which one it is.
It's not a very optimized approach, but at least the way I see it, it should work.
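
A minimal sketch of the lookup described above, in Python. The word lists are made-up toy data standing in for real dictionaries, and the names `DICTIONARIES`, `languages_of`, and `classify` are hypothetical, chosen only for illustration. A word found in exactly one dictionary is resolved automatically; anything found in several dictionaries, or in none, is collected with its sentence for the user to decide.

```python
# Toy word lists (assumption): real dictionaries would be loaded from files.
DICTIONARIES = {
    "german": {"hund", "radio", "kind"},
    "french": {"chien", "radio", "enfant"},
    "italian": {"cane", "radio", "bambino"},
}


def languages_of(word, dictionaries=DICTIONARIES):
    """Return the sorted list of languages whose dictionary contains word."""
    w = word.lower()
    return sorted(lang for lang, words in dictionaries.items() if w in words)


def classify(sentences, dictionaries=DICTIONARIES):
    """Split words into automatically resolved and undetermined ones.

    Returns (resolved, undetermined), where resolved maps word -> language
    and undetermined is a list of (word, candidate_languages, sentence).
    """
    resolved = {}
    undetermined = []
    for sentence in sentences:
        for raw in sentence.split():
            word = raw.strip('.,;:!?"()').lower()
            if not word:
                continue
            langs = languages_of(word, dictionaries)
            if len(langs) == 1:
                resolved[word] = langs[0]
            else:
                # Zero hits (unknown or misspelled) or several hits (ambiguous).
                undetermined.append((word, langs, sentence))
    return resolved, undetermined


if __name__ == "__main__":
    resolved, undetermined = classify(["Der Hund hört Radio."])
    print(resolved)
    for word, langs, sentence in undetermined:
        print(f"{word!r}: candidates {langs} in {sentence!r}")
```

Every word costs one set lookup per dictionary, so the cost grows with the number of languages; that is the "not very optimized" part, but set membership keeps each individual lookup cheap.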

learning Perl one statement at a time.

Replies are listed 'Best First'.
Re: detecting the language of a word?
by Abigail-II (Bishop) on Dec 06, 2002 at 18:36 UTC
    No, I don't think that's the right approach. There are a lot of words in the documents, and some of them will be misspelled. You don't want a misspelled word to be flagged as an Italian word just because that misspelling happens to be a random Italian word that doesn't exist in German or French.


      I had thought about the misspelling case after I made the post, and I was thinking along the lines that a misspelled word wouldn't show up in any list, so it would be up to the user to decide. But you're right, there is the possibility that it's the correct spelling of a word in another language. No matter what the original poster does, the documents are going to have to be proofread by hand to check for any such mistakes.

      learning Perl one statement at a time.
        Though no perfect solution exists, a workable solution is better than no solution. With some modification the process offered by jjdraco can be made more reliable.

        1. Use the other language dictionaries to strip all words that also appear in other languages from the German dictionary, and place them in a secondary German dictionary. This leaves the primary dictionary with only uniquely German words, and all occurrences of those words can be safely ignored.

        2. Your processor should have two modes: a reporting mode and an inspection/correction mode. In reporting mode your processor will simply run over the document, gathering information about words that are not in the primary dictionary. Have it report statistics on running time, how many matches were made, and the most common matches. Using this you can ensure your checker runs in a reasonable amount of time and doesn't provide a prohibitively large number of words for inspection. You will also be able to look at the most common matches to see if they can be reliably processed in an automatic way with some additional scripts. If so, then you can probably eliminate a large percentage of the words that would otherwise have to be inspected. Repeat this step until you have taken out all you can automatically.

        3. Run the processor in inspection mode so each non-match can be found and edited. Have the processor use the secondary dictionaries to offer the inspector a choice between automatic entries and editing the word manually.
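
        Steps 1 and 2 could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: the function names `split_dictionary` and `report` are hypothetical, the word sets are toy data, and "matches" is read here as the words flagged for inspection because they are not in the primary dictionary.

```python
import time
from collections import Counter


def split_dictionary(target, others):
    """Step 1: split target into (primary, secondary) word sets.

    primary holds words unique to the target language; secondary holds
    words that also appear in at least one of the other dictionaries.
    """
    shared = set()
    for words in others.values():
        shared |= target & words
    return target - shared, shared


def report(words, primary, top_n=5):
    """Step 2, reporting mode: gather statistics without changing anything."""
    start = time.perf_counter()
    flagged = Counter()
    matched = 0
    for w in words:
        w = w.lower()
        if w in primary:
            matched += 1          # uniquely in the target language, ignore
        else:
            flagged[w] += 1       # would need inspection in step 3
    return {
        "seconds": time.perf_counter() - start,
        "matched": matched,
        "flagged_total": sum(flagged.values()),
        "most_common": flagged.most_common(top_n),
    }


if __name__ == "__main__":
    german = {"hund", "radio", "kind"}                      # toy data
    others = {"french": {"radio", "chien"}, "italian": {"radio"}}
    primary, secondary = split_dictionary(german, others)
    print(sorted(primary), sorted(secondary))
    print(report(["Hund", "Radio", "radio", "xyz"], primary))
```

        The `most_common` list is what you would look over between runs to decide which flagged words can be handled by an extra automatic rule before anyone inspects them by hand.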

        The early worm gets the bird.
