http://www.perlmonks.org?node_id=218088

domm has asked for the wisdom of the Perl Monks concerning the following question:

I'm starting to work on a rather big project, which involves converting some 18,000 HTML documents of sometimes dubious quality into w3.org-validatable, accessible HTML 4.01. The client is a city government in Austria that wants to / has to comply with Level A of the W3C WAI specifications (http://www.w3.org/WAI/). Most of this work will be done by an HTML::Parser-based parser.

Currently, the trickiest part seems to be language detection, and therefore I seek some wisdom:

The content is basically in German, but it is interspersed with some foreign words, mostly English, e.g. "email". All foreign words should be marked up using something like <span lang='en'>. The reason for this is that browsers with voice output need to know whether a word should be pronounced the standard way (i.e. German) or somewhat differently (i.e. English).

E.g.: if you pronounce "email" as if it were a German word, it sounds like the German word for "enamel", which is "Email" (BTW, enamel is this stuff: http://www.artlex.com/ArtLex/e/enamel.html).

So, how can I decide if a given word is German, English, French or Italian?

My best idea so far is to find some dictionary files for each language and check whether the word is in one of them. For performance reasons, I'm planning to put the dicts into an SQL database (or maybe a DB file? - but I know SQL better, so..) and maybe implement some caching.
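For illustration, here is a minimal sketch of the DB-file variant using the DB_File module that ships with Perl. The file names and the one-word-per-line dictionary format are assumptions, not part of the plan above:

    use strict;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # one-off step: build a Berkeley DB hash from a plain word list
    tie my %dict, 'DB_File', 'german.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie german.db: $!";
    open my $list, '<', 'german.txt' or die "Cannot open german.txt: $!";
    while (<$list>) {
        chomp;
        $dict{ lc $_ } = 1;
    }
    close $list;

    # afterwards: cheap lookups without loading the whole list into RAM
    print "known German word\n" if exists $dict{'haus'};

The tied hash gives you dictionary-style lookups from disk with OS-level caching, which may make the SQL server unnecessary.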

I couldn't find anything suited to this task on CPAN...

I can probably also use some sort of non-Perl solution, as long as it's free and runs on Linux.

Any pointers/comments about

• some useful software/libraries
• my general approach

would be very much appreciated.
-- #!/usr/bin/perl for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}

Re: detecting the language of a word?
by Abigail-II (Bishop) on Dec 06, 2002 at 16:56 UTC
    There's no way you can decide whether a word is English, German, French, Italian, or whatever other language based on the word alone. Think about it - if it were that simple, there would be no reason to mark up the words to aid the speech software - the software would already know.

    You can't use a simple lookup in a database; there are words in English that exist in other languages as well - sometimes with a totally different meaning. If you are going to automate this (and this isn't an easy task; people have been working on it for decades), you will have to look at the context. You will have to parse the sentences and actually understand each word. And even then you might have problems with sentences like 'time flies like an arrow'.

    I suggest that you research the literature on computational linguistics, machine translation, and speech software.

    Abigail

      There's no way you can decide whether a word is English, German, French, Italian, or whatever other language based on the word alone.

      While this statement is true, you can take a damn good stab at the language from the words, plural. In the post below I suggest processing the doc using hash table lookups and counting the instances of putative German, English, French, and Italian words. After this is complete, you examine the word counts for each language. As the document length increases, so does the probability that the language with the highest word count is the document's language. In fact, for docs over a few dozen words you can be almost certain. So you process the document assuming it is German (the default language), and reprocess it using the correct language as the dominant template if you were wrong.

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

        But the goal isn't determining the language of the document. The main language is given - German. The goal is to pick out the individual words that are not German, and mark them accordingly.

        Abigail

Re: detecting the language of a word?
by tachyon (Chancellor) on Dec 06, 2002 at 17:18 UTC

    With 18,000 pages of 300+ words each, that is over 5 million words to process. Provided you have the memory, by far the fastest thing to do will be to put the word lists into hashes in memory. You would then do something like this:

        my $german  = get_lang_hash('german.txt');
        my $english = get_lang_hash('english.txt');
        my $french  = get_lang_hash('french.txt');
        my $italian = get_lang_hash('italian.txt');

        my $new_text = '';
        for my $word ( split /\b/, $text ) {
            my $lang = check_word($word);
            $new_text .= $lang ? qq!<span lang="$lang">$word</span>! : $word;
        }

        sub check_word {
            my ($word) = @_;
            return ''   if $german->{$word};    # German: no markup needed
            return 'en' if $english->{$word};
            return 'fr' if $french->{$word};
            return 'it' if $italian->{$word};
            return '';                          # unknown: treat as German
        }

        sub get_lang_hash {
            my $dict = shift;
            my %hash;
            open DICT, $dict or die $!;
            while (<DICT>) {
                chomp;
                $hash{$_}++;
            }
            close DICT;
            return \%hash;
        }

    By splitting on word boundaries we will pass punctuation to the check_word() sub, but it should not find a match and thus just return ''. The return order in check_word() determines our preference: if it could be German, we assume it is; if not, we see whether it could be English, French, or Italian, in that order. If we don't know what it is, we call it German and press on.

    You should modify this code to count the number of putative German, English, French, and Italian words in a document. If you find that the English count is >> the German count, you would reprocess the document with a different check_word() function in which the priority order is changed so that English is returned first... and the same for each of the other languages.
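    A sketch of that counting pass, reusing the hashes from the snippet above (the tie-breaking here is just an illustration):

        my %count = ( de => 0, en => 0, fr => 0, it => 0 );
        for my $word ( split /\b/, $text ) {
            next unless $word =~ /\w/;          # skip punctuation and whitespace
            $count{de}++ if $german->{$word};
            $count{en}++ if $english->{$word};
            $count{fr}++ if $french->{$word};
            $count{it}++ if $italian->{$word};
        }
        # the language with the highest putative word count wins
        my ($dominant) = sort { $count{$b} <=> $count{$a} } keys %count;
        # if $dominant is not 'de', reprocess with a check_word()
        # that tests the dominant language first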

    You can get an extensive list (250,000 words) of English words as a flat-file word list from http://www.puzzlers.org/secure/wordlists/dictinfo.php. The puzzle people seem to have these lists easily and freely available as text files. I presume the same applies for languages other than English.

    Any sort of database means disk reads, which will be hundreds or thousands of times slower than an in-memory hash table lookup. With memory so cheap and time so expensive....

    Regardless of what you do, you want your word lists to be as complete as possible, and do any pre-processing before you start on the text.

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: detecting the language of a word?
by jreades (Friar) on Dec 06, 2002 at 18:13 UTC

    Many of these technical approaches have merit; however, as one very astute monk pointed out, if the problem were easy (or even automatable), Babelfish wouldn't be so hilariously inadequate.

    In terms of approach, I think that you can consider several different lines of attack that would allow you to automate most of the markup, if not all of it:

    1. Identify key foreign words for automatic flagging (can you always assume that the word "email" indicates an email address?). In doing your research, you've probably identified the basic words that should get caught in order to avoid howlers. Work on that list to make sure you're not missing anything obvious.
    2. Look for patterns in foreign word usage. This will require more intuition than anything else, but I would guess, again, that you are beginning to develop a feel for where foreign words are likely to occur. Use automated tools to look for and flag those pages/sections for manual follow-up.
    3. In my very limited experience, I would guess that these types of words will tend to occur in 1) headers, 2) footers, and 3) business and IT terminology. Headers and footers are where you are likely to find contact information, and business and IT terminology tends to be dominated by English (despite the ongoing French crusade to use the word ordinateur).
    4. If you think that you need to mount what is essentially a dictionary attack, then, to my mind, you need to look at ways to streamline the attack. Could you start off by making the (admittedly arbitrary) decision that words of fewer than five characters are either 1) not in a foreign language, or 2) not significant enough to be worth looking up in a foreign language? This could rapidly reduce the number of lookups that you need to do on any given page (a sketch of this filter follows the list).
    5. Or, you could again take a contextual approach and mount a dictionary attack based on words of, say, ten characters or more, working from the assumption that foreign words will occur in clumps and that at least one of those words will be more than nine characters long. Then you do manual follow-up for sections flagged by a ten-character foreign word. Over time, you could streamline your parser to ignore sections already flagged as containing a foreign language and gradually reduce the length of the words that you examine for foreign content.
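    As a sketch of the length filter from point 4 (the five-character cutoff is the arbitrary assumption named above):

        for my $word ( split /\b/, $text ) {
            # assume short words are not worth a foreign-language lookup
            next if length($word) < 5;
            # ... dictionary lookups only for the words that remain ...
        }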

    This is a really hard problem, good luck.

Re: detecting the language of a word?
by hardburn (Abbot) on Dec 06, 2002 at 17:13 UTC

    All I can say is: good luck. There are probably enough words in different languages that are spelled exactly the same but have vastly different meanings and pronunciations that you'll have a noticeably high error rate. If you're trying to get the language of an entire document (assuming the language wasn't explicitly set in a META tag or something), you might be able to take lots of words within the text and home in on a single language. Trying to get a single word is probably a lot harder.

    You might be able to home in on a language based on the character set being used. Certain languages (particularly the Scandinavian ones) tend to have specific characters that no one else has. Asian languages also have completely different glyphs from each other.
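    A rough sketch of that character-based hint. The character classes below are illustrative, far from exhaustive, and assume a Latin-1 document; some characters (e.g. accented vowels in loanwords) do of course appear in more than one language:

        # return a language hint based on characters peculiar to a language
        sub charset_hint {
            my ($word) = @_;
            return 'de' if $word =~ /[äöüÄÖÜß]/;        # German umlauts, sharp s
            return 'fr' if $word =~ /[çœàâèêëîïôûù]/;   # French accents
            return 'da' if $word =~ /[åøæÅØÆ]/;         # Danish/Norwegian letters
            return '';                                   # no hint
        }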

Re: detecting the language of a word?
by rasta (Hermit) on Dec 06, 2002 at 16:44 UTC
    I doubt that such dictionary files exist, since you would probably have to allow for all word forms.
    I guess you could spellcheck the overall text in one language and then check all rejected words against the other languages.
    I believe Text::Pspell could help you if you choose this approach.
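    A minimal sketch of that reject-then-recheck idea, written against the Text::Aspell interface (the successor to Text::Pspell); it assumes the aspell library plus the de/en/fr/it dictionaries are installed:

        use Text::Aspell;

        # one speller per candidate language
        my %speller;
        for my $lang (qw(de en fr it)) {
            my $s = Text::Aspell->new;
            $s->set_option( 'lang', $lang );
            $speller{$lang} = $s;
        }

        # '' means "German or unknown": leave the word unmarked
        sub guess_lang {
            my ($word) = @_;
            return '' if $speller{de}->check($word);
            for my $lang (qw(en fr it)) {
                return $lang if $speller{$lang}->check($word);
            }
            return '';
        }

    The nice property of a spellchecker over a raw word list is that it already knows about inflected word forms.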

    -- Yuriy Syrota

      I doubt that such dictionary files exist, since you would probably have to allow for all word forms.

      Very complete word lists do exist and are freely available from puzzlers' websites like this one, where you can get a list of 250,000 words, or from underground hacking sites, where they are used for dictionary attacks on password databases. I will leave it to you to investigate these ;-)

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: detecting the language of a word?
by fruiture (Curate) on Dec 06, 2002 at 18:40 UTC

    I also think the ambiguity itself is the problem. Take "Email": it is indeed a German word that exists in the DUDEN (1998); it is equivalent to Emaille, which _is_ enamel. It is completely correct to interpret it that way. You can't tell for sure whether an occurrence of "Email" means electronic mail or that enamel stuff.

    You'll need to define a limited set of words to be replaced beforehand ("which English terms are known to be used frequently?"), and replacing them during the parser run won't be a problem.

    It will also help, and be in the spirit of the WAI, to put a disclaimer on the home page that asks users to report aural-rendering problems...

    --
    http://fruiture.de
Re: detecting the language of a word?
by adrianh (Chancellor) on Dec 06, 2002 at 20:39 UTC
    Any pointers/comments about
    • some useful software/libraries
    • my general approach

    I've done a fair amount of accessibility work, so some general pointers:

    • I'd seriously consider going for XHTML rather than HTML 4.01... if you're starting from dodgy HTML it won't be that much more work, and having the content in XML will make future site changes and content manipulation easier.
    • For your bulk work take a good look at tidy before you spend a lot of time coding a custom perl solution. It will almost certainly do most of what you need.
    • You won't be able to completely automate your translation work - you'll need to have a human in the loop. For example there are cases where you can have the same word in multiple languages, sometimes with different meanings.
    • How is your final site being audited for WCAG conformance? This cannot be fully automated, since some of the checkpoints rely on human judgement - so make sure you have the audit process sorted before you start. Otherwise you may find yourself facing impossible goals.

    Also, if it's not already in one, check the site into some kind of source control system. You will want a log of the changes at some point during the process.

Re: detecting the language of a word?
by PhiRatE (Monk) on Dec 07, 2002 at 14:02 UTC
    OK, a couple of points. Anyone who has read a few of my posts knows that I'm a complete nut for SQL; however, in this case I don't really think it'll help you. The search set of words is primarily static and would do pretty well in a standard hash.

    To your main problem, however, I make a few observations. Firstly, I'm not sure you're ever going to get a 100% accurate method, so if you require that, stop thinking about 100% automatic solutions right now. If, on the other hand, 95-99% is OK, then we can proceed.

    Secondly, I believe the first step always has to be to deduce the primary language in use. This can either be pre-set, if you're sure all the documents are basically German, or you can do a fairly easy determination based on word percentages against language dictionaries.

    The final part, determining which words are foreign, submits nicely to what I call the "shotgun" method. In this method, we take a bunch of good ideas and just apply them all; some of those below have been suggested by others here, some are available in various language texts on the web:

    1. Dictionary scan. Locate all words that are not in a German dictionary, and see if they are in another language dictionary. If they're in one other, flag as that language and move on. If they're in several, note this and continue with the rest of the tests.

    2. Digram/trigram scan. A trick borrowed from crypto: the idea is to take an equivalent set of pure German documentation and generate a table of two- and three-word combos, together with their probability of appearing within a pure German text. Taking this table and applying it in brute-force fashion across the entirety of the target text should reveal valid German words that are nonetheless out of position, and therefore may well be identically spelt words in another language. A dictionary check between the German and other dictionaries could then confirm this. (A rough sketch appears further below.)

    3. If you find that the two above aren't getting you enough accuracy, further crypto/language-analysis tricks can be employed, including sentence-position statistics, form statistics (the word has a capital first letter, suggesting it may be a name rather than a regular word), basic sentence-structure stats (the word comes after a known verb, with a suggestion of plural, yet uses an English-like plural suffix...), foreign-language text analysis (this phrase that has turned up in my document has also turned up in a lot of English trade documents...), etc.

    ..as you can see, you can get as complex as you like here to boost the percentage of correctness. I imagine, however, that the first two tricks in concert, combined with the biggest dictionaries you can find (preferably including common names, places, etc.) and a good source of modern pure German text for the digram/trigram generation, will provide all the accuracy you're likely to need. It should fit in RAM in a regular associative array on a decent machine, and even if it doesn't, the paging penalty shouldn't be too great in general. All the above methods should abstract to a database if some kind of dynamic nature is thought to be necessary.
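    A rough sketch of the word-digram scan from point 2. The corpus file name is an assumption, and a serious version would use smoothed probabilities rather than raw zero counts:

        # build digram counts from a pure-German reference corpus
        my %digram;
        open my $fh, '<', 'german_corpus.txt' or die $!;
        while (<$fh>) {
            my @w = grep { length } map { lc } split /\W+/;
            $digram{"$w[$_] $w[$_+1]"}++ for 0 .. $#w - 1;
        }
        close $fh;

        # flag words whose neighbouring digrams never occur in the corpus:
        # they may be German-looking words used in a foreign context
        my @words = grep { length } map { lc } split /\W+/, $text;
        for my $i ( 1 .. $#words - 1 ) {
            my $before = $digram{"$words[$i-1] $words[$i]"} || 0;
            my $after  = $digram{"$words[$i] $words[$i+1]"} || 0;
            print "suspicious: $words[$i]\n" if $before + $after == 0;
        }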

    Do not be afraid to think outside the box with the shotgun method; any and all tests you can add, no matter how weird, may help (if I do a search on this word/phrase in Google, do I get a high percentage of German-language pages in response?).

Re: detecting the language of a word?
by december (Pilgrim) on Dec 07, 2002 at 08:50 UTC

    This is hell. I don't really see how you can automate this - maybe you should run some kind of wiki for the first weeks/months/years so people can fix the language markup themselves while reading. That way, every time someone reads a document, they could also check and improve it.

    For the initial job, I would seriously think about making a short dictionary of common 'foreign' words. Words like 'email' must occur frequently and are easy to catch; rarely used foreign words are basically impossible to catch when they look too much like an existing word.

    What I would do is try to find out the 'overall language' of a document. Words that don't match your language rules (e.g. the German database) first get checked against this 'common foreign words' database. If they don't match there either (you have probably already filtered out the larger part by now), add them to a list with a reference to the documents they are found in.

    This list will only contain very rarely used foreign words, and you can do them manually; the references mean you only have to assign a language once for every occurrence.

    e.g.

    text blah.html:
    Dans un email récent, mon frère a écrit "Was die Augen sehen, glaubt das Herz". Il doit l'avoir entendu quelque part. Gnarf.

    analyze language --> 70% chance French
    --> set base language to 'French'
    --> 'email', 'gnarf' and the German sentence don't match

    frequent words db: email:en computer:en test:en
    --> 'email' matches and has its language set to 'en'
    --> 'gnarf' and the German sentence don't match, the search continues

    try to spellcheck sentences in other languages
    --> if you find two spelling mistakes in one sentence, you could try to match that whole sentence against other languages; if the spelling check returns (near) zero errors for a certain language, it's most likely that language.
    --> 0 spelling errors for the German sentence with spellGerman, so set that sentence to German.
    --> 'gnarf' still unmatched

    uncommon words and expressions db
    --> everything that really doesn't make sense ends up here, with references to where that word is found:
    "gnarf" blah.html woof.html foobar.txt

    manual intervention
    --> 'gnarf' set to German
    --> all referenced documents with the word/expression are updated to the chosen language

    Hmm. I hope this makes sense. :)

    This job is one of the hardest possible to automate, because it basically requires AI. It's not about 100% matching, but rather about fuzzy matching, (human) logic, and context. The computer actually has to make sense of the documents. Good luck with that...

    I really suggest some kind of wiki-thing too, though. It will make it so much easier if people who read the document can change the language on-the-fly in case of errors.

Re: detecting the language of a word?
by pcs305 (Novice) on Dec 06, 2002 at 17:56 UTC
    There are usually a set of terms and words assimilated from other languages that are common and standard.

    (I am talking about Afrikaans, which is kind of like Dutch. We use a lot of English, French, and other languages' words and terms.)

    You should be able to get a list of these commonly used words and terms, rather than using a complete English dictionary, and use that list to check whether a word exists.

    The drawback is that someone will have to maintain this table of words. Usually the language institutes or universities have lists like that.

    Good Luck
    Ian

Re: detecting the language of a word?
by mooseboy (Pilgrim) on Dec 06, 2002 at 20:27 UTC

    Hmm... this is decidedly non-trivial. One extra thing to bear in mind is that the German spoken in Austria (where I happen to live) is very different from the German spoken in Germany. In addition, there are lots of regional dialects within Austria, so pretty much any 'standard' German word list you might use will likely lack whatever Austrian dialect words appear in the original German. That being the case, you'll probably need a supplementary list of the dialect terms, at the very least.

    If you can tell us which city it is and/or give us a URL, I might be able to offer some further pointers, but it's pretty much inevitable that whatever approach you adopt will be laden with traps for the unwary. Good luck anyway!

    Cheers, mooseboy

Re: detecting the language of a word?
by pg (Canon) on Dec 06, 2002 at 17:59 UTC
    One reminder: artificial intelligence as a whole is a failure, although I don't deny there have been some limited successes in a very limited number of particular areas. Machine translation is one of those things we wanted to achieve with artificial intelligence, and failed. Now you are talking about something probably even bigger than machine translation.

    This has nothing to do with Perl; no language, no tool, no project so far has achieved this, and in recent years the fever has been dying down.

    There are lots of fundamental problems; just to give two examples:
    1. When do you look up the dictionaries? Say you have a stream of words, the majority English with some German. You see a word: is it German or English? What if it exists in both languages? Will you then also check the grammar to make a determination, and see which fits better in the overall sentence? It is getting bigger, isn't it?
    2. You see a word and suspect it is not English; will you look it up in a German dictionary, or a Spanish dictionary?
      One reminder: artificial intelligence as a whole is a failure, although I don't deny there have been some limited successes in a very limited number of particular areas. Machine translation is one of those things we wanted to achieve with artificial intelligence, and failed.

      Somebody should really tell all those people working away in AI research around the world that they're wasting their time then :-)

      AI has certainly been wildly over-hyped at various times - but tons of useful stuff has come out of it, and still does. The problem is that as soon as something becomes popular, people stop classifying it as AI. When I was a student, GAs, expert systems, GPSG parsers, neural nets, etc. were all AI. Now they're mainstream :-)

      <Adrian briefly considers his AI degree, sighs, and goes back to writing some perl>

      "One reminder, artificial intelligence as whole is a failure..."

      Was there a time limit for AI that we somehow missed? The popularity of things comes and goes; that doesn't mean that no one is still working on them.

      (And I agree with point 1, but the answer to point 2 is "yes".)

Re: detecting the language of a word?
by BrowserUk (Patriarch) on Dec 06, 2002 at 19:37 UTC

    Depending upon the number of foreign words you're looking at, it might be better to run through your files verifying the words against the dictionary for the predominant language, and flag any that do not show up.

    You could write the names of the files to a "pages to check" file, and wrap the words in something glaringly obvious (like the hated <blink> tags :). Then you (or your native-language editor) could look at the suspect words in context and make a decision based on that. Of course, that won't help you with words like your example that have meanings in several different languages.

    Probably the best way to deal with that is to also flag any words that show up in more than one language dictionary.
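    A sketch of that flagging pass, reusing the per-language hashes from tachyon's snippet above. The CHECKLIST handle, the $file variable, and the <blink> wrapper are assumptions following the joke above:

        my ( $marked, $flagged ) = ( '', 0 );
        for my $word ( split /\b/, $text ) {
            if ( $word =~ /\w/ ) {
                my $hits = grep { $_->{$word} }
                           ( $german, $english, $french, $italian );
                # flag words missing from the main dictionary
                # or found in more than one dictionary
                if ( !$german->{$word} or $hits > 1 ) {
                    $marked .= "<blink>$word</blink>";
                    $flagged++;
                    next;
                }
            }
            $marked .= $word;
        }
        print CHECKLIST "$file\n" if $flagged;   # the "pages to check" list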

    I think that if performance is any kind of issue, you should probably avoid storing your dictionaries in an SQL database. However, using the DBI interface to one of the flat-file databases, you can achieve some pretty amazing performance, as was proved to me by grantm in the thread Fast wordlist lookup for game.


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end, and "Yes", if you get allocated a grey one they are a bit damp underfoot, but someone has to get them.
    Get used to the wings fast 'cos it's an 8 hour day... unless the Governor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: detecting the language of a word?
by jjdraco (Scribe) on Dec 06, 2002 at 17:32 UTC
    My suggestion, and I could be wrong, is that you would have to try to find the word in each of the language dictionaries you have, and any word that came up in more than one language would require user intervention to decide what language it is. Possibly, at the end of parsing all the files, you get a summary with all the words that are undetermined, the sentence each word was found in, and the languages it was found in; then the user just has to decide which one it is.
    It's not a very optimized approach, but at least the way I see it, it should work.

    jjdraco
    learning Perl one statement at a time.
      No, I don't think that's the right approach. There are a lot of words in the documents, and there will be misspelled words. You don't want a misspelled word to be flagged as an Italian word just because that misspelling happens to be a random Italian word that doesn't exist in German or French.

      Abigail

        I had thought about the misspelling case after I made the post, and I was thinking along the lines that such a word wouldn't show up in any list and it would then be up to the user to decide. But you're right, there is the possibility that it's the correct spelling of a word in another language. No matter what the original poster does, the documents are going to have to be proofread by hand to check for any such mistakes.

        jjdraco
        learning Perl one statement at a time.
Re: detecting the language of a word?
by Anonymous Monk on Dec 06, 2002 at 20:43 UTC
    No matter how good your list of words, it won't be good enough. However, it would be interesting to look at an AI approach that used the word lists along with some knowledge of grammar. I know a previous poster said AI is a failure, but I think an AI approach may serve you well here. Some kind of neural network might actually do a decent job. Between knowing that most words will be German, having word lists, etc., I feel you could put together a relatively successful neural net to guess the language of a word. Of course it won't be error-free, but I bet whatever your solution to this hum-dinger is, it will contain some errors. Good luck! This is a terrific problem and I'd love to know how you solve it.
Re: detecting the language of a word?
by pike (Monk) on Dec 09, 2002 at 10:20 UTC
    I think what you need is not a regular word list but a pronunciation dictionary - that is, one that lists the pronunciation of each word (form). If you check this, you are basically left with two cases:

    • if the pronunciation follows general German pronunciation rules, then the word is either German, or at least the text-to-speech converter will pronounce it correctly, so you don't need to mark it.

    • if the pronunciation violates German pronunciation rules, the word is probably foreign - and then you can check it against a dictionary of the corresponding language (see below).

    Pronunciation lexicons have the additional advantage that they list word forms, not words, which eliminates the need for stemming. Of course, this works only because German spelling and its mapping to pronunciation is fairly regular.

    For the words you don't find in your pronunciation dictionary, you can look at the transition probabilities of the letters: the probability that letter "x" is followed by "y" is very language-specific. If you calculate these probabilities from a large list of words for each language in question, they provide a good criterion. This has the advantage that you will also be able to classify names - which normally don't appear in dictionaries.
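    A sketch of that letter-transition idea. The word-list file name and the crude floor for unseen pairs are assumptions; a real version would smooth properly and train one table per language for comparison:

        # train letter-bigram probabilities from a German word list
        my ( %pair, %first );
        open my $fh, '<', 'german.txt' or die $!;
        while (<$fh>) {
            chomp;
            my @c = split //, lc $_;
            for my $i ( 0 .. $#c - 1 ) {
                $pair{"$c[$i]$c[$i+1]"}++;
                $first{ $c[$i] }++;
            }
        }
        close $fh;

        # average log P(next letter | current letter); very un-German
        # words score far below the typical range for dictionary words
        sub german_score {
            my @c = split //, lc shift;
            return 0 unless @c > 1;
            my $log = 0;
            for my $i ( 0 .. $#c - 1 ) {
                my $n = $pair{"$c[$i]$c[$i+1]"};
                $log += log( $n ? $n / $first{ $c[$i] } : 1e-5 );
            }
            return $log / ( @c - 1 );
        }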

    This leaves you only with the words that can be both German and foreign - e.g. "email". But my guess is that there will be only a few of them and you can treat them manually (BTW, the pronunciation dictionary should give you two pronunciations of "email" - one that conforms to and one that violates German pronunciation rules - so you would be warned).

    You won't get around proofreading (at least samples) anyway. But I hope this will help you to minimize the amount of manual corrections.

    pike

Re: detecting the language of a word?
by domm (Chaplain) on Dec 09, 2002 at 08:46 UTC
    Oh, wow, I've been away for the weekend, and now I've got a whole lot of very useful replies to my question!

    ++ to everyone who shared her/his ideas!

    I cannot comment on every reply right now, but I'm planning to write some sort of paper about how we will tackle this problem later this week...

    -- #!/usr/bin/perl for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}