How I Created a Catalan-English Dictionary from a Spanish-English Dictionary Using Only String::Approx and Approximately 500 grams of Scots Tablet
by Willard B. Trophy (Hermit)
on Oct 17, 2002 at 19:17 UTC
I used to program for a dictionary company in Scotland. We had received EU money to produce a Catalan-English dictionary, but the only electronic Catalan resource we had was a word list. We had to produce a dictionary framework to match our Spanish dictionary in double-quick time, ready for our Catalan translators to turn into a finished text. But how? I'm no linguist.
I noticed that Catalan looks a bit like Spanish, but with French word endings (and if that statement doesn't get me a visit from the Catalonian death squad, nothing will). If you fiddle with the ends of the words, you get something that looks almost, but not quite, like Spanish. This goes against nearly all linguistic theory, but seems to work.
Computers don't do "almost" too well. While casting about for an approximate solution, I found the University of Arizona's agrep utility, which does approximate searching. Building a shell script around agrep to produce possible matches sort-of worked, but was painfully slow. Conveniently, CPAN librarian Jarkko Hietaniemi had just come out with a new version of String::Approx, which basically did what agrep did, but in a Perl module, and allowed a bit more control of the fuzziness parameters.
The key to approximate matching is the Levenshtein edit distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. Set a maximum distance, and any string within it is considered approximately equal. Allowing two changes per word, plus a barrage of about 10 heuristics (a fancy CompSci word for "guesses") to play with the word form, I got an approximate 70% match rate against the Spanish dictionary text. This was good enough to save weeks of manual compilation time, and it had the neat side effect of an amusing correspondence with Jarkko while I suggested improvements to the package.
(For the curious, setting the edit distance to 1 returned too few possible matches. Setting it to 3 generated thousands of false positives.)
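For anyone who wants to see the idea in miniature, here is a rough Python sketch of edit-distance matching. It is not String::Approx and not the original code: the names `edit_distance` and `candidates`, the toy headword list and the sample words are all made up for illustration, and the real job also used the word-ending heuristics, which are not reproduced here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


def candidates(word, headwords, max_edits=2):
    """Return every headword within max_edits of word (hypothetical helper)."""
    return [h for h in headwords if edit_distance(word, h) <= max_edits]


if __name__ == "__main__":
    # Toy Spanish headword list, assumed purely for this example.
    spanish = ["ciudad", "libro", "libre", "noche"]
    # Catalan words to look up.
    for catalan in ("ciutat", "llibre"):
        for k in (1, 2):
            print(catalan, k, candidates(catalan, spanish, max_edits=k))
```

In this toy list, "ciutat" only finds "ciudad" once two edits are allowed. "llibre" (book) finds only the false friend "libre" (free) at one edit, and needs two edits before the correct "libro" turns up alongside it. A list like this is a set of candidates for a translator to check, not a finished answer.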
And the tablet? What has it to do with it all? Frankly, very little, apart from the fact it's superconcentrated programmer fuel, and I make it to an old family recipe. It's as good as the combination of sugar, condensed milk, butter and real vanilla can be… try some (link is PDF, alas).
Modification, 21 Oct 02002: Recipe now available in XHTML, as it should have been all along: http://www3.sympatico.ca/scruss/tablet.html