perl_seeker has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
yet another problem on which I need your advice. I am trying to use the Text::Levenshtein module authored by Dree Mistrut from CPAN, to calculate the Levenshtein distance between two strings.

I could of course email the author, but I thought I would post it here too.

The module seems to work with ASCII strings, e.g. the following strings in the Times New Roman font (ASCII):
test tent distance: 1
But I need to work with a font for another language (not English): the AS-TTDurga font for Assamese, which uses ISCII encoding.

So my ISCII strings look like this when I set the font to Times New Roman (I compose my ISCII text in a word processor that supports the AS-TTDurga font, then save it as plain text, e.g. in a Notepad file):
] ]] distance: 0
When I use this module on any two ISCII strings, I keep getting a distance of 0, which is not the result I need.

In the AS-TTDurga font, a single letter (vowel or consonant) may sometimes be mapped to two or more ASCII characters, e.g.
letter a in my font = ASCII chars sd
letter b in my font = ASCII chars !#
So for the string ab, I would have sd!# in my text file.
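
The mapping described above can be handled by splitting the text into letters before any distance is computed, so that each multi-character code counts as one unit. The following is only a minimal sketch: the table %letter_for and the sub tokenize are hypothetical names, the table holds only the two example codes given above, and a real table would need to cover the whole font.

```perl
use strict;
use warnings;

# Hypothetical mapping from font character sequences to letters.
# Only the two example codes from the post are listed here.
my %letter_for = (
    'sd' => 'a',
    '!#' => 'b',
);

# Build an alternation that tries longer codes first, so that a
# multi-character code is never split into shorter ones.
my $pattern = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %letter_for;

# Split a font-encoded string into a list of letters.
# Characters with no entry in the table are passed through unchanged.
sub tokenize {
    my ($text) = @_;
    my @letters;
    while ($text =~ /\G(?:($pattern)|(.))/gs) {
        push @letters, defined $1 ? $letter_for{$1} : $2;
    }
    return @letters;
}

print join(' ', tokenize('sd!#')), "\n";
```

The resulting list can then be compared letter by letter instead of character by character.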

How do I get this module to work? (Please bear with my ignorance.) Or, if I need to write my own code to calculate the distance between the strings, how do I go about it?
Please help!

Replies are listed 'Best First'.
Re: Text::Levenshtein question
by polettix (Vicar) on Mar 31, 2005 at 10:35 UTC
      Hi Flavio,
      thanks a lot for the info, and the link to the script (great!). I'll probably give this a try, i.e. convert to UTF-8, then try Text::Levenshtein.

      Thanks to Zaxo and Pustular Postulant too for their comments and docs referred.

Unsupported Language Encoding (Re: Text::Levenshtein question)
by Zaxo (Archbishop) on Mar 31, 2005 at 10:35 UTC

    I had a fine answer for you - translate to utf8 and Text::Levenshtein ought to work. Unfortunately, a quick grep for "ASSAM" over /usr/lib/perl5/5.8.4/unicore/*.txt turns up nothing. Encode::Supported also admits to weakness in Indic languages, with no mention of ISCII.

    There are influential linguists in the perl crowd. Perhaps support could be added, but you'd probably need to help with the effort.

    After Compline,

Re: Text::Levenshtein question
by cog (Parson) on Mar 31, 2005 at 10:27 UTC
    authored by Dree Mistrut

    I could of course email the author

    It seems that somebody forgot to update the documentation of Text::Levenshtein, but the truth is that the module no longer seems to be maintained by Dree. It is on Josh Goldberg's CPAN directory, so if you eventually contact somebody, it probably should be him.


      thanks for the info.

Re: Text::Levenshtein question
by tlm (Prior) on Mar 31, 2005 at 10:27 UTC

    I don't know about ISCII, but Perl 5.8.x can work with Unicode (which supposedly subsumes ISCII). Unless you come across something more specific, you probably should start with

    % perldoc perlunicode
    % perldoc perllocale

    the lowliest monk

Re: Text::Levenshtein question
by bageler (Hermit) on Mar 31, 2005 at 16:38 UTC
    Josh Goldberg here! As others have said, this module (like most of Perl) will not be very happy with multibyte characters unless Perl is told what it is dealing with, because the algorithm depends heavily on substr. I've never worked with ISCII, but Japanese multibyte encodings behaved as I describe.
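
    To illustrate the point about telling Perl what it is dealing with, here is a minimal sketch that compares the same two strings once as raw bytes and once as decoded characters. It assumes the text has already been converted to UTF-8 (Encode has no ISCII support, as noted elsewhere in this thread). The sub distance is a plain dynamic-programming Levenshtein written for this example, not the code of Text::Levenshtein.

```perl
use strict;
use warnings;
use Encode qw(decode);
use List::Util qw(min);

# Levenshtein distance between two lists of items (array refs),
# computed row by row with the usual dynamic-programming recurrence.
sub distance {
    my ($s, $t) = @_;
    my @prev = (0 .. scalar @$t);
    for my $i (1 .. scalar @$s) {
        my @cur = ($i);
        for my $j (1 .. scalar @$t) {
            my $cost = $s->[$i - 1] eq $t->[$j - 1] ? 0 : 1;
            push @cur, min($prev[$j] + 1,            # deletion
                           $cur[$j - 1] + 1,         # insertion
                           $prev[$j - 1] + $cost);   # substitution
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Two single Bengali-block characters (U+0995 and U+09C0), given as UTF-8 bytes.
my $bytes_a = "\xE0\xA6\x95";
my $bytes_b = "\xE0\xA7\x80";

# Without decoding, Perl sees three bytes per string.
my $byte_dist = distance([split //, $bytes_a], [split //, $bytes_b]);

# After decoding, each string is one character.
my $char_dist = distance([split //, decode('UTF-8', $bytes_a)],
                         [split //, decode('UTF-8', $bytes_b)]);

print "bytes: $byte_dist, characters: $char_dist\n";   # bytes: 2, characters: 1
```

    The byte-level result overcounts because one differing letter shows up as two differing bytes.
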
      Hello Josh,

      I think I'll do as mentioned above. I'll be glad to know of any support for ISCII added to the module.