XP is just a number PerlMonks

### Comparing Approximate Items

by artist (Parson)
 on Jan 08, 2003 at 16:54 UTC Need Help??
artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I have situation which I can best describe as follow.

I have 2 sets A and B. Each sets contains few strings. I have to compare and find out approximately which string of set A matches with which string of set B. It's possible that some sttrings will go unmatched. Approximately means differing in letters by 1 to 3 characters. Each string length is max 10 characters. Sets A and B doesn't contain the same number of strings and number of strings in each set could go from 1 to 25.
In actual case these sets are merged and it is most likely that set B string comes after set A string. But that's not true always. To give u an example, I may have patterns like 1. ABAB or 2. ABBA or 3. ABB or 4. ABABAAB. If I get the scenes like AAB or ABB it may not be very clear as which pairs are matching and that's why I need string approximation.

Hope, I explained it properly, feel free to ask anything more.

Thanks,
Artist

Replies are listed 'Best First'.
Re: Comparing Approximate Items
by tall_man (Parson) on Jan 08, 2003 at 17:12 UTC
It sounds like you need String::Approx, using amatch with a modified edit distance of 3. Tie::Hash::Approx might be even better except that it doesn't allow you to modify the edit distance, and the default distance of 10% is too close for what you are asking for. Perhaps a hand-hacked version of it with a different compare function would do?

Modified to include CPAN refs and an additional comment.

Hi,
For I just have one correction Sets A and B doesn't contain the same number of strings
Should read : Sets A and B may not contain the same number of strings

I use use Algorithm::Diff and that works approximately ok for my match-purpose.

It's not obvious to me how Algorithm::Diff can be used to compute edit distances. Do you call diff() and then count up the number of insertions and deletions in the result?

Suppose you have several equidistant matches from a given element in A to several elements in B. Do you have to pick one of them such that the overall number of pairings is maximized? It might be an NP-complete problem.

Re: Comparing Approximate Items
by dree (Monsignor) on Jan 08, 2003 at 23:11 UTC
You could use Text::Levenshtein

It is an edit distance, i.e. it is a measure of the degree of proximity between two strings.
So for example, distance("foo","four") is 2 because you need an edit "SUBSTITUTE" and an edit "INSERT".

As algorithm I suggest the 'Stable Marriage Problem', a matching algorithm to best fit the "marriage preferences" of two sets.
I think you are right. Text::Levenshtein is better in this case because String::Approx will match substrings of the input. Here is one more thing. If substitutions are not allowed, only inserts and deletes, you could use Text::WagnerFischer to set the cost of substitution so high that it will not be used.
```use Text::Levenshtein;
use Text::WagnerFischer;

my \$pat = 'AAB';
my @lst = qw(ABAB ABBA ABB ABABAAB);
my @dist1 = Text::Levenshtein::distance(\$pat, @lst);
my @dist2 = Text::WagnerFischer::distance([0, 1, 100], \$pat, @lst);

my (\$i, \$item);
\$i = 0;
foreach \$item (@lst) {
print "Levenshtein distance of \$item to \$pat is ",\$dist1[\$i],"\n";
print "WagnerFischer distance of \$item to \$pat is ",\$dist2[\$i],"\n"
+;
\$i++;
}

Create A New User
Node Status?
node history
Node Type: perlquestion [id://225296]
Approved by valdez
help
Chatterbox?
 [davido]: Poor dazz.

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (4)
As of 2018-05-24 06:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
World peace can best be achieved by:

Results (174 votes). Check out past polls.

Notices?