Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Comparing Approximate Items

by artist (Parson)
on Jan 08, 2003 at 16:54 UTC ( #225296=perlquestion: print w/ replies, xml ) Need Help??
artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I have situation which I can best describe as follow.

I have 2 sets A and B. Each sets contains few strings. I have to compare and find out approximately which string of set A matches with which string of set B. It's possible that some sttrings will go unmatched. Approximately means differing in letters by 1 to 3 characters. Each string length is max 10 characters. Sets A and B doesn't contain the same number of strings and number of strings in each set could go from 1 to 25.
In actual case these sets are merged and it is most likely that set B string comes after set A string. But that's not true always. To give u an example, I may have patterns like 1. ABAB or 2. ABBA or 3. ABB or 4. ABABAAB. If I get the scenes like AAB or ABB it may not be very clear as which pairs are matching and that's why I need string approximation.

Hope, I explained it properly, feel free to ask anything more.

Thanks,
Artist

Comment on Comparing Approximate Items
Re: Comparing Approximate Items
by tall_man (Parson) on Jan 08, 2003 at 17:12 UTC
    It sounds like you need String::Approx, using amatch with a modified edit distance of 3. Tie::Hash::Approx might be even better except that it doesn't allow you to modify the edit distance, and the default distance of 10% is too close for what you are asking for. Perhaps a hand-hacked version of it with a different compare function would do?

    Modified to include CPAN refs and an additional comment.

      Hi,
      For I just have one correction Sets A and B doesn't contain the same number of strings
      Should read : Sets A and B may not contain the same number of strings

      I use use Algorithm::Diff and that works approximately ok for my match-purpose.

      Aritst

        It's not obvious to me how Algorithm::Diff can be used to compute edit distances. Do you call diff() and then count up the number of insertions and deletions in the result?

        Suppose you have several equidistant matches from a given element in A to several elements in B. Do you have to pick one of them such that the overall number of pairings is maximized? It might be an NP-complete problem.

Re: Comparing Approximate Items
by dree (Monsignor) on Jan 08, 2003 at 23:11 UTC
    You could use Text::Levenshtein

    It is an edit distance, i.e. it is a measure of the degree of proximity between two strings.
    So for example, distance("foo","four") is 2 because you need an edit "SUBSTITUTE" and an edit "INSERT".

    As algorithm I suggest the 'Stable Marriage Problem', a matching algorithm to best fit the "marriage preferences" of two sets.
      I think you are right. Text::Levenshtein is better in this case because String::Approx will match substrings of the input. Here is one more thing. If substitutions are not allowed, only inserts and deletes, you could use Text::WagnerFischer to set the cost of substitution so high that it will not be used.
      use Text::Levenshtein; use Text::WagnerFischer; my $pat = 'AAB'; my @lst = qw(ABAB ABBA ABB ABABAAB); my @dist1 = Text::Levenshtein::distance($pat, @lst); my @dist2 = Text::WagnerFischer::distance([0, 1, 100], $pat, @lst); my ($i, $item); $i = 0; foreach $item (@lst) { print "Levenshtein distance of $item to $pat is ",$dist1[$i],"\n"; print "WagnerFischer distance of $item to $pat is ",$dist2[$i],"\n" +; $i++; }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://225296]
Approved by valdez
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (14)
As of 2014-07-11 14:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (227 votes), past polls