http://www.perlmonks.org?node_id=816086

kiz has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking to do Fuzzy matching on text, and struggling to come up with a sensible solution.

I'm always comparing 2 strings, so Levenshtein distances are a possibility; however, as String::Approx points out:
"If you want to compare things like text or source code, consisting of words or tokens and phrases and sentences, or expressions and statements, you should probably use some other tool than String::Approx"

Here are some examples of text:
These two obviously match... they'd have a distance of 6, or 14% different
  • Aberdeen University Research Archive: AURA
  • Aberdeen University Research Archive
These two also obviously match... however they'd have a distance of 9, or 37% different
  • Archivio Marini
  • Archivio Giuliano Marini
However these two probably shouldn't match... even though they'd have a distance of 5, or 20% different
  • CCLRC ePublication Archive
  • STFC ePublication Archive
But what on earth does one do here - a distance of 39, over 50% different!
  • arXiv.org e-Print archive (physics, mathematics, related fields)
  • arXiv.org e-Print archive
... and I just despair over this one:
  • Cracow University of Technology Digital Library
  • Biblioteka Cyfrowa Politechniki Krakowskiej (Digital Library of Cracow University of Technology)
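
(For context: the distance and percentage figures above are plain character-level Levenshtein numbers, computed roughly along these lines; Text::Levenshtein here is just one of several modules that will do it.)

    use strict;
    use warnings;
    use Text::Levenshtein qw(distance);
    use List::Util qw(max);

    my @pairs = (
        [ 'Aberdeen University Research Archive: AURA',
          'Aberdeen University Research Archive' ],
        [ 'Archivio Marini', 'Archivio Giuliano Marini' ],
        [ 'CCLRC ePublication Archive', 'STFC ePublication Archive' ],
    );

    for my $pair (@pairs) {
        my ($s1, $s2) = @$pair;
        my $dist = distance($s1, $s2);                        # character-level edit distance
        my $pct  = 100 * $dist / max(length $s1, length $s2);
        printf "distance %2d (%3.0f%% different): '%s' vs '%s'\n", $dist, $pct, $s1, $s2;
    }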

Thoughts, opinions, suggestions gratefully sought....

(I don't think Text::Soundex will help either... :sad:)



-- Ian Stuart
A man depriving some poor village, somewhere, of a first-class idiot.

Replies are listed 'Best First'.
Re: Fuzzy text matching... again
by Limbic~Region (Chancellor) on Jan 07, 2010 at 14:28 UTC
    kiz,
    You are not always comparing two strings. You are mentally tokenizing these strings. Further, you are giving context to them - things in parens at the end do not seem to be important to you. In other words, you have to build a solution that applies the same mental process as yourself to determine if the reference sources are the same. I have built a solution that does this in the past but since I was hired to do it, I can't share the code with you.

    There are plenty of tools to build your own, but you have to figure out how to glue them together. The problem is not simple, so your approach can't be either. I used a layered approach. Let me give you some things to consider (a rough sketch follows the list):

    • Number of tokens in strings
    • Order of tokens in strings
    • Proximity of tokens (Archivio Marini vs Archivio Giuliano Marini vs Archivio Not Very Close At All Marini)
    • Remove unimportant tokens (trailing text in parens)
    • Normalize tokens
      • Abbreviations
      • Typos
      • Alternative spellings (color and colour)
      • Punctuation
    • When using an edit distance on individual tokens, consider the length of the tokens not just the distance
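
    To make that last point concrete, here is a rough, untested sketch of the kind of per-token comparison I mean. The normalization table, the 0.25 cutoff and the choice of Text::Levenshtein are all placeholders you would tune and swap for your own data:

        use strict;
        use warnings;
        use Text::Levenshtein qw(distance);
        use List::Util qw(max);

        # Placeholder normalization table -- you would grow this from your own data.
        my %normalize = ( colour => 'color', univ => 'university' );

        sub tokens {
            my ($s) = @_;
            $s = lc $s;
            $s =~ s/\s*\([^)]*\)\s*$//;    # drop trailing parenthesized text
            $s =~ s/[[:punct:]]+/ /g;      # punctuation becomes whitespace
            return map { $normalize{$_} // $_ } grep { length } split ' ', $s;
        }

        # Fraction of the shorter string's tokens that have a "close" counterpart in
        # the longer string; per-token edit distance is scaled by the token length.
        sub token_similarity {
            my ($x, $y) = @_;
            my @a = tokens($x);
            my @b = tokens($y);
            if (@a > @b) { my @t = @a; @a = @b; @b = @t }   # @a is now the shorter list
            my $hits = 0;
          TOKEN: for my $ta (@a) {
                for my $tb (@b) {
                    my $scaled = distance($ta, $tb) / max(length $ta, length $tb);
                    if ($scaled <= 0.25) { $hits++; next TOKEN }   # arbitrary cutoff
                }
            }
            return @a ? $hits / @a : 0;
        }

        printf "%.2f\n", token_similarity('Archivio Marini', 'Archivio Giuliano Marini');   # 1.00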

    Now consider all the tools in your tool bag and how they may be useful. Here are some examples:

    I can see you have already searched CPAN and know about things like Text::Compare and Text::PhraseDistance, but your strings seem to be publication references. There are a number of modules on CPAN for citations and bibliography references - you may be able to leverage them as well. It would also be helpful to know more about the overall project, because there are some other tools that may be useful. For instance, do you have a known list of publications plus a list that needs to be identified against it, or do you have one huge bunch and are trying to identify duplicates? The approach I would take is different in each case.

    I have a stack full of notes on the topic of text comparison and analysis that I have been meaning to write up at length. If you need more help, speak up.

    Cheers - L~R

      • Remove unimportant tokens (trailing text in parens)

      You certainly have a number of good points. However, parenthesization does not necessarily imply subordination; it can also stand for an alternative name/term, as in the OP's case #5:

      • Cracow University of Technology Digital Library
      • Biblioteka Cyfrowa Politechniki Krakowskiej (Digital Library of Cracow University of Technology)

      where the part in parentheses has a better chance of contributing to a successful match than the unparenthesized part.  Just to illustrate one of the many potential issues the OP might encounter.

      And while we're at it: how would a machine identify what is a name and what not - as in "Archivio Giuliano Marini" - without consulting either a database of common names, or checking against a list of all known regular words (+ inflections) in a particular language?   Even Google translate apparently gets it wrong when translating "Archivio Giuliano Marini" into English (leaving "Archivio" as is, instead of translating it to "archive" — even though you tell it what source language it is), while it gets it right (interestingly) with "Archivio Marini"...
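
      So rather than discarding the parenthesized part, one might treat it as an alternative candidate and keep whichever comparison scores best. A rough sketch (untested; compare() is a stand-in for whatever similarity measure is eventually chosen):

          use strict;
          use warnings;
          use List::Util qw(max);

          # Treat "Main Title (Alternative Title)" as several candidate strings and
          # score a pair on the best of all candidate combinations.
          sub candidates {
              my ($s) = @_;
              my @c = ($s);
              if ($s =~ /^(.*?)\s*\(([^)]+)\)\s*$/) {
                  push @c, $1, $2;    # the part before the parens, and the parenthesized part
              }
              return @c;
          }

          sub best_score {
              my ($x, $y, $compare) = @_;
              my $best = 0;
              for my $cx (candidates($x)) {
                  for my $cy (candidates($y)) {
                      $best = max($best, $compare->($cx, $cy));
                  }
              }
              return $best;
          }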

        almut,
        The determination of what is unimportant is something the OP will have to come up with. To be honest, until you pointed it out as a translation, I had no idea. Obviously nothing will be perfect, which is why I like the "prospector" method and continuous refinement. I know it isn't Bayesian filter heuristics, but I have been quite successful with it.

        And while we're at it: how would a machine identify what is a name and what not

        I can't remember mentioning names anywhere in my response. Name matching has a completely different kind of complexity (Mark and Mary only have an edit distance of 1; Richard, Rich, Dick, Dickie may all refer to the same person; cultural, ethnic and religious variations to a name; gender; variations in converting from native to latin-1; etc). And yes, there are huge databases of names (common and otherwise) to deal with this. Check out Global Name Recognition software owned by IBM for a very expensive solution to that problem.

        Of course, names are not the only specific kind of string that has its own complexity: mailing addresses, dates, identifying an anonymous author, identifying plagiarism, indexing, etc. I limited my advice to the problem as I understood it. I could have missed the mark in thinking these were publication citations, which would make the "what is a name" question germane, but I don't see why a name can't just be treated like any other token. I obviously missed the trailing parens sometimes being important as a translation, but I assume the OP will be capable of constructing an approach for removing unimportant tokens.

        Cheers - L~R

Re: Fuzzy text matching... again
by almut (Canon) on Jan 07, 2010 at 12:35 UTC
    ...obviously match...

    What you call "obvious" involves understanding the information (like we humans do), based on world-knowledge etc.  IOW, to successfully tell similarity as you have it in mind, you'd want artificial intelligence — which is unfortunately (or luckily, however you want to put it :) still not as advanced as some would expect...

Re: Fuzzy text matching... again
by BioLion (Curate) on Jan 07, 2010 at 13:14 UTC

    This seems like a really interesting problem, and one that I am sure a lot of people are working on in one form or another - it reminds me of semantics / natural language processing, and this is where I worry: this is NASA-level stuff!

    Is there any way to simplify your problem? If the strings are generally short, Rata's suggestions could be a good start, but with longer strings the complexity of the comparisons would rapidly increase - say for example comparing the existence and ordering of words / sub-strings.

    Again, though, we come back to the same problems, because you can compare them all you like, but ultimately you need some 'threshold' or series of conditions which you accept as a match, which, as you already point out, is difficult (compare the 'Aberdeen' example with the 'ePublication archive' one!).

    As has already been pointed out, without *understanding* the text, this is nigh impossible... and if you achieve a workable (not even fast) solution, great things await! So I guess I would start with something tangible like binning / flagging the compared pairs into groups of 'similar differences' - e.g. "Interpolated word", "Different word order", "Appended", "Pre-pended", etc.

    At least then if you think of anything clever to distinguish real and false from particular class of comparisons, it is easier to slot in the code...

    This really does sound like an interesting and relevant problem, but I should imagine it will get very complex very fast, unless you can simplify your criteria beyond "obviously the same" and vice versa!
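
    Something like this (untested, and deliberately crude) could do a first pass at that binning; anything it can't place just falls through to 'other' for manual inspection:

        use strict;
        use warnings;

        # Crude first-pass binning of string pairs into coarse difference classes.
        sub classify_pair {
            my ($x, $y) = @_;
            return 'identical' if $x eq $y;

            my ($short, $long) = length $x <= length $y ? ($x, $y) : ($y, $x);
            return 'appended'  if index($long, $short) == 0;    # shorter string is a prefix
            return 'prepended' if rindex($long, $short) == length($long) - length($short);

            my @ws = split ' ', lc $short;
            my @wl = split ' ', lc $long;
            my %in_long = map { $_ => 1 } @wl;

            if (@ws == grep { $in_long{$_} } @ws) {    # every word of the shorter string appears in the longer
                return @ws == @wl ? 'different word order' : 'interpolated word(s)';
            }
            return 'other';
        }

        print classify_pair(
            'arXiv.org e-Print archive (physics, mathematics, related fields)',
            'arXiv.org e-Print archive',
        ), "\n";    # prints "appended"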

    Just a something something...

      This is NASA-level stuff!

      Well... but so is this.

Re: Fuzzy text matching... again
by Ratazong (Monsignor) on Jan 07, 2010 at 12:53 UTC

    An interesting problem. ;-)

    It seems you need more criteria than just Levenshtein distance. I would additionally implement some word-based checks (whose results you have to combine somehow). My ideas are:

    • if one string is substantially longer than the other, check if all words of the shorter string are contained in the longer
    • if some words contain mostly capital letters, expect them to be in both strings (and exclude them from the Levenshtein-check)

    Of course you need to experiment with the weights of the additional checks (what "substantially longer" means (more words?), what counts as "mostly capital letters" (>=80%?) ...), and of course you can easily construct examples where they won't help you (e.g. George W. Bush Jr. is not the same as George W. Bush Sr.) ... but maybe it helps.
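
    A quick, untested sketch of those two checks (the >=80% cutoff and the whitespace word splitting are of course just first guesses):

        use strict;
        use warnings;

        # Check 1: are all words of the shorter string contained in the longer one?
        # (You would only apply this when one string is substantially longer.)
        sub short_words_in_long {
            my ($x, $y) = @_;
            my @wx = split ' ', lc $x;
            my @wy = split ' ', lc $y;
            my ($short, $long) = @wx <= @wy ? (\@wx, \@wy) : (\@wy, \@wx);
            my %in_long = map { $_ => 1 } @$long;
            return !grep { !$in_long{$_} } @$short;
        }

        # Check 2: words that are mostly capital letters (acronyms; >= 80% is an
        # arbitrary cutoff) must appear in both strings.
        sub acronyms_present_in_both {
            my ($x, $y) = @_;
            for my $dir ([$x, $y], [$y, $x]) {
                my ($from, $to) = @$dir;
                for my $word (split ' ', $from) {
                    my $caps = () = $word =~ /[A-Z]/g;
                    next unless length($word) >= 2 && $caps / length($word) >= 0.8;
                    return 0 if index($to, $word) < 0;
                }
            }
            return 1;
        }

        print short_words_in_long('Archivio Marini', 'Archivio Giuliano Marini')
            ? "contained\n" : "not contained\n";
        print acronyms_present_in_both('CCLRC ePublication Archive', 'STFC ePublication Archive')
            ? "acronyms ok\n" : "acronym mismatch\n";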

    Please keep us up-to-date with your solution!

    Rata
Re: Fuzzy text matching... again
by stefbv (Curate) on Jan 07, 2010 at 13:29 UTC
    Just an idea... no matter how complex your criteria are, they will not cover all the possibilities, but maybe (depending on the amount of data we are talking about here) you can log the data that doesn't match for later review by a human operator.
Re: Fuzzy text matching... again
by Anonymous Monk on Jan 07, 2010 at 14:55 UTC
    You might consider looking at this. I have had success with it in a few projects that required the approximate string matching you describe.
Re: Fuzzy text matching... again
by NiJo (Friar) on Jan 07, 2010 at 16:19 UTC
    A couple of obvious stop words should be deleted from the strings before comparing. I'd also play with comparing and rearranging parts of the strings, without suffering from the explosion of possible combinations. Giving additional words a low score could also help.
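
    For the stop-word part, something as simple as this might already help (the word list is only a guess at what counts as noise for these archive names):

        use strict;
        use warnings;

        # Strip a few obvious stop words before running any distance measure.
        my %stop = map { $_ => 1 } qw(the of and a an);

        sub strip_stop_words {
            my ($s) = @_;
            return join ' ', grep { !$stop{lc $_} } split ' ', $s;
        }

        print strip_stop_words('Digital Library of Cracow University of Technology'), "\n";
        # prints "Digital Library Cracow University Technology"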

    If your final goal is a database cleanup, I'd see all algorithms only as an aid for the human editor. Presenting him with a structured text file of potential matches is all I'd do. With a proper format the editor can cut and paste to reassign the remaining errors.

Re: Fuzzy text matching... again
by spx2 (Deacon) on Jan 07, 2010 at 13:43 UTC
    I'd be interested in a solution also. I've been recommended Text::Soundex on several occasions, but I really don't know how good it is (I haven't tried it). It's certainly a problem where one metric alone won't suffice.
Re: Fuzzy text matching... again
by FloydATC (Deacon) on Jan 07, 2010 at 20:06 UTC
    Just a thought: How about splitting each string into a set of words and see how the two sets compare? Unlike a simple substring check this would go a long way towards solving the last example of yours.
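
    For example (untested), a plain set overlap on lowercased words, Jaccard-style:

        use strict;
        use warnings;

        # Compare two strings as sets of lowercased words
        # (Jaccard index: shared words divided by total distinct words).
        sub word_set_similarity {
            my ($x, $y) = @_;
            my %a = map { lc($_) => 1 } $x =~ /\w+/g;
            my %b = map { lc($_) => 1 } $y =~ /\w+/g;
            my $common = grep { $b{$_} } keys %a;
            my %union  = (%a, %b);
            my $total  = scalar keys %union;
            return $total ? $common / $total : 0;
        }

        printf "%.2f\n", word_set_similarity(
            'Cracow University of Technology Digital Library',
            'Biblioteka Cyfrowa Politechniki Krakowskiej (Digital Library of Cracow University of Technology)',
        );   # 0.60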

    -- Time flies when you don't know what you're doing