http://www.perlmonks.org?node_id=996530

BUU has asked for the wisdom of the Perl Monks concerning the following question:

This is slightly more of a conceptual/design problem than an actual code problem, so I'm mostly hoping for suggestions of avenues to explore or modules that do similar things.

I have a largish collection of sets (10,000+) that will continue to grow over time. Every set is made up of around 15 to 100 one- to four-word phrases, with duplicates. I want to be able to store these sets in my database and then retrieve *similar* sets based on the current set I'm examining.

That is, I have a page in my application that, given a specific set (either pulled from the database or entered by the user), needs to go to the database and find all the other sets that are similar, ideally with a configurable variable denoting how much difference is acceptable, e.g. 1% or 5% difference between the sets.

There are a couple of somewhat obvious answers, but most of those involve doing the set comparisons in the Perl layer, which has the undesirable requirement of loading every single existing set out of the database before you can compare against them.

My ideal implementation allows me to perform a single select that will pull out all of the sets that are similar, but I'm open to alternatives, hopefully ones that don't involve comparing against every single existing set.

Thoughts?


Replies are listed 'Best First'.
Re: Comparing sets of phrases stored in a database?
by BrowserUk (Patriarch) on Sep 30, 2012 at 20:19 UTC

    The first thing you need to do is define how you, as a human being, would judge the similarity of the sets.

    For example, you start with a set (A), and you make an exact copy (B). You will (presumably) judge these as very similar.

    • What if you removed one of the phrases? Are they still similar?

      If the original set contains 100 phrases, and you remove phrases one at a time from the duplicate, does the similarity drop linearly?

    • What if you reversed the words in all of the phrases in the second set? Is it still very similar, or completely dissimilar?

      Is the ordering of the words within a phrase important?

    • How about if you removed one word from each phrase in the second set?

      Do the phrases need to be exactly the same to be counted as similar?

    • How about if you looked up each word in a thesaurus and substituted the nearest alternative word. Similar? Dissimilar?

      Are you looking for semantic similarity?

    • How about if you misspelled every word by one character -- an omission, an insertion, or a transposition. Similar? Dissimilar?

      Can typos occur? Is it possible for you to correct them?

    • How about if you reversed the ordering of the phrases in the second set. Similar? Dissimilar?

      Are the sets ordered or unordered?

    • If one set consists entirely of "large blue woolen jumper" and the other "Angora sweater, navy, XL". Similar? Dissimilar?

      Semantics again.

    Once you've decided how you would make the judgement, then you stand some chance of being able to lay out a set of rules. And once you have that, you can start to look for a good way to implement them.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    RIP Neil Armstrong

      Fortunately I don't actually have to deal with any of that. My actual set of phrases will conform to a corpus of roughly 15,000 existing items, so there are no typos, misspellings or synonyms involved.

      While technically each item in the set is a phrase, for the purposes of this discussion it can be treated as a unique ID of any sort you prefer, but probably a number, probably generated by a hash function.
        My actual set of phrases will conform to a corpus of roughly 15,000 existing items, so there are no typos, misspellings or synonyms involved.

        Then, I would approach the problem this way.

        1. Store the corpus of phrases in its own table, each phrase with a unique numeric ID.
        2. Each set of phrases then becomes a bitfield, with one bit set at the appropriate position for each phrase that set contains.
        3. Your similarity can then be some heuristic based on the population counts (bit counts) of ANDing and XORing the two bitstrings that represent the sets.

          The population count of the result of ANDing two sets' bitstrings will tell you how many phrases they have in common;

          The population count of the result of XORing two sets' bitstrings will tell you how many phrases appear in one set but not the other.

          You can then combine those two numbers mathematically to reflect whether the sharing of phrases is more important than having phrases not in common -- or vice versa -- and come up with a single number for each pairing that you can then apply a threshold value to.
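        A minimal Perl sketch of steps 2 and 3, done outside the database to show the arithmetic (the phrase IDs and corpus size below are made up for the example; the combination used here is a Jaccard-style ratio, one of several ways the two counts could be merged into a single score):

        ```perl
        use strict;
        use warnings;

        # Build a bitstring for a set of numeric phrase IDs (one bit per corpus phrase).
        sub set_to_bits {
            my ($ids, $corpus_size) = @_;
            my $bits = "\0" x int( ($corpus_size + 7) / 8 );
            vec($bits, $_, 1) = 1 for @$ids;
            return $bits;
        }

        # Jaccard-style similarity: shared phrases over total distinct phrases.
        sub similarity {
            my ($a, $b) = @_;
            my $common = unpack '%32b*', ($a & $b);   # popcount of the AND
            my $diff   = unpack '%32b*', ($a ^ $b);   # popcount of the XOR
            return 1 if $common + $diff == 0;         # two empty sets: call them identical
            return $common / ($common + $diff);       # 1.0 = identical, 0.0 = disjoint
        }

        my $x = set_to_bits([1, 5, 9, 12], 16);
        my $y = set_to_bits([1, 5, 9, 14], 16);
        printf "similarity: %.2f\n", similarity($x, $y);   # 3 common, 2 differing -> 0.60
        ```

        A threshold such as "keep anything scoring above 0.95" then corresponds directly to the "1% or 5% difference" knob asked for above.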

        You'd need a DB that supports bitstrings -- postgresql and mysql seem to -- and AND/XOR & popcount of bitstrings. I couldn't (from a quick look) see a popcount function, but (at least in the case of PgSQL), it should be a simple thing to add a PL/Perl function to do this using Perl's

        $popcount = unpack '%b*', $bitstring;   # population count (number of set bits)
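        As a rough, untested sketch of what such a PL/Perl function might look like in PostgreSQL (the function name is illustrative; note that PL/Perl receives a `bit varying` argument as a text string of '0'/'1' characters, so counting the '1' characters with tr/// does the job directly):

        ```sql
        -- Requires the plperl language to be installed in the database.
        CREATE OR REPLACE FUNCTION popcount(bits bit varying) RETURNS integer AS $$
            my ($bits) = @_;
            # PL/Perl sees the bit string as text like '10110', so count the '1's.
            return ($bits =~ tr/1//);
        $$ LANGUAGE plperl IMMUTABLE;
        ```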

        Food for thought perhaps.



        What is *similar* for you, then?

Re: Comparing sets of phrases stored in a database?
by erix (Prior) on Sep 30, 2012 at 20:33 UTC
Re: Comparing sets of phrases stored in a database?
by sundialsvc4 (Abbot) on Oct 01, 2012 at 15:56 UTC

    As a slight parenthetical comment to BrowserUK's excellent (and heavily up-voted) advice, I have also found it useful in some situations to create an additional mathematical set (bit-string ...) which represents "categories" of words that are present in the result. (The word "category" meaning precisely whatever-it-makes-sense-to-you for it to mean.) The idea is that a set could therefore quickly be eliminated from consideration if its set of categories-present is not the same. (If the category taxonomy contained, say, less than 32 bits, it could be represented as an integer.) The idea being to reduce the search-space, if say some kind of database I/O is required.
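    One way to apply that idea in Perl (the category names below are invented for the example, and the subset test shown is just one possible elimination rule):

    ```perl
    use strict;
    use warnings;

    # Each set gets a small integer whose bits mark which word categories
    # appear in it; with fewer than 32 categories it fits in a plain integer.
    my %category_bit = ( clothing => 0, colour => 1, size => 2, material => 3 );

    sub category_mask {
        my @categories = @_;
        my $mask = 0;
        $mask |= 1 << $category_bit{$_} for @categories;
        return $mask;
    }

    my $query_mask     = category_mask(qw(clothing colour));        # 0b0011
    my $candidate_mask = category_mask(qw(clothing colour size));   # 0b0111

    # Cheap pre-filter: only do the expensive set comparison when every
    # category in the query also appears in the candidate.
    if ( ($query_mask & $candidate_mask) == $query_mask ) {
        print "candidate survives the pre-filter\n";
    }
    ```

    Stored alongside each set, such an integer can be compared with a single bitwise AND before any per-phrase work is done.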

    The notion is 150% dependent upon the needs of this application scenario. First of all, the notion of creating a digest at all must "make sense" in terms of the application spec ... there must be a "natural taxonomy" that the app can usefully exploit "for free." Then, it must be shown to be called-for, and beneficial. For instance, in some high-performance critical scenarios, it might well be contra-indicated to store the bit-strings in any sort of database at all: even a brute-force search of data, if known to entirely fit within the process's projected resident working-set, avoids all I/O operations entirely. But if the data were much larger, the notion of creating a lookup digest might apply.