Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot

Fuzzy matching to user's tastes?

by Seumas (Curate)
on Jun 27, 2003 at 15:21 UTC ( [id://269619] : perlquestion . print w/replies, xml ) Need Help??

Seumas has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to add a feature to my site that let's users see a list of items the system believes they may like based on their prior selected items. I have no mathematical background so I'm sure there are far superior ways to accomplish this with great accuracy than what I'm planning on trying. If you have a site or node that I should familiarize myself with, please guide me to it. I haven't found much to help me with ideas or algorithms for this so far.

My plan, however, is to do something like this:

  • Parse the title of all user's previously selected items. Throw out common words like "the, and, is, size".
  • Build a hash where the key is each of the unique words found. As we parse the titles, we see if a key exists(). If it does, we just ++ it's value. If it doesn't, we create the key in the hash and ++ it.
  • We convert this hash to something storable so I can stick the results in an SQL database. (Is there any way to do this other than just creating one long field in the database like "word::count, word2::count" and parsing it outside of the database each time? I'm not sure how I would store that data in a table since I obviously can't create all the columns ahead of time if I let each column stand for a unique word (no, I'm not a DBA either).
  • We keep track of the number of selections in each category. Each category gets a ++ for each item that has been selected in a category by the user.
  • To find the top associated items for this user to look at, we query the database for their record, split the record and stuff the word::count values into a hash again. Then we compare each of the words in the title of the new item to the top N words in the user's record. We ++ a match-value for the new item for each word in its title that matches our user's records. Then we ++ the match-value if it is in a category the user has selected before.

For example, if we have an item called "really ugly petite blue pants", we would compare each word against the user's records (that we've pulled out and stuffed into a hash). We find that petite and pants match in the top of the user's records (petite and pants are some of that most frequently encountered words in the titles of items the user has selected in the past). Then we check to see if the category that the item is in matches any categories the user has previously selected items from. If it does, we add to its score. We do this for each item and then the top N items on the entire site are displayed on a page for the user to look at.

I think this would work. I'm not sure how well, but... it would work. My biggest fear is that this is a hell of a lot of processing to do! Especially if you're talking about users who may have hundreds, thousands or even tens of thousands of items in their history and a site with easily 5,000 items to match against. Imagine having to do the above steps 5,000 times and storing it all in a hash temporarily while you pick through what the top items are and build up the scores!

So I'm looking for improvements, alternatives... most any suggestions whatsoever.

Added <readmore> at author's request - dvergin 2003-06-27

Replies are listed 'Best First'.
Re: Fuzzy matching to user's tastes?
by tilly (Archbishop) on Jun 27, 2003 at 15:36 UTC
    It sounds to me like you want to read Building a Vector Space Search Engine in Perl.

    However one tip. Take a look at Amazon. The key to what they do is they look at purchase patterns. If someone buys X and Y, and someone else has just bought X, then recommend Y to them. The words in the name are unlikely to be a good indicator of what else they might like. But odds are that their tastes are similar to someone else who just bought the same thing.

    I haven't thought about how to implement that, but were I you I would let the idea sit, and start thinking about what I need to track about people to do that.

      The vector space search looks like it could possibly handle what I need, although if it keeps everything in-memory, it's going to eat up a ton of resources (it would have to parse the entire group of documents each and every time the script is run). Perhaps using the Storables method above along with this would be a good solution.

      The other difficulty (and why the amazon-method would not work) is that this is an auction site. It is very unlikely that two people would ever purchase (ie, win) the same item with the exact same title. And because it's an auction site, those 5,000 or more items will not always be the same 5,000 items. In fact, the turnover rate would be in the hundreds-per-day.

      Also, these "documents" would of course be contained in a database. I'm not sure how much that throws things off. I suspect it doesn't from my reading of the article you linked to.

        If you are worried about only one person getting any specific item, look at who bids instead. Then use some kind of classifier on the bidding patterns to find each user's k nearest neighbors for some k. When a user bids on something, message his neighbors (unless they've been messaged already).

      Tilly, do you know of any implementation of this using a database? I'm sure I could manage to move things around so that I could use it with Postgresql, but if that's already been done, why reinvent the wheel. I've dug around and have not found such an example, though.

      Just taking my chances and asking. Never know . . . :)
        Sorry, I don't. But from there you should get some names of people who do this stuff, and you might get something better from asking them.

        What I tend to do is remember various useful resources. When I see a question I can often pair it with something which is likely to be a useful starting point. But my ability to do that just means that I saw and remembered that something relates - I don't necessarily know much about the topic in question (nor would I claim to).

        And so it is in this case...

Re: Fuzzy matching to user's tastes?
by cfreak (Chaplain) on Jun 27, 2003 at 15:47 UTC

    That's a pretty cool idea though I don't think it has to be quite that complicated. I think you could probably get away with not storing the entire user's history but instead storing maybe the last 10 or so items. Then every time they search get the top keywords and save them in a more permenant location. The top words from your perminent location have the highest rank followed by all key words from the last 10. Feed all that to a site search and return the most relevent results.

    Lobster Aliens Are attacking the world!
Re: Fuzzy matching to user's tastes?
by Tomte (Priest) on Jun 27, 2003 at 15:32 UTC

    Store and retrieve your datastructure(s) with Storable, freeze your data before storing, and thaw it after fetching from the store.

    As for the other questions, I can't think anymore today, just an hour more, and I can call it a week...


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.