Seumas has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to add a feature to my site that lets users see a list of items the system believes they may like, based on the items they have previously selected. I have no mathematical background, so I'm sure there are far more accurate ways to accomplish this than what I'm planning to try. If you know of a site or node that I should familiarize myself with, please point me to it. I haven't found much to help me with ideas or algorithms for this so far.

My plan, however, is to do something like this:

  • Parse the titles of all of a user's previously selected items. Throw out common words like "the, and, is, size".
  • Build a hash where the key is each of the unique words found. As we parse the titles, we see if a key exists(). If it does, we just ++ its value. If it doesn't, we create the key in the hash and ++ it.
  • We convert this hash to something storable so I can stick the results in an SQL database. (Is there any way to do this other than creating one long field in the database like "word::count, word2::count" and parsing it outside of the database each time? I'm not sure how I would store that data in a table, since I obviously can't create all the columns ahead of time if each column stands for a unique word. No, I'm not a DBA either.)
  • We keep track of the number of selections in each category. Each category gets a ++ for each item the user has selected in that category.
  • To find the top associated items for this user to look at, we query the database for their record, split the record and stuff the word::count values into a hash again. Then we compare each of the words in the title of the new item to the top N words in the user's record. We ++ a match-value for the new item for each word in its title that matches our user's records. Then we ++ the match-value if it is in a category the user has selected before.
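To make the counting and storage steps above concrete, here is a minimal sketch. The stop-word list, the sample titles, the user id 42, and the helper names count_words and count_rows are all made up for illustration. It also shows one possible alternative to the single "word::count" field: one (user_id, word, count) row per word, which is only an assumption about how the table could be laid out.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(the and is size);

# Count the words in a list of titles, skipping stop words.
sub count_words {
    my @titles = @_;
    my %count;
    for my $title (@titles) {
        for my $word (split ' ', lc $title) {
            $word =~ s/[^a-z0-9]//g;              # strip punctuation
            next if $word eq '' || $stop{$word};
            $count{$word}++;                      # creates the key if it is missing
        }
    }
    return \%count;
}

# Flatten the counts into [user_id, word, count] rows, one row per word.
# Assumed table layout: (user_id, word, count), not one packed field.
sub count_rows {
    my ($user_id, $count) = @_;
    return map { [ $user_id, $_, $count->{$_} ] } sort keys %$count;
}

my $counts = count_words(
    'Petite Blue Pants',
    'The Petite Red Dress',
    'Black Pants size 6',
);
my @rows = count_rows(42, $counts);
printf "%d\t%s\t%d\n", @$_ for @rows;
```

Note that `$count{$word}++` creates a missing key on its own, so the separate exists() check is not strictly needed. With rows like these, the "top N words" for a user could come from an ORDER BY on the count column instead of being split out of a string each time.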

For example, if we have an item called "really ugly petite blue pants", we would compare each word against the user's records (that we've pulled out and stuffed into a hash). We find that petite and pants match in the top of the user's records (petite and pants are some of the most frequently encountered words in the titles of items the user has selected in the past). Then we check to see if the category that the item is in matches any categories the user has previously selected items from. If it does, we add to its score. We do this for each item and then the top N items on the entire site are displayed on a page for the user to look at.
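A rough sketch of that scoring for the example item might look like the following. The word counts, the category name womens_clothing, the cutoff of 3 top words, and the helper names top_words and score_item are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical user record: word => times seen in titles of selected items.
my %word_count = (petite => 9, pants => 7, dress => 4, red => 2, wool => 1);

# Hypothetical category record: category => number of selections.
my %category_count = (womens_clothing => 12);

# Return the $n most frequent words as a lookup hash.
sub top_words {
    my ($count, $n) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    $n = @sorted if $n > @sorted;
    return { map { $_ => 1 } @sorted[0 .. $n - 1] };
}

# One point per title word found in the top words,
# plus one point if the user has selected from the item's category before.
sub score_item {
    my ($item, $top, $categories) = @_;
    my $score = 0;
    for my $word (split ' ', lc $item->{title}) {
        $score++ if $top->{$word};
    }
    $score++ if $categories->{ $item->{category} };
    return $score;
}

my $top  = top_words(\%word_count, 3);
my $item = {
    title    => 'really ugly petite blue pants',
    category => 'womens_clothing',
};
print score_item($item, $top, \%category_count), "\n";   # prints 3
```

Here "petite" and "pants" each add one point and the category match adds a third.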

I think this would work. I'm not sure how well, but... it would work. My biggest fear is that this is a hell of a lot of processing to do! Especially if you're talking about users who may have hundreds, thousands or even tens of thousands of items in their history and a site with easily 5,000 items to match against. Imagine having to do the above steps 5,000 times and storing it all in a hash temporarily while you pick through what the top items are and build up the scores!
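To get a feel for the size of that job, here is a sketch of the whole pass: score every item once, then keep the best N. The item list, the top words, the category name, and the helper name top_items are hypothetical. As far as I can tell, the work per item is one hash lookup per title word plus one category lookup, so it grows with the number of items and the title length rather than with the size of the user's history (that was already boiled down to counts).

```perl
use strict;
use warnings;

# Score every item once, then return the $n best as [score, item] pairs.
sub top_items {
    my ($items, $top_words, $categories, $n) = @_;
    my @scored;
    for my $item (@$items) {
        # grep in scalar context counts the title words that are top words.
        my $score = grep { $top_words->{$_} } split ' ', lc $item->{title};
        $score++ if $categories->{ $item->{category} };
        push @scored, [ $score, $item ] if $score > 0;   # drop non-matches early
    }
    # Highest score first; ties broken by title so the order is stable.
    @scored = sort { $b->[0] <=> $a->[0] || $a->[1]{title} cmp $b->[1]{title} } @scored;
    splice @scored, $n if @scored > $n;
    return @scored;
}

# Hypothetical data standing in for the user's record and the site's items.
my %top_words  = map { $_ => 1 } qw(petite pants dress);
my %categories = (womens_clothing => 12);
my @items = (
    { title => 'really ugly petite blue pants', category => 'womens_clothing' },
    { title => 'petite red dress',              category => 'womens_clothing' },
    { title => 'mens wool socks',               category => 'mens_clothing'   },
    { title => 'blue pants',                    category => 'mens_clothing'   },
    { title => 'garden hose',                   category => 'garden'          },
);

my @best = top_items(\@items, \%top_words, \%categories, 2);
printf "%d\t%s\n", $_->[0], $_->[1]{title} for @best;
```

Assuming titles of about five words, 5,000 items would mean roughly 25,000 word lookups plus 5,000 category lookups per user. Only items with a non-zero score are kept, so the temporary list is usually smaller than the whole site.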

So I'm looking for improvements, alternatives... most any suggestions whatsoever.

Added <readmore> at author's request - dvergin 2003-06-27