in reply to Re: Refining a 'vector space search'. in thread Refining a 'vector space search'.
This was after stemming, but it was using the default set of words. Either way, with so many documents (and all of them being written by a different person  these are auction descriptions) the variety and number of truly unique and nonstopped words is going to be really high.
Can you (or anyone) confirm for me that my understanding is correct, in that you need the entire vector for each document to compare against your search query's vector when you cosine them? That is, you couldn't perform some function ahead of time on your vector to reduce the size of it (yet still have the same representation) and have it be compared properly to the search query vectors? Since the two are so interrelated, I don't know that there could be a way to do that since each search is unique and performing some shrinking computation on it ahead of time would destroy the ability to gather results against each different set of query words...
Like I say, I was just sort of hoping against hope here. I need a viable solution for searching on my site because I have hundreds of thousands of old (closed) auctions, thousands of existing (open) ones and then hundreds of thousands of posts in the forums. So as you can see, simple SQL queries are really not cutting it any longer  both in speed and accuracy (not to mention flexibility). I'm not even sure that I could properly use a reverse index method on this type of setup. Aaaargh.
Re: Re: Re: Refining a 'vector space search'.
by gjb (Vicar) on Jun 28, 2003 at 17:35 UTC

You can get away with a sparse representation of the vector, i.e. you can keep just the positions of the elements of the vector that are nonzero. Say the vector looks like:
(0, 1, 0, 0, 0, 1, 0, 0, 1, 0)
than you can encode it by a list containing just:
( 1, 5, 8 )
This can be a huge space saver if the vector is really large. It is very easy to calculate the cosine between vectors in that representation: just count the number of common elements in the lists and divide by the square root of product of the length of each of the two lists.
Hope this helps, gjb  [reply] [d/l] [select] 

Is it enough to represent the values in 0/nonzero like that? That is, do we just need to know that "places 4, 11, 145, 519, 1238 are all nonzero"? I mean, in the vector I displayed as the example above, you see that they are broken down into what appears to be cos themselves(?) and everything is either 0 or a point between 0 and 1. So when performing a calculation on your above example, would "places 1,5 and 8 are positive" be just as fruitful as "place 1 is 0.02453, place 5 is 0.42128 and place 8 is 0.242112"?
 [reply] 

This depends on the algorithm. Some information retrieval algorithms just work with boolean values, others keep track of the frequency of a term in a document.
If you want to keep track of the frequencies, you can either store position/frequency pairs or use two lists, one for the position, the other for the frequencies. The former approach is cleaner, the latter should be faster.
Hope this helps, gjb
 [reply] 

