http://www.perlmonks.org?node_id=269898

Seumas has asked for the wisdom of the Perl Monks concerning the following question:

In an earlier thread (Fuzzy matching to user's tastes?), Tilly pointed me to an article on Building a Vector Space Search Engine in Perl. It seems that if I could implement this, it would solve a number of my problems. There are a couple of downsides, however, and I'm hoping that fellow monks can offer solutions to them. I have the mathematical knowledge to grasp the concept, but not enough to expand on it in the way I believe I need.

For reference, here is the source code I'm working with: http://www.perl.com/2003/02/19/examples/VectorSpace.pm.

In short, a set of documents is parsed for unique words, stop-words are removed, the number of appearances of each word in each document is recorded, and we build an index of vectors/documents to evaluate against our search keywords (which are themselves turned into vectors before comparing). The closer two vectors are together, the more likely they are related (or at least of interest).
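To make that concrete, here is a minimal sketch of the idea in plain Perl (hashes instead of PDL piddles, and a made-up toy stop-word list); it only illustrates the model, it is not the article's actual code:

use strict;
use warnings;

# Toy stop-word list; the real module uses a much larger one.
my %stop = map { $_ => 1 } qw(the a an and or of to in is it over);

# Turn a piece of text into a hash of word => count.
sub vectorize {
    my ($text) = @_;
    my %vec;
    for my $word (map { lc } $text =~ /(\w+)/g) {
        $vec{$word}++ unless $stop{$word};
    }
    return \%vec;
}

# Cosine of the angle between two term-frequency vectors.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $dot += $u->{$_} * ($v->{$_} || 0) for keys %$u;
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    return ($nu && $nv) ? $dot / (sqrt($nu) * sqrt($nv)) : 0;
}

my $doc   = vectorize("the quick brown fox jumps over the lazy dog");
my $query = vectorize("brown fox");
printf "similarity: %.3f\n", cosine($doc, $query);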



That's all well and good, but I'm running into a snag because all of this (as written) occurs in memory. So every time the script is run, it's going to parse, analyze and index every document all over again. All of these documents are stored in PostgreSQL; there are approximately 5,000 of them at any one time (currently), and they have a turnover of several hundred per day (meaning old documents expire and new documents are created).

So I wanted to perform as much of the parsing and calculation as I could ahead of time, perhaps a few times per day, then store all of that in a database table consisting of two columns. One column would be a calculated vector, score or cumulative value of some sort, and the other would be the document identifier (so I know which document it refers to).

However, if I understand all of this correctly, each document's result is not one single value but an array of values indicating the frequency of each indexed word (each position in the array represents a word that either is or isn't there). PDL uses piddle() to reduce the size of this, but I'm not sure whether it reduces it to a single string or not (I've gotten different results using Data::Dumper to figure it out).

So, the results that are returned for each document vector look something like this:

[0 0.12031612 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.024063225 0 0 0 0.024063225 0 0 0.024063225 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.024063225 0 0 0 0 0 0 0 0 0 0 0 0 0 0.024063225 0 0 0 0]


And the more unique words you have in all your combined documents, the more places/values you have. In one test, I calculated that there would currently be 22,000+ values in each vector/array (most are zeroes). So if you multiply 22,000 by 5,000 documents, you're dealing with (I think) about 100MB! That's a lot to process every single time a user wants to perform a search, and with thousands of users performing searches each day, this would just kill performance across the board.

So, what I'm hoping... is that someone out here will be able to offer me (in a way a stupid person could comprehend) a way to represent the same information in a far simpler manner. Instead of a database table with a six-digit document identifier in one column and a 22,000-value vector in the other, it would be great if I could have a six-digit document identifier and a ten- or twelve-digit value that I could still compare vectored search queries against without additional processing.

Is this a pipe dream? Am I even explaining myself clearly? *grin*

Thanks for your time.

Replies are listed 'Best First'.
Re: Refining a 'vector space search'.
by dws (Chancellor) on Jun 28, 2003 at 16:56 UTC
    And the more unique words you have in all your combined documents, the more places/values you have. In one test, I calculated that there would currently be 22,000+ values in each vector/array (most are zeroes).

    Was that count before or after stemming? When I played with vector space searching, stemming reduced the number of "words" significantly. But yes, you do end up with big vectors.

      This was after stemming, but it was using the default set of words. Either way, with so many documents (and all of them being written by a different person -- these are auction descriptions) the variety and number of truly unique and non-stopped words is going to be really high.

      Can you (or anyone) confirm that my understanding is correct: you need the entire vector for each document to compare against your search query's vector when you take the cosine? That is, you couldn't perform some function on your vector ahead of time to reduce its size (while still having it represent the same thing) and have it compare properly against search query vectors? Since the two are so interrelated, I don't know that there could be a way to do that, since each search is unique, and performing some shrinking computation ahead of time would destroy the ability to gather results against each different set of query words...

      Like I say, I was just sort of hoping against hope here. I need a viable solution for searching on my site because I have hundreds of thousands of old (closed) auctions, thousands of existing (open) ones and then hundreds of thousands of posts in the forums. So as you can see, simple SQL queries are really not cutting it any longer - both in speed and accuracy (not to mention flexibility). I'm not even sure that I could properly use a reverse index method on this type of setup. Aaaargh.

        You can get away with a sparse representation of the vector, i.e. you can keep just the positions of the elements of the vector that are non-zero. Say the vector looks like:

        (0, 1, 0, 0, 0, 1, 0, 0, 1, 0)
        then you can encode it as a list containing just:
        ( 1, 5, 8 )
        This can be a huge space saver if the vector is really large. It is also very easy to calculate the cosine between vectors in that representation: just count the number of common elements in the two lists and divide by the square root of the product of the lengths of the two lists.
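
        For what it's worth, a minimal sketch of that calculation in Perl (the index lists here are made up for the example):

        use strict;
        use warnings;

        # Each document is stored as the list of positions that are non-zero.
        # For 0/1 vectors the cosine is (size of intersection) / sqrt(|A| * |B|).
        sub sparse_cosine {
            my ($list_a, $list_b) = @_;
            return 0 unless @$list_a && @$list_b;
            my %in_a   = map { $_ => 1 } @$list_a;
            my $common = grep { $in_a{$_} } @$list_b;
            return $common / sqrt(@$list_a * @$list_b);
        }

        # (0, 1, 0, 0, 0, 1, 0, 0, 1, 0) stored as its non-zero positions
        my @doc   = (1, 5, 8);
        my @query = (5, 8, 9);
        printf "cosine: %.3f\n", sparse_cosine(\@doc, \@query);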

        Hope this helps, -gjb-

Re: Refining a 'vector space search'.
by TomDLux (Vicar) on Jun 29, 2003 at 06:34 UTC

    It's been a few years since I've done math, but checking definitions at wolfram.com convinces me the following is correct. Just don't use the information in any life-or-death situation.

    The reason you have floating point values above is that the vector has been normalized. But if you only care about existence, not frequency, both storage and processing speed can be drastically improved.

      Excellent. One nit: Do you mean to take the root of the length of a vector of offsets?

      Update: Clever. Very clever. The length of the array of indexes is the same as the count of non-zero elements in the full vector.

Re: Refining a 'vector space search'.
by Anonymous Monk on Jun 29, 2003 at 07:19 UTC

    there is an idea which bounces around the data-mining community that goes like this: the most important words are repeated only a few times in a document. take an article that talks about something specific, like unit testing in perl, and see how many times the words "unit", "test" and "perl" appear in the text as opposed to the words "code", "module", "method", "error", etc. so this sorta-kinda-maybe justifies the following simplification: don't count the number of times a word appears in a document, just consider whether it's there or not. now we can use a bit vector as opposed to a "real" vector, and sparse bit vectors have a wonderfully small representation: a list of all the indexes that are 1. while i imagine your set of known words will be quite big, odds are the set of words in each document will be small, so this representation will save you a lot of space.

    anyway, you decide what the best in-memory representation is for your data, then i suggest you use that. this is a very specific application and one which, imho, doesn't map well onto the entity-relation view relational databases have of the world. if your application can handle looong start-up times (most can) you could just use Data::Dumper to occasionally "snapshot" the data, or maybe a prevalence-esque solution would be a good idea.

    the question is: what do you want to know about that vector? you're not really interested in the vector's position in some n-dimensional space; all you care about is its distance from other vectors in this space. well, how do you calculate that distance? odds are you'll do what the article says and take the cosine distance, so what info do we need about a document in order to take the cosine distance? we need the magnitude of both vectors and we need the inner product. the magnitude we can pre-calculate (the square root of the sum of the elements squared; we can avoid squaring since all our elements are either one or zero, and while we can't avoid the square root, we only have to do it when we insert a new document, so that's not such a big deal). how does the inner product work? it's sum_{i=0}^{n} a_i * b_i, which in our case (we're using sparse vectors here) is just the size of the intersection of the indexes in a and b, and since those values were inserted in order (got that? keep the list of indexes ordered) this is a quick scan through the list of indexes in a or b (use whichever is shorter).
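
    a quick sketch of that scan, assuming the index lists are already sorted ascending (all the numbers here are made up):

    use strict;
    use warnings;

    # inner product of two sparse 0/1 vectors = size of the intersection
    # of their sorted index lists, found with a single merge-style scan.
    sub intersection_size {
        my ($a_ref, $b_ref) = @_;    # both sorted ascending
        my ($i, $j, $count) = (0, 0, 0);
        while ($i < @$a_ref && $j < @$b_ref) {
            if    ($a_ref->[$i] < $b_ref->[$j]) { $i++ }
            elsif ($a_ref->[$i] > $b_ref->[$j]) { $j++ }
            else                                { $count++; $i++; $j++ }
        }
        return $count;
    }

    my @doc   = (3, 17, 42, 108);             # word ids in the document
    my @query = (17, 99, 108);                # word ids in the query
    my $mag_doc   = sqrt(scalar @doc);        # can be stored at insert time
    my $mag_query = sqrt(scalar @query);
    printf "cosine: %.3f\n",
        intersection_size(\@doc, \@query) / ($mag_doc * $mag_query);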

    now the last thing we'd like to touch on is the fact that for every query we have to check it against each and every other document. you've got two solutions to this: 1) query caching (i'm not going to go into this any further) 2) vector space partitioning. odds are your documents are distributed not uniformly but in various sets of similar documents (clusters). clustering algorithms are still being researched so there's no real "answer", but the idea is this: every now and then (vague definition, i know) you go through the db and partition it into sets of documents which aren't orthogonal between themselves (ie sets of documents with shared words). then when you get a query you can limit your search to only the set of documents which must have a cos distance > 0, as in the sketch below. sorry if this last bit isn't as clear as i'd like.
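
    as a very rough illustration of that last point, a plain inverted index (word id => list of doc ids) already gives you the set of documents that can have a non-zero cosine with the query; anything outside that set never needs to be scored. the data below is invented:

    use strict;
    use warnings;

    # word id => documents containing that word (illustrative data only)
    my %postings = (
        17  => [ 1, 4, 9 ],
        42  => [ 4 ],
        108 => [ 2, 9 ],
    );

    my @query_words = (17, 108);

    my %candidates;
    for my $word (@query_words) {
        $candidates{$_}++ for @{ $postings{$word} || [] };
    }

    # only these documents share a word with the query, so only they
    # can have cos > 0 and need a full similarity calculation
    print "score documents: ",
          join(", ", sort { $a <=> $b } keys %candidates), "\n";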

    hope this helps.

Re: Refining a 'vector space search'.
by chunlou (Curate) on Jun 29, 2003 at 00:58 UTC
    As a naive approach, why don't you try collapsing unique words with high correlation into statistical synonyms?

    If a group of unique words tends to occur together in the same documents often, you could just choose one as the representative in your vector, with the others stored as synonyms for that representative. (And yes, that would make your search a two-step process.)

    As for what value to assign to a representative word, there's no theoretically best way, but the sum of the frequencies of all the correlated words is not the way at all, as it will grossly bias the vector.

    An unsophisticated way to do it is to simply take the average frequency. A "better" way would be to use factor analysis or any dimension-reduction technique from statistics to empirically figure out weights for the different words and come up with a weighted average that way.
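
    To make the idea concrete, here is a crude sketch that flags candidate "statistical synonyms" by how often two words share documents. A simple overlap ratio stands in for a proper correlation or factor analysis, and the data is invented:

    use strict;
    use warnings;

    # document id => list of (stemmed, stop-filtered) words
    my %docs = (
        1 => [ qw(card video nvidia) ],
        2 => [ qw(card video monitor) ],
        3 => [ qw(cpu heatsink) ],
    );

    # word => { doc id => 1 }
    my %seen;
    for my $id (keys %docs) {
        $seen{$_}{$id} = 1 for @{ $docs{$id} };
    }

    my @words = sort keys %seen;
    for my $i (0 .. $#words) {
        for my $j ($i + 1 .. $#words) {
            my ($w1, $w2) = @words[$i, $j];
            my @both  = grep { $seen{$w2}{$_} } keys %{ $seen{$w1} };
            my $union = keys(%{ $seen{$w1} }) + keys(%{ $seen{$w2} }) - @both;
            my $score = @both / $union;
            print "$w1 / $w2 look like statistical synonyms ($score)\n"
                if $score >= 0.9;
        }
    }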
Re: Refining a 'vector space search'.
by choocroot (Friar) on Jun 29, 2003 at 02:42 UTC
    If you cannot use fixed-length vectors because the full set of documents is too big, is unknown (because it's a stream of documents), or can evolve, then you could work with hashes. For each document you store the terms with their frequencies. In effect it's like keeping only the terms with non-zero frequencies. In a database, that could be represented with a "huge" table with document id, term and frequency columns:

    docid | term | freq.

    Then, for the search, you retrieve the document hash from the db and expand it to a vector over all the terms (build the "full vector" with 'SELECT DISTINCT term FROM index_table' with everything set to zero, then place your document's term/freq values in it), so you can compute your cosine ...
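
    A sketch of that layout with DBI (the table, column and database names are made up for the example):

    use strict;
    use warnings;
    use DBI;

    #   CREATE TABLE doc_terms (
    #       docid integer,
    #       term  text,
    #       freq  integer
    #   );

    my $dbh = DBI->connect("dbi:Pg:dbname=auctions", "user", "password",
                           { RaiseError => 1 });

    # pull one document back as a hash of term => frequency
    sub load_doc {
        my ($dbh, $docid) = @_;
        my %vector;
        my $sth = $dbh->prepare(
            "SELECT term, freq FROM doc_terms WHERE docid = ?");
        $sth->execute($docid);
        while (my ($term, $freq) = $sth->fetchrow_array) {
            $vector{$term} = $freq;
        }
        return \%vector;
    }

    my $doc_vector = load_doc($dbh, 42);

    (If you compute the cosine directly on the term => freq hashes, as in the sketches above, you can skip expanding to the full fixed-length vector entirely.)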
Stemming modules
by Bluepixel (Beadle) on Jun 29, 2003 at 19:41 UTC
    Is there a module on CPAN that detects the language a text is written in? I can't find one.

    I have multiple texts which are in German, French, English, Portuguese, etc., and I need to know which stemming language module to apply to each document.
Turning this module into a persistent web app
by davebaker (Pilgrim) on Jun 03, 2004 at 21:39 UTC
    I am so impressed with this nifty module. It is awesome for finding "similar" documents, the way the old Excite for Web Servers search engine did (the "More Like This One" button).

    The unusual aspect of this search technique is that searches become more accurate the larger the query is ... you can input the entire text of a document and the search engine returns a list of documents like it.

    I made a modification to it so that I'd see a document ID in the command-line list of results (in addition to the filename), so that I can input the document ID to, in effect, provide all the terms in that document as the new query ... the result is awesomely accurate.

    I'd love to have a web interface for this module and give it a try on a real site. I guess the first big obstacle is to turn the module into a daemon so that once all the vectors are created they could "hang around" without having to be recreated each time the search engine is used. Has anybody done any work in that regard?
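
    (A minimal sketch of the daemon idea, assuming HTTP::Daemon from the LWP distribution; the VectorSpace calls are commented out because the exact API depends on the module, and the port and query parameter name are made up:)

    use strict;
    use warnings;
    use HTTP::Daemon;
    use HTTP::Response;
    use URI::Escape qw(uri_unescape);
    # use VectorSpace;   # the module from the article

    # Build the index once, at startup, so the vectors "hang around"
    # for the life of the process.
    # my $engine = VectorSpace->new();
    # $engine->build_index(@documents);

    my $daemon = HTTP::Daemon->new(LocalPort => 8080)
        or die "can't listen: $!";
    print "search daemon at ", $daemon->url, "\n";

    while (my $conn = $daemon->accept) {
        while (my $request = $conn->get_request) {
            my ($q) = ($request->uri =~ /q=([^&]*)/);
            $q = uri_unescape($q || '');
            # my @hits = $engine->search($q);   # again, API depends on the module
            my @hits = ("placeholder hit for '$q'");
            $conn->send_response(
                HTTP::Response->new(200, 'OK', undef, join("\n", @hits) . "\n")
            );
        }
        $conn->close;
    }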