PerlMonks  

Re: Search Algorithm

by giorgos (Novice)
on Aug 11, 2000 at 17:15 UTC ( [id://27509]=note )


in reply to Search Algorithm

Hi all,

First of all, what I am about to describe is not a straightforward, copy & paste solution but rather a discussion of information retrieval techniques that could apply to other programming languages as well.

In general, when writing search algorithms, the easiest way to optimize your code is to create an inverted index. An inverted index is a structure that maps each keyword you want to make searchable to a list of the documents it occurs in. Optionally, together with each document, you could store a score for that keyword in that document. This could be, for example, the number of occurrences (frequency) of the keyword in the respective document.

An entry in the inverted index would look like:

keyword = (document1, score1, document2, score2....)
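As a rough illustration of the entry shown above, here is a minimal Python sketch. The names build_index and tokenize, the naive tokenizer, and the sample documents are assumptions made for the example. A dictionary of dictionaries stands in for the flat (document, score) list, and the score is simply the occurrence count.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each keyword to {document_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index

# Hypothetical sample collection.
documents = {
    "doc1": "Perl is a language. Perl is fun.",
    "doc2": "Search engines index documents.",
    "doc3": "A search in Perl.",
}

index = build_index(documents)

# A query is now a dictionary lookup rather than a scan of every document.
print(index["perl"])
print(index["search"])
```

Here "perl" maps to doc1 (two occurrences) and doc3 (one occurrence), so answering a query for it never touches doc2.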

As you can see, the big advantage of this approach is that you don't need to scan the documents every time you make a query. You only need to update the index when some of the documents change. Even better, the indexer could work incrementally and examine only files that have been modified (or created) since the last indexing time.
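One possible way to pick out the files an incremental indexer has to look at is to compare each file's modification time with the time of the last indexing run. The sketch below assumes the documents are plain files on disk. The function name files_to_reindex and the timestamp are made up for the example, and deleted files are not handled.

```python
import os
import tempfile

def files_to_reindex(paths, last_index_time):
    """Return the paths modified after the last indexing run."""
    return [p for p in paths if os.path.getmtime(p) > last_index_time]

# Demonstration with two temporary files and a hypothetical timestamp.
with tempfile.TemporaryDirectory() as tmp:
    old_file = os.path.join(tmp, "old.txt")
    new_file = os.path.join(tmp, "new.txt")
    for path in (old_file, new_file):
        with open(path, "w") as fh:
            fh.write("some text")

    last_index_time = 1_000_000_000  # hypothetical time of the last run
    # Force known modification times: one before the run, one after it.
    os.utime(old_file, (last_index_time - 60, last_index_time - 60))
    os.utime(new_file, (last_index_time + 60, last_index_time + 60))

    changed = files_to_reindex([old_file, new_file], last_index_time)
    print([os.path.basename(p) for p in changed])
```

Only new.txt is reported, so only its entries in the index would need to be rebuilt.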

The other problem that has to be solved is picking a good scoring algorithm that will calculate a score for each keyword in each document that it occurs in. The most commonly used algorithm is due to Gerard Salton, the father of information retrieval (we don't know who the mother is though).

This states that the weight W of a term T in a document D is:

W(T, D) = tf(T, D) * log ( DN / df(T))

where tf(T, D) is the term frequency of T in D, DN is the total number of documents, and df(T) is the number of documents in which T occurs, or as it is called, the document frequency of T.

The quantity

log ( DN / df(T))
is called the inverse document frequency of T, written idf(T).

So we can write:

W(T, D) = tf(T, D) * idf(T)
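The formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: df(T) is taken as the number of documents that contain T (the usual reading of document frequency), the logarithm is the natural one, and the small index and the function names idf and weight are invented for the example.

```python
import math

def idf(term, index, total_docs):
    """Inverse document frequency: log(DN / df(T)).

    df(T) is taken here as the number of documents containing the term.
    """
    return math.log(total_docs / len(index[term]))

def weight(term, doc_id, index, total_docs):
    """W(T, D) = tf(T, D) * idf(T); zero if the term is absent from D."""
    tf = index.get(term, {}).get(doc_id, 0)
    if tf == 0:
        return 0.0
    return tf * idf(term, index, total_docs)

# Hypothetical inverted index: keyword -> {document: term frequency}.
index = {
    "perl": {"doc1": 2, "doc3": 1},
    "search": {"doc2": 1, "doc3": 1},
    "fun": {"doc1": 1},
}
total_docs = 3

print(weight("perl", "doc1", index, total_docs))  # 2 * log(3 / 2)
print(weight("fun", "doc1", index, total_docs))   # 1 * log(3 / 1)
```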

Now of course, there is a reason why all this gives good results. I will not go into detail, but basically what is implied by the above formula is that the weight given to a term with respect to a document is higher if:

  • it occurs many times in that document
  • it doesn't appear that often in other documents in the collection
which in simple words means that this term is distinctive for the document in question.
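To see both effects at work, here is a small Python sketch that ranks documents for a query. Summing the weights of the query terms is an assumption made for the example (a simple combination rule, not something specified above), as are the function name rank, the sample index, and the reading of df(T) as the number of documents containing T.

```python
import math
from collections import defaultdict

def rank(query_terms, index, total_docs):
    """Score each document by summing tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue  # term not indexed, contributes nothing
        idf = math.log(total_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical inverted index: keyword -> {document: term frequency}.
index = {
    "the": {"doc1": 4, "doc2": 9, "doc3": 3},
    "perl": {"doc1": 2, "doc3": 1},
}

results = rank(["the", "perl"], index, total_docs=3)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Because "the" occurs in every document, its idf is log(1) = 0 and it contributes nothing, however often it appears. The ranking is decided entirely by "perl", so doc1 (two occurrences) comes ahead of doc3 (one occurrence), and doc2 ends up last with a score of zero.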

Quite a few variants of the above weighting algorithm exist.

An example of this type of search engine is Perlfect Search, which I wrote. If you are interested, you could take a look at http://perlfect.com/freescripts/search/ and examine the code to see most of the things that I describe above.

With a few modifications it may even work for the problem that was mentioned by tenfourty.

giorgos
