
Re: Search Algorithm

by giorgos (Novice)
 on Aug 11, 2000 at 17:15 UTC

Hi all,

First of all, what I am about to describe is not a straightforward, copy & paste solution but rather a discussion of information retrieval techniques that could apply to other programming languages as well.

In general, when writing search algorithms, the easiest way to optimize your code is to create an inverted index. An inverted index is a structure that maps each keyword you want to make searchable to a list of the documents it occurs in. Optionally, together with each document you could store a corresponding score for that keyword in that document. This could be, for example, the number of occurrences (frequency) of the keyword in the respective document.

An entry in the inverted index would look like:

keyword = (document1, score1, document2, score2....)

As you can see, the big advantage of this approach is that you don't need to scan the documents every time you make a query. You only need to update the index every time some of the documents change. Even better, the indexer could work incrementally and only examine files that have been modified (or created) since the last indexing time.
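To make this concrete, here is a minimal sketch in Perl of building such an index as a hash of hashes (the document names and contents are made up for illustration; this is not Perlfect Search's actual code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical document collection: name => contents.
my %docs = (
    'doc1.txt' => 'the quick brown fox jumps over the lazy dog',
    'doc2.txt' => 'the quick red fox',
);

# The inverted index: keyword => { document => number of occurrences }.
my %index;
while ( my ( $doc, $text ) = each %docs ) {
    for my $word ( split /\W+/, lc $text ) {
        next unless length $word;
        $index{$word}{$doc}++;
    }
}

# Now a query for "fox" only has to look up $index{fox} instead of
# rescanning every document.
```

An incremental indexer would simply skip any file whose modification time is older than the time of the last indexing run.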

The other problem that has to be solved is picking a good scoring algorithm that will calculate a score for each keyword, for each document it occurs in. The most commonly used algorithm is due to Gerard Salton, the father of information retrieval (we don't know who the mother is, though).

It states that the weight W of a term T in a document D is:

W(T, D) = tf(T, D) * log ( DN / df(T))

where tf(T, D) is the term frequency of T in D (the number of times T occurs in D), DN is the total number of documents, and df(T) is the number of documents in which T occurs, also called the document frequency of T.

The quantity

`log ( DN / df(T))`
is called the inverse document frequency of T.

So we can write:

`W(T, D) = tf(T, D) * idf(T)`
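A small Perl sketch of this formula, assuming the term frequencies have already been collected into a hash of hashes like the index above (the data here is illustrative, not from any real collection):

```perl
use strict;
use warnings;

# %tf is term => { document => frequency }, as built by an indexer.
my %tf = (
    fox   => { 'doc1.txt' => 1, 'doc2.txt' => 1 },
    brown => { 'doc1.txt' => 1 },
);
my $dn = 2;    # DN: total number of documents in the collection

# W(T, D) = tf(T, D) * log( DN / df(T) )
sub weight {
    my ( $term, $doc ) = @_;
    return 0 unless exists $tf{$term} && exists $tf{$term}{$doc};
    my $df = scalar keys %{ $tf{$term} };    # df(T): documents containing T
    return $tf{$term}{$doc} * log( $dn / $df );
}

# "fox" occurs in every document, so its idf is log(2/2) = 0 and it
# carries no weight; "brown" is distinctive for doc1.txt and gets log(2).
```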

Now of course, there is a reason why all this gives good results. I will not go into detail, but what the above formula implies is that the weight given to a term with respect to a document is higher if:

• it occurs many times in that document
• it doesn't appear that often in other documents in the collection
In simple words, this means that the term is distinctive for the document in question.

Quite a few variants of the above weighting algorithm exist.
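To actually answer a query with such an index, one simple approach is to sum the precomputed weights of each query term per document and sort the documents by total score. A hypothetical sketch (the weight values here are made-up numbers, not the output of any particular variant):

```perl
use strict;
use warnings;

# Precomputed index of weights: term => { document => W(T, D) }.
my %weight = (
    fox   => { 'doc1.txt' => 0,    'doc2.txt' => 0 },
    brown => { 'doc1.txt' => 0.69 },
    red   => { 'doc2.txt' => 0.69 },
);

# Score each document by summing the weights of the query terms it
# contains, then return the documents ranked best-first.
sub search {
    my @terms = @_;
    my %score;
    for my $term (@terms) {
        next unless exists $weight{$term};
        while ( my ( $doc, $w ) = each %{ $weight{$term} } ) {
            $score{$doc} += $w;
        }
    }
    return sort { $score{$b} <=> $score{$a} } keys %score;
}

my @hits = search( 'red', 'fox' );
# doc2.txt (score 0.69) ranks above doc1.txt (score 0)
```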

An example of such a search engine is Perlfect Search, which I have written. If you are interested, you could take a look at http://perlfect.com/freescripts/search/ and examine the code to see most of the things I describe above.

With a few modifications it may even work for the problem that was mentioned by tenfourty.

giorgos
