Perl: the Markov chain saw PerlMonks

### Comment on

 Need Help??
Hi all,

First of all, what I am about to describe is not a straightforward, copy & paste solution but rather a discussion of information retrieval techniques that could apply to other programming languages as well.

In general when writing search algorithms the easiest way to optimize your code is to a create an inverted index. An inverted index is a structure which maps each keyword that you want to make searchable to a list of documents that it occurs in. Optionally together with each document you could store a corresponding score for that keyword in that document. This could be for example the number of occurences(frequency) of this keyword in the respective document.

An entry in the inverted index would look like:

keyword = (document1, score1, document2, score2....)

As you can see the big advantage of this approach is that you don't need to scan the documents every time you make a query. You only need to update the index every time some of the documents change. Even better the indexer could work incrementally and only examine files that have been modified(or created) since the last indexing time.

The other problem that has to be solved is picking a good scoring algorithm that will calculate a score for each keyword, for each document that it occurs in. The most common algorithm that is used, is due to Gerald Salton, the father of informaation theory(we don't know who the mother is though).

This states that the the weight W, of a term T, in a document D, is:

W(T, D) = tf(T, D) * log ( DN / df(T))

where tf(T, D) is the term frequency of T in D. DN is the total number of documents df(T) is the sum of frequencies of T in every document considered or as it called the document frequency of T.

The quantity

`log ( DN / df(T))`
is called the inverse document frequency of T.

So we can write:

`W(T, D) = tf(T, D) * idf(T)`

Now of course, there is a reason why all this gives good results. I will not go into detail but basically what is implied by the above formula is that the weight given to term in respect to a document is higher if:

• it occurs many times in that document
• it doesn't appear that often in other documents in the collection
which in simple words means that this term is distinctive for the document in question.

Quite a few variants of the above weighting algorithm exist.

An example of such a type of search engine is Perlfect Search which I have written. Perhaps if you are interested you could take a look at http://perlfect.com/freescripts/search/ and examine the code to see most of the things that I describe above.

With a few modifications it may even work for the problem that was mentioned by tenfourty.

giorgos

In reply to Re: Search Algorithm by giorgos
in thread Search Algorithm by tenfourty

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

• Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
• Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
• Read Where should I post X? if you're not absolutely sure you're posting in the right place.
• Please read these before you post! —
• Posts may use any of the Perl Monks Approved HTML tags:
a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
• You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
 For: Use: & & < < > > [ [ ] ]
• Link using PerlMonks shortcuts! What shortcuts can I use for linking?
• See Writeup Formatting Tips and other pages linked from there for more info.

Create A New User
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2018-04-25 21:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
My travels bear the most uncanny semblance to ...

Results (94 votes). Check out past polls.

Notices?