Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??
Hi all,

First of all, what I am about to describe is not a straightforward, copy & paste solution but rather a discussion of information retrieval techniques that could apply to other programming languages as well.

In general when writing search algorithms the easiest way to optimize your code is to a create an inverted index. An inverted index is a structure which maps each keyword that you want to make searchable to a list of documents that it occurs in. Optionally together with each document you could store a corresponding score for that keyword in that document. This could be for example the number of occurences(frequency) of this keyword in the respective document.

An entry in the inverted index would look like:

keyword = (document1, score1, document2, score2....)

As you can see the big advantage of this approach is that you don't need to scan the documents every time you make a query. You only need to update the index every time some of the documents change. Even better the indexer could work incrementally and only examine files that have been modified(or created) since the last indexing time.

The other problem that has to be solved is picking a good scoring algorithm that will calculate a score for each keyword, for each document that it occurs in. The most common algorithm that is used, is due to Gerald Salton, the father of informaation theory(we don't know who the mother is though).

This states that the the weight W, of a term T, in a document D, is:

W(T, D) = tf(T, D) * log ( DN / df(T))

where tf(T, D) is the term frequency of T in D. DN is the total number of documents df(T) is the sum of frequencies of T in every document considered or as it called the document frequency of T.

The quantity

log ( DN / df(T))
is called the inverse document frequency of T.

So we can write:

W(T, D) = tf(T, D) * idf(T)

Now of course, there is a reason why all this gives good results. I will not go into detail but basically what is implied by the above formula is that the weight given to term in respect to a document is higher if:

  • it occurs many times in that document
  • it doesn't appear that often in other documents in the collection
which in simple words means that this term is distinctive for the document in question.

Quite a few variants of the above weighting algorithm exist.

An example of such a type of search engine is Perlfect Search which I have written. Perhaps if you are interested you could take a look at http://perlfect.com/freescripts/search/ and examine the code to see most of the things that I describe above.

With a few modifications it may even work for the problem that was mentioned by tenfourty.

giorgos


In reply to Re: Search Algorithm by giorgos
in thread Search Algorithm by tenfourty

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others contemplating the Monastery: (7)
    As of 2014-08-23 13:51 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The best computer themed movie is:











      Results (174 votes), past polls