Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Hi all,

First of all, what I am about to describe is not a straightforward, copy & paste solution but rather a discussion of information retrieval techniques that could apply to other programming languages as well.

In general when writing search algorithms the easiest way to optimize your code is to a create an inverted index. An inverted index is a structure which maps each keyword that you want to make searchable to a list of documents that it occurs in. Optionally together with each document you could store a corresponding score for that keyword in that document. This could be for example the number of occurences(frequency) of this keyword in the respective document.

An entry in the inverted index would look like:

keyword = (document1, score1, document2, score2....)

As you can see the big advantage of this approach is that you don't need to scan the documents every time you make a query. You only need to update the index every time some of the documents change. Even better the indexer could work incrementally and only examine files that have been modified(or created) since the last indexing time.

The other problem that has to be solved is picking a good scoring algorithm that will calculate a score for each keyword, for each document that it occurs in. The most common algorithm that is used, is due to Gerald Salton, the father of informaation theory(we don't know who the mother is though).

This states that the the weight W, of a term T, in a document D, is:

W(T, D) = tf(T, D) * log ( DN / df(T))

where tf(T, D) is the term frequency of T in D. DN is the total number of documents df(T) is the sum of frequencies of T in every document considered or as it called the document frequency of T.

The quantity

log ( DN / df(T))
is called the inverse document frequency of T.

So we can write:

W(T, D) = tf(T, D) * idf(T)

Now of course, there is a reason why all this gives good results. I will not go into detail but basically what is implied by the above formula is that the weight given to term in respect to a document is higher if:

  • it occurs many times in that document
  • it doesn't appear that often in other documents in the collection
which in simple words means that this term is distinctive for the document in question.

Quite a few variants of the above weighting algorithm exist.

An example of such a type of search engine is Perlfect Search which I have written. Perhaps if you are interested you could take a look at and examine the code to see most of the things that I describe above.

With a few modifications it may even work for the problem that was mentioned by tenfourty.


In reply to Re: Search Algorithm by giorgos
in thread Search Algorithm by tenfourty

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    [choroba]: Email::Address DoS
    [Corion]: (that module has been deprecated by its author already, so that's fair. Although I wonder why the backtracking can't be fixed to handle the formfeeds gracefully)
    [choroba]: not enough tuits?
    [Corion]: choroba: Yeah, maybe. I'm also unaware of who uses Email:: modules, but that's more my limited horizon of things ;)
    [Corion]: Ah - there even is the replacement of Email::Address::XS , by the bug reporter, which hopefully fixes this bug already ;)

    How do I use this? | Other CB clients
    Other Users?
    Others contemplating the Monastery: (9)
    As of 2018-06-20 12:00 GMT
    Find Nodes?
      Voting Booth?
      Should cpanminus be part of the standard Perl release?

      Results (116 votes). Check out past polls.