|Just another Perl shrine|
top-k queriesby hnd (Scribe)
|on Aug 12, 2009 at 19:17 UTC||Need Help??|
the top-k queries....
i've been studying about them for a while now and made a perl code for that.....
so um.... lets get into it....
we have to search a massive database efficiently and provide accurate result
let us assume that there a 'm' number of lists each having some data. let us also assume a set D of 'n' data items.
any list 'i' contains the data items in the form (d,s(d)) where s(d) is the score of data item 'd' in list 'i'. the lists are sorted in descending order of these scores. the data items maybe redundant in the lists (ie the same data item can occur in all the lists), but the score of the redundant may not be the same in the lists.
this collection of sorted lists is called a database.
the way to search the database
there are two ways to search the database:
- the Fagin's Algorithm (FA)
- the Threshold Algorithm (TA)
i implemented the FA in perl :D
the Fagin's Algorithm
the basic steps of the FA are as follows:
- scan all the lists in parallel and maintain a set S that has all the scanned items are contained in it. if there are atleast k data items in the set S then stop the search.
- then for each item in the set S do random access in the database and calculate the score of each data item. make another set lets say S' and insert the data items of set S into S' along with the total score of each data item in descending order
- return the set S'.
a perl implementation of FA (this is just the sub and not the whole program)
the sub is quite simple... you have to (as i have assumed) search parallely in the given three lists until a data member is found such that all the three lists contain that member. the lists are in the form of hash with keys as data and values as their local score.
my next meditation would be that sub as i've not yet completed it (only a few lines of code remaining)
sorry monks but i was really eager to post this... the sub follows in a jiffy...
the Threshold Algorithm
the basic difference between the TA and FA is in their stopping mechanism, it would be clear after reading the algorithm.
- do sorted access in each list. As a data item d is seen under sorted access in some list, do random access to the other lists to find the local score of d in every list, and compute the overall score of d. Maintain in a set S' the k seen data items whose overall scores are the highest among all items seen so far.
- for each list L let s be the last local score seen under sorted access in L. define the threshold to be t=f(s1...sn). if Y invlovevs k data items whose overall scores are higher than or equal to t, then stop doing sorted access to the lists or else go to 1.
- return Y
this algorithm is bit tougher than FA. it stops according to point number 2. i tried it but it went all above my small head.
the scoring function mentioned above calculates the cummilative score of a data item in the final list. this is calculated by adding or doing something else with the localm scores of each data item. google and other search engines use the FA to produce the results. google uses the PageRank algorithm to calculate the page rank of a link and stores it in the database with the link.
the code for the FA... i made this so it requires some checking.... though i've run this but still....
i'am worst at what i do best and for this gift i fell blessed...
i found it hard it's hard to find well whatever