Just another Perl shrine | |
PerlMonks |
Similarity searchingby baxy77bax (Deacon) |
on Jan 25, 2014 at 15:07 UTC ( [id://1072061]=perlquestion: print w/replies, xml ) | Need Help?? |
baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:
Hi This is more of a discussion question then a code seeker. First i'll explain the problem on a small case scenario and then provide additional real size input parameters.Problem: I have a database of histograms like this: each histogram has up to 5 columns. each column corresponds to a value form 1 to 25. Moreover I'll refer to those histograms as subject hist. On the other hand i have a set of query histograms which is approx. 1000 times smaller then the subject set. What i need to do is: for each query histogram find the closest subject histogram. how do i define "closest"? well closest refers to the number of shared data points (# -> data point) so let say i have histogram x that looks like this: then (x,1) => share 7 data points (x,2) => share 9 data points (x,3) => share 1 data point from the above the closest to x is histogram 2. Since the number of subject histograms is to large that i cannot put it into memory (approx 2x10^9 histograms - each in real case scenario containing approx 300 columns and there is a maximum of 8000 possible column labels(values)) i thought i should create a consensus histogram from all subject ones. such that this histogram has all 25 columns (now i am again talking about my example) and each column having the maximum number of data points (this is computed from the subject set- if the max number of data points for column 1 is 100 then this how large column 1 in my consensus hist will be.) now i compare each subject hist to my consensus hist and get some similarity score. i keep those scores in memory (i believe i have enough memory for that) if two or more subject histograms have the same score then i recursively repeat the procedure. once completed, i take each histogram from my query set and compare it to consensus histogram and then calculate the similarity score. from triangular inequality it follows that the closest sequence in the subject (the one that has the score closest to my d(query,cons)) set will not have a score when compared directly to my query hist larger then: d(query,cons)+d(subj,cons). which will allow me to pick a part of the subject set more carefully and therefore find the most similar subject histogram to each of my query histograms. the real number of query histograms is between 100000 and 500000. so this is the best i could think of (read about). does someone know a better way. i believe i could improve my procedure if i could somehow cheat on similarity score computation, because now the problem is that the distance between query and subject is less or equal to dist.(query, consensus) + dist.(subject, consensus). and this is because scoring does not depend on the position. d(q,c) = 5 d(s,c) = 5 d(q,s) = 0 (d = similarity score) also speed is of the highest importance!!!! thank you PS I cannot use any known database engine
Back to
Seekers of Perl Wisdom
|
|