|Welcome to the Monastery|
algorithm for 'best subsets'by halley (Prior)
|on Mar 03, 2005 at 00:44 UTC||Need Help??|
halley has asked for the
wisdom of the Perl Monks concerning the following question:
I have a data structure which is similar to the following. The actual structure is not important, just that we're dealing with a number of "sets" of keywords, some of which are shared between items, and many of which are not.
Each item's keywords are unique. The arrays could just as easily be keys of hashes, but in this case, they just happen to be stored in arrays.
I can pretty easily scan the set of items for those with the most keywords, or scan for the set of keywords which appear in the most items.
What I would like to do is to figure out which pairs of keywords or triples of keywords or n-ples of keywords are shared by the most items. For example, if keyword A is found in ten items, and keyword B is found in twenty items, but only 1 item has both A and B, then it's a weak candidate. If nine items have both A and B, then it's a great candidate.
In Google terms, think of it this way: from the database of websites, what would be the best four-word query to include the most results links? And then the second-best four-word query? And the third-best four-word query? And so on...
Ideas? Does this align with some obscure or obvious algorithm I couldn't recognize?
Update: A test-case-generator function has been added below. Use this data if you'd like to do some benchmarking on your own.
Update 2: A sample output from an early timeline scanner has been added below.