Before I start, please note this is not strictly a Perl question. It is more of a general algorithms questions, and that's why I posted to Meditations.

A friend of mine has come up with a neat algorithm for creating a hierarchy from a messy bag of keywords. I was wondering if this was something novel or not, so I thought I would ask the Monks if anyone recognizes it.

Another reason I'm posting is simply because I like this algorithm. Whether it's new or not, it's simple and intuitive. It creates hierarchies in a way that seems natural and "right." I would like to hear other people's thoughts about it. Maybe there are some things I am overlooking.

The idea is to start with a list of items that have keywords assigned to them. The keywords are not necessarily chosen with a hierarchy in mind, yet by grouping commonly used keywords, a hierarchy can be synthesized. It goes something like this:

for each item in items: for each keyword in item.keywords: frequency{keyword}++ for each item in items: sort keywords by frequency push hierarchy{sorted keywords}, item

Thus, given the following items:

item = 1, keywords = a b c item = 2, keywords = b c item = 3, keywords = d b

The frequencies would be:

a = 1, b = 3, c = 2, d = 1

And the hierarchy would end up looking like:

b/c/2 b/c/a/1 b/d/3

Here's my quick and dirty Perl implementation, so there's something to actually play with:

my @items; push @items, { name => 1, keywords => [ qw( b ) ] }, { name => 2, keywords => [ qw( b c ) ] }, { name => 3, keywords => [ qw( b c ) ] }, { name => 4, keywords => [ qw( b d ) ] }, { name => 5, keywords => [ qw( e ) ] }, { name => 6, keywords => [ qw( e f ) ] }; my %freq; for (@items) { for (@{ $_->{keywords} }) { $freq{$_}++; } } my @nodes; for (@items) { my @keywords = sort { $freq{$b} <=> $freq{$a} } @{ $_->{keywords} }; my $path = join "/", @keywords, $_->{name}; push @nodes, $path; } print "$_\n" for sort @nodes

Update: funny that The Mad Hatter should mention tags. This was originally made to parse the tags from a dump and create a hierarchy of links from it. I used "keywords" instead of "tags" because I thought it would be a more generic term for essentially the same thing. :-)

Replies are listed 'Best First'.
Re: Create hierarchies from list of keywords -- well known algorithm?
by blokhead (Monsignor) on Feb 24, 2005 at 01:19 UTC
    This is a bit like decision tree learning from AI theory. In learning theory, you would like the computer to be able to classify a large set of data. Each data point has some (known) attributes that you can test. You give the computer a small set of example items (a training set), for which you know their attributes and their final classification. In decision tree learning, you want to use this training set to build a "good" decision tree (which you can think of as a flow-chart) for classifying all the data.

    One good way to build decision trees is by applying information theory. Each test of an attribute (i.e, decision point in a flowchart) partitions the dataset in such a way that the uncertainty ("entropy," in information theory terms) is reduced. So when deciding which attribute to test first, you simply pick the test which decreases the entropy the most. Then repeat this process for the two branches of the decision point. All that's left is to work out the pesky math details (which aren't all that bad). If this made no sense, a very nice demo of building a decision tree is at this site.

    So while this isn't *exactly* what you're doing here, I think it's related. It's a more general way to find a good hierarchical decomposition of a dataset's attributes. In the case of decision trees, the goal is to have a concise flowchart to categorize data, but I'm not sure what the goal is in your case. You seem to be analyzing more of an interrelation between the attributes. But perhaps one could apply concepts from decision trees to your situation.


Re: Create hierarchies from list of keywords -- well known algorithm?
by zby (Vicar) on Feb 24, 2005 at 12:36 UTC
    Not all keyword assignments would constitute a hierarchy. For example take just three elements with two keywords (let say 'a' and 'b'). One element with both keywords and two others with a different one of them. Using your algorithm you would get "a/1, a/b/2, /b/3", this is not hierarchy because you can find b directly under root and under 'a'.

    Keywords assignements constitute only a partial order, to get a tree you need the additional property of well orderedness - see Tree Set Theoretic.

    By the way I belive tags are a fad. What is really needed is full text search for bookmarks (or better for whole browser history). More arguments on my wiki: Tags Or Searchable Cache. I am thinking about writing a web site engine for that, this should not be very complicated.

      Not all keyword assignments would constitute a hierarchy.

      Yes, I actually noticed that after posting the Meditation. I ended up with some bookmarks arranged similarly to:

      • Perl
        • Web
          • Programming
            • Foo
            • Bar
      • Web
        • Programming
          • Baz
          • Qux

      I'm not sure what you would call this. After reading the links in your post, I'd be inclined to say it's a hierarchy, but not a well-ordered one. Or maybe a hierarchy is by definition well-ordered, and this is just a crude imitation of one? In any case, it isn't ideal. Perhaps this problem could be solved by counting the frequency of groups rather than just individual tags/keywords?

      By the way I belive tags are a fad. What is really needed is full text search for bookmarks (or better for whole browser history).

      Perhaps for bookmarks, full text search would be a good way to find a specific bookmark a year after you added it. But that doesn't convince me that keywords or tags are a "fad" that will prove useless. They are a nice way of classifying things, and sometimes classification is actually necessary. Consider an archive of photos. You can't really search the "full body" of a photo, so some sort of classification is required.

Re: Create hierarchies from list of keywords -- well known algorithm?
by Roy Johnson (Monsignor) on Feb 24, 2005 at 03:00 UTC
    This is also very reminiscent of Condorcet voting.

    Caution: Contents may have been coded under pressure.
Re: Create hierarchies from list of keywords -- well known algorithm?
by The Mad Hatter (Priest) on Feb 24, 2005 at 00:35 UTC
    This might work well with tags, the latest (and pretty interesting) categorization method in social software.