It would be nice if the CPAN modules were categorised beyond their names and the categories used on http://search.cpan.org/, e.g. with categories such as Testing, Templating systems, XML processing, etc.

As there are so many modules, there will be many categories as well, so the categories should probably also be placed in a hierarchy.

I wonder what would be the best way to approach this?

What do you think?

Re: Module categorization
by davorg (Chancellor) on May 25, 2006 at 12:24 UTC

    We used to have something like that. Well, at least, at a very high level. The list of categories is on the front page of search.cpan.org.

    Oops. You mentioned that already. Sorry, I didn't notice it :-/

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

      Yes I know that, but AFAIK it is hardly used any more. Why?

      Either because there is no need for categorisation any more, or because the way it is done is too difficult for new(?) module authors. It might have been too much work for the few volunteers, or it is just not well known to most of the new authors. Besides, there are fewer than 30 categories. That was good 10 years ago, but not any more, I think.

      When I find a module doing X, I would like to have a link to the other related modules so I can easily compare them.

        Coming from a library science standpoint -- it's fallen out of vogue with the general public. Things like Google and other web search engines show that you can get a reasonable level of findability with minimal human effort through the use of full-text searching.

        Unfortunately, as time is showing, it's very possible for people to game the system by inserting false or misleading terms, and search quality suffers when there are many words that describe the same concept (synonyms), or many concepts described by the same word (homonyms).

        Cataloging has its benefits, and can lead to improved findability in many situations -- but it takes work. And it requires continuous maintenance, as there has to be a group that makes decisions about adding new categories or pruning old ones, and that does the work required to reclassify the existing items as the word list changes.

        So basically -- it's a lot of work. The work grows faster than linearly with both size and time. (I don't know that it's exponential, though; maybe with a small base (1.01^size) or similar.)

        Perhaps this is something that AnnoCPAN can add in, so the work can be distributed across multiple people. (But then you have to make sure that everyone is clear on exactly what the categories are, so there's consistent cataloging.)

        I think it's either unmaintained or so close to unmaintained that there's no point in even asking to be added.
Re: Module categorization
by zby (Vicar) on May 25, 2006 at 14:30 UTC
    How about using tags (i.e. keywords) instead of categories? I have this pet theory that categories -> tags -> search is the sequence of information architectures suited to progressively bigger and more dynamic data pools. Categories are the most restrictive but the fastest (because of good discoverability); search is the most powerful but slower, as you need to think up the query terms; and tags are somewhere in between. By the way, I am developing a unified tag/search engine for bookmarks (a la del.icio.us but more integrated), and I am just thinking about the possibility of using it for other things like emails etc. (hmm, Perl modules?).

      There are some items from the 2006 IA Summit (Information Architecture), up at http://iasummit.org/2006/blog/, that discuss tagging.

      My issue with folksonomies over a controlled vocabulary is attempting to derive useful meaning when you have ambiguity -- for instance, people might classify Mail::Procmail as any one of:

      mail
      (ambiguous ... e-mail or postal mail?)
      email
      e-mail
      (different spellings of the same term ... but is this sending or receiving email? neither, it's filtering, so it's still not specific)
      procmail
      (too specific to be a useful category in most catalogs)
      smtp
      (just wrong ... it filters mail that may have been sent through SMTP, but it has nothing directly to do with SMTP)
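
      A minimal sketch of how the spelling-variant part of this problem could be handled, by folding variants into one canonical tag before counting. It is written in Python purely for illustration; the synonym table and the function names are hypothetical and not part of any existing CPAN tool.

```python
from collections import Counter

# Hypothetical synonym table: spelling variants mapped to one canonical tag.
SYNONYMS = {
    "e-mail": "email",
    "e_mail": "email",
}


def normalize_tag(raw: str) -> str:
    """Lower-case, trim, and fold known spelling variants into one tag."""
    tag = raw.strip().lower()
    return SYNONYMS.get(tag, tag)


def count_tags(raw_tags):
    """Count how often each canonical tag was assigned."""
    return Counter(normalize_tag(t) for t in raw_tags)


# Tags that different people might assign to Mail::Procmail.
counts = count_tags(["mail", "email", "E-mail", " e-mail ", "procmail", "smtp"])
print(counts.most_common())
```

      Folding only addresses the different spellings. The ambiguous "mail" and the simply wrong "smtp" survive normalization, which is the harder part of the problem.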

      That's not to say that there isn't a use for tagging -- it serves as a balance between full-text searching and full cataloging, but it can be perverted by a few (putting in bogus tags, using abnormal terminology). Folksonomies are a relatively new concept in IA, and I think they have great potential, but they are not a perfect solution -- they remove some of the problems of full-text searching (number of false positives) and some of the problems of cataloging (expense), but they don't really bring in the advantages of either.

      I'm waiting to see how people can start data mining folksonomies to derive ontologies:

      E.g., PersonA .. PersonH all use Tag1, Tag2, Tag3 on the same items on which PersonI .. PersonP use Tag4, Tag5; and if PersonI .. PersonP use Tag1 on an item, then Tag1 is never assigned to it by PersonA .. PersonH. Therefore, we can assume that Tag1 is a homonym, with different meanings to the two groups -- and that the intersection of Tag1/Tag2/Tag3 in the first group is similar in meaning to the intersection of Tag4/Tag5 in the second group.
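
      A rough sketch of one way such mining might start. This is a simplified heuristic, not the full inference described above and not an established algorithm: it only checks whether the people using a tag split into groups whose companion tags never overlap. It is in Python for illustration; the function names, the sample data and the item Hypothetical::Postage are all made up.

```python
from collections import defaultdict


def cotag_profiles(assignments, tag):
    """For each person who used `tag`, collect the other tags
    they put on the same items."""
    tags_by_person_item = defaultdict(set)
    for person, item, t in assignments:
        tags_by_person_item[(person, item)].add(t)

    profiles = defaultdict(set)
    for (person, _item), tags in tags_by_person_item.items():
        if tag in tags:
            profiles[person] |= tags - {tag}
    return dict(profiles)


def looks_like_homonym(assignments, tag):
    """True if the users of `tag` fall into two or more groups
    whose companion tags never overlap."""
    groups = []  # list of (people, companion_tags) pairs
    for person, companions in cotag_profiles(assignments, tag).items():
        if not companions:
            continue  # no companion tags, so nothing to compare
        people, tags = {person}, set(companions)
        remaining = []
        for group_people, group_tags in groups:
            if group_tags & companions:
                # Shares a companion tag: merge into the new group.
                people |= group_people
                tags |= group_tags
            else:
                remaining.append((group_people, group_tags))
        groups = remaining + [(people, tags)]
    return len(groups) >= 2


# (person, item, tag) triples; purely illustrative data.
assignments = [
    ("PersonA", "Mail::Procmail", "mail"),
    ("PersonA", "Mail::Procmail", "filter"),
    ("PersonB", "Mail::Procmail", "mail"),
    ("PersonB", "Mail::Procmail", "filter"),
    ("PersonI", "Hypothetical::Postage", "mail"),
    ("PersonI", "Hypothetical::Postage", "postal"),
]

print(looks_like_homonym(assignments, "mail"))
print(looks_like_homonym(assignments, "filter"))
```

      With real data the groups would rarely be this cleanly separated, so a practical version would presumably need some overlap threshold rather than strict disjointness.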

Re: Module categorization
by xdg (Monsignor) on May 25, 2006 at 13:58 UTC
    A database where the community can do the categorisation?

    I'm for this option, preferably wiki-ish to lower the barriers for contribution and change.

    I've thought about this before and I think trying to get to tasks rather than categories -- or hierarchies of tasks -- might be the way to go.

    A good model might be something along the lines of the Perl Cookbook or Advanced Perl Programming (2ed).

    E.g.:

    Q: How do I find all the module dependencies of another module or program?

    A1 -- if you are willing to execute the module or program:
        # discussion of those modules

    A2 -- if you don't want to execute the module or program:
        # discussion of those modules

    I think this would allow for a fuller community discussion of leading candidates and pros/cons of different modules. Part of the problem with search is the sheer number of options for some common tasks and the lack of context for what works well, what other people find useful, what is well-maintained, etc.

    cpanratings, CPANTS, annocpan and bugcounts on RT help somewhat, but those still have to be navigated module-by-module to get any detail. (Thankfully, search.cpan.org does provide some summary of information of those.)

    I've seen some attempts at this scattered around the web. Some examples include:

    It would be nice to see one centralized, sanctioned place to collect these kinds of comparisons.

    So how about it? Anyone else for cookbook.cpan.org?

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Module categorization
by exussum0 (Vicar) on May 25, 2006 at 14:40 UTC
    Be careful with categorizing the entire world. While OOP forces a tree relationship (if you don't multi-inherit), it's not that simple outside of OOP. Using your examples, I can provide easy counterexamples:

    For some reason, someone I know wrote an XML processor that generates and executes tests. I also know from personal experience that XSLT, which is XML, can be used as a templating system.

    I think the CPAN categories are mostly intelligent, with holes. Why switch to another intelligent system with other holes? :)

    Update: Congratulations on being the reason I posted post 500. ^^

Re: Module categorization
by ambrus (Abbot) on May 25, 2006 at 20:15 UTC

    Instead of categorizing, it might also be useful if we could assign search keywords to a module.
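
    A small sketch of what per-module search keywords could enable: an inverted index from keyword to modules, plus a "related modules" lookup ranked by shared keywords. It is in Python for illustration only. The module names are real CPAN modules, but the keywords assigned to them here and the function names are assumptions made up for the example.

```python
from collections import defaultdict


def build_index(keywords_by_module):
    """Invert {module: keywords} into {keyword: set of modules}."""
    index = defaultdict(set)
    for module, keywords in keywords_by_module.items():
        for kw in keywords:
            index[kw.lower()].add(module)
    return index


def related_modules(index, keywords_by_module, module):
    """Rank other modules by how many keywords they share with `module`."""
    shared = defaultdict(int)
    for kw in keywords_by_module[module]:
        for other in index[kw.lower()]:
            if other != module:
                shared[other] += 1
    # Most shared keywords first; ties broken alphabetically.
    return sorted(shared.items(), key=lambda pair: (-pair[1], pair[0]))


# Illustrative keyword assignments (not taken from any real metadata).
keywords_by_module = {
    "Template": ["templating", "html"],
    "HTML::Template": ["templating", "html"],
    "XML::LibXSLT": ["xml", "xslt", "templating"],
    "Test::More": ["testing"],
}

index = build_index(keywords_by_module)
print(sorted(index["templating"]))
print(related_modules(index, keywords_by_module, "Template"))
```

    Because a module can carry any number of keywords, something like XML::LibXSLT can show up under both XML and templating without anyone having to decide which single category it belongs to.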