Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

Making sense of data: Clustering OR A coding challenge

by belg4mit (Prior)
on Apr 03, 2006 at 18:38 UTC ( #541000=perlquestion: print w/ replies, xml ) Need Help??
belg4mit has asked for the wisdom of the Perl Monks concerning the following question:

Often it is helpful or more instructive to examine aggregated or otherwise summarized data en lieu of the raw data set. However, determining the best means of doing so is not always evident, and can strongly influence the outcome. For instance, given the rated maximum occupancies for a bunch of rooms, what would be the best way to divide the range of values into classes? Quantiles (equal number of members in each class)? Nice round or culturally meaningful numbers (12, 25, 50, 75, 100)? There are in fact several algorithmic means of addressing this problem, known as clustering. One of the more common/robust is K-means, also known as Jenks natural breaks (especially amongst cartographers). Outside of select circles K-means seems to be rather unheard of, which is surprising since it is so powerful and general.

For the math monks, a formula and description of the algorithm are available over there. Alas, I'm not able to fully grok the description and have been unable to tackle implementing it in perl *. I've come across a couple Fortran and VB implementations; although neither language is very perl-like, and thusly would not be well suited for translation. Would anyone be interested in taking up the challenge of writing an N-D or 1-D implementation in perl with a simple interface in perl? i.e; accept a reference to/list of the values to classify and the number of desired classes** and spit back the classified values or class-divisions.

happy hacking!

P.S. For an implementation reference see Milligan's. I cannot attest to the quality of the Fortran but the README can provide some interesting insights as well.

P.P.S. I inquired about this in the cb and discussed it with theorbtwo and atcroft, mentioning it in passing today Limbic~Region urged me to post it as a potentially interesting diversion for some.

* There is in fact a wrapper for a C implementation however it lacks documentation, seems to require lots of unusual extras and is oriented towards clustering 2-D data.

** The number of classes can influence the interpretations of the resulting analysis however, at least in 1-D, there are relatively few meaningful values and so it is easy enough to test them by hand for bias. Typical values are 3-6, with many implementations defaulting to 5. There are many reasons for this:

  1. for 2 classes it'd be easier to use the mean
  2. larger numbers of classes are difficult to handle visually. If you insist on 8+ classes you are probably better off with an even gradient of divisions.

In Bob We Trust, All Others Bring Data.

Comment on Making sense of data: Clustering OR A coding challenge
Replies are listed 'Best First'.
Re: Making sense of data: Clustering OR A coding challenge
by kvale (Monsignor) on Apr 03, 2006 at 19:32 UTC
    K-means is not esoteric within the clustering community; in fact, it is the first method people turn to because it is easy to implement. Unfortunately, it is also among the weakest of the adaptive clustering methods. It gets stuck in local minima easily, as this program will show if you run it multiple times:
    #!/usr/bin/perl use warnings; use strict; my $num_clust = 4; # number of clusters my $tol = 0.001; # stopping tolerance # my @data = map {rand} 1..100; my @data = (0.1, 0.15, 0.3, 0.35, 0.5, 0.55, 0.7, 0.75); # initialize by choosing random points the data my @center = @data[ map {rand @data} 1..$num_clust ]; my $diff; do { $diff = 0; # Assign points to nearest center my @cluster; foreach my $point (@data) { my $closest = 0; my $dist = abs $point - $center[ $closest ]; for my $idx (1..$#center) { if (abs $point - $center[ $idx ] < $dist) { $dist = abs $point - $center[ $idx ]; $closest = $idx; } } push @cluster, [$point, $closest]; } # compute new centers foreach my $center_idx (0..$#center) { my @members = grep {$_->[1] == $center_idx} @cluster; my $sum = 0; foreach my $member (@members) { $sum += $member->[0]; } my $new_center = @members ? $sum / @members : $center[ $center +_idx ]; $diff += abs $center[ $center_idx ] - $new_center; $center[ $center_idx ] = $new_center; } } while ($diff > $tol); print "Centers are:\n"; foreach my $center_idx (0..$#center) { print "$center_idx $center[ $center_idx ]\n"; }


Re: Making sense of data: Clustering OR A coding challenge
by srdst13 (Pilgrim) on Apr 04, 2006 at 01:27 UTC
    For most problems like this, I move over to using a programming language with more context-specific tools like R. At least a dozen clustering methods are available and there are facilities for estimating the "true" number of clusters. Using it is fairly simple from perl. Just write a little script that you call from the command line that reads a file of numbers that you supply. The output can be graphical or some numeric summary--whatever suits your needs. You can also use the perl module, Statistics::R to interact without the intermediate running of an R script from the commmand line.

      I am quite interested in this. Can you please offer Perl scripts that does this by way of example. Are the graphs that you get from R really much better than Excel? Is there a web page that demonstrates that graphical capabilities of R?

        "Is there a web page that demonstrates that graphical capabilities of R?"

        For graphical examples using R, check out this website. If there are any questions as to whether R can generate better graphs than Excel, that website should answer them.

Re: Making sense of data: Clustering OR A coding challenge
by codeacrobat (Chaplain) on Apr 04, 2006 at 06:04 UTC
    Something similar to k-means can be archieved with Statistics::Descriptive and its frequency_distribution.
    use Statistics::Descriptive; $stat = Statistics::Descriptive::Full->new(); $stat->add_data( split /,/, <DATA> ); %f = $stat->frequency_distribution(4); $min = 0; for ( sort { $a <=> $b } keys %f ) { printf "[%d\t-%d\t] %d\n", $min, $_, $f{$_}; $min = $_ + 1; } __DATA__ 0, 12, 25, 38, 50, 62, 75, 88, 100 __END__ # prints [0 -25 ] 3 [26 -50 ] 2 [51 -75 ] 2 [76 -100 ] 2
      Interesting, although it seems this method seems to favor outliers. With my sample dataset below it creates more bins with few or no items on the high end of the spectrum whereas nearly everything else continues to be lumped into the first category.
      [0 -296 ] 86 [297 -580 ] 8 [581 -864 ] 4 [865 -1148 ] 1 [1149 -1432 ] 2 [1433 -1716 ] 0 [1717 -2000 ] 1
      vs. kvale's
      [12 -24 ] 33 [42 -76 ] 27 [80 -128 ] 14 [150 -250 ] 9 [280 -460 ] 9 [550 -950 ] 7 [1226 -2000 ] 3

      In Bob We Trust, All Others Bring Data.

        Then you might be interested in the
        notation and add more bins at lower values.
Re: Making sense of data: Clustering OR A coding challenge
by atcroft (Monsignor) on Apr 03, 2006 at 19:34 UTC

    I don't know if this observation is that helpful, but to me it almost seems as if in this process you are trying to construct the smallest k n-dimensional "spheres" that encompass the data. (I say "spheres" loosely-a spheroid or ellipsoid might be more appropriate, upon thinking a little further about it.) When projected into one dimension, this would seem to appear as a range; in 2 dimensions, as a circle; and in 3+ dimensions, as a sphere. The "amorphous blob" effect you mentioned in the CB when I made this observation earlier could be a result of overlaps in the spheres, and choosing one to encompass a particular data point than another.

    As I said, I don't know how helpful the observation may be, but hope it helps.

      See the screen shot here.

      In Bob We Trust, All Others Bring Data.

Re: Making sense of data: Clustering OR A coding challenge
by monkey_boy (Priest) on Apr 04, 2006 at 10:23 UTC
    Have you tried Algorithm::Cluster?, it does k-means clustering, specifically look at the kcluster method.

    This is not a Signature...
      Yes, and as I mentioned in the OP it is totally undocumented, etc. etc.

      In Bob We Trust, All Others Bring Data.

Re: Making sense of data: Clustering OR A coding challenge
by planetscape (Chancellor) on Apr 04, 2006 at 16:29 UTC
      General information (includning ppm installation information) about Algorithm::Cluster can be found here. Reasonably good documentation can be found in this pdf document. Unfortunately, as the OP noted, it does seem to be aimed at 2 variable problems.
      Thanks, that seems to agree with what I gleaned from reading the source. Although it does imply that weight and mask are optional. Alas, it provides no insight into 1-D klustering (I get no errors nor values back from kcluster).

      In Bob We Trust, All Others Bring Data.

Re: Making sense of data: Clustering OR A coding challenge
by jdporter (Canon) on Apr 04, 2006 at 16:45 UTC

    So here's a cluster finder which uses a Genetic Algorithm based approach. I've tested it with trivial data point sets that have very clear (to human eyes) clusters. On these, it finds perfect clustering almost every time. I'd be interested to see how well it works on real-world data.

    The parameters are not well isolated in this code. They are:

    • Maximum number of clusters (the algorithm is free to find fewer)
    • Number of generations to run
    • Number of individuals to kill, clone, and mutate in each generation
    • In a mutation, how many datapoints to re-cluster

    We're building the house of the future together.
      With 5 clusters and the dataset in Re^2: Making sense of data: Clustering OR A coding challenge I get some nasty results like: With 5000 generations, 50 cullings and 20 spawn it yields slightly more reasonable, though still not too helpful results: Given another order of magnitude or more runtime it might reach palatable results ;-)

      In Bob We Trust, All Others Bring Data.

        Indeed, there were a number of other parameters that could be tweaked, and by doing so, I was able to get better results than that.

        However, in the end, it turns out there are some special properties of your problem that allow much simpler and more effective solutions. Namely, the fact that your data points are one-dimensional. (I'm assuming they are.)

        It means, for example, that (1 3),(2 4) is never an optimal clustering. Neither is (1 2),(1 2).

        From these, we can define the following constraints on clusters:

        1. clusters are simply segments within the ordered data set.
        2. all repetitions of a number must be kept together.

        In the following solution, I'm using variance as a measure of the "coherence" or "binding strength" or whatever you want to call it within each cluster. You could use other measures; I wouldn't be surprised if variance doesn't necessarily give the best results. I've seen discussions of clustering that talk about maximizing variance between clusters, in addition to minimizing it within clusters. I have my doubts that that would help in a simple one-dimensional problem like this one.

        We're building the house of the future together.
Re: Making sense of data: Clustering OR A coding challenge
by bart (Canon) on Apr 08, 2006 at 22:45 UTC
    You pointed to a k-Means Clustering tutorial by Kardi Teknomo, which includes a demo program in VB. I had promised you that I'd try to port the algorithm to Perl... I found the VB program so illustrative, that I ported the whole program to Perl/Tk, even though I've never actually used Tk for anything real. As a result, it took me a few days.

    You can find the demo program in CUFP, in k-Means Clustering demo program with Tk.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://541000]
Approved by kutsu
Front-paged by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (2)
As of 2015-08-05 02:56 GMT
Find Nodes?
    Voting Booth?

    The oldest computer book still on my shelves (or on my digital media) is ...

    Results (76 votes), past polls