Often it is more helpful or instructive to examine aggregated or otherwise summarized data in lieu of the raw data set. However, the best way to summarize is not always evident, and the choice can strongly influence the outcome. For instance, given the rated maximum occupancies for a number of rooms, what is the best way to divide the range of values into classes? Quantiles (an equal number of members in each class)? Nice round or culturally meaningful numbers (12, 25, 50, 75, 100)? There are in fact several algorithmic approaches to this problem, collectively known as clustering. One of the more common and robust is K-means; its one-dimensional form is closely related to what cartographers call Jenks natural breaks. Outside of select circles K-means seems to be little known, which is surprising given how powerful and general it is.
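As a baseline for comparison, the quantile option mentioned above (an equal number of members in each class) might be sketched as follows. This is in Python rather than Perl, the function name `quantile_breaks` is hypothetical, and the room capacities are made up for illustration.

```python
def quantile_breaks(values, k):
    """Split the sorted values into k classes of (nearly) equal size.

    Hypothetical helper for illustration; not taken from any library.
    Returns a list of k lists.
    """
    if k < 1 or k > len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    n = len(ordered)
    classes = []
    for i in range(k):
        # Integer arithmetic keeps class sizes within one member of each other.
        start = i * n // k
        stop = (i + 1) * n // k
        classes.append(ordered[start:stop])
    return classes


# Made-up rated occupancies for ten rooms.
occupancies = [4, 6, 8, 10, 12, 20, 25, 30, 150, 200]
print(quantile_breaks(occupancies, 3))
# prints [[4, 6, 8], [10, 12, 20], [25, 30, 150, 200]]
```

Note how the last class lumps a 25-person room together with a 200-person hall. That is the kind of artifact a clustering approach is meant to avoid.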
For the math monks, a formula and description of the algorithm are available over there. Alas, I am not able to fully grok the description and have been unable to tackle implementing it in Perl*. I have come across a couple of Fortran and VB implementations, but neither language is very Perl-like, so they are not well suited for translation. Would anyone be interested in taking up the challenge of writing an N-D or 1-D implementation in Perl with a simple interface? That is, accept a list of (or a reference to) the values to classify and the number of desired classes**, and spit back the classified values or the class divisions.
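To make the requested 1-D interface concrete, here is a rough sketch of how it might behave. It is written in Python rather than Perl, purely as a reference for a port. It assumes the textbook Lloyd iteration (assign each value to its nearest center, then recompute the centers as class means), which is not necessarily what the linked formula or Milligan's Fortran does. The name `kmeans_1d`, its signature, and the seeding rule are all hypothetical.

```python
def kmeans_1d(values, k, max_iter=100):
    """Group 1-D values into k classes using Lloyd's iteration.

    Hypothetical name and signature, for illustration only.
    Returns (classes, centers); classes is a list of k sorted lists.
    """
    ordered = sorted(values)
    n = len(ordered)
    if k < 1 or k > len(set(ordered)):
        raise ValueError("k must be between 1 and the number of distinct values")
    # Assumption: seed the centers from evenly spaced positions in the sorted
    # data. Lloyd's method only finds a local optimum, so seeding matters.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins the class of its nearest center.
        classes = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            classes[nearest].append(v)
        # Update step: move each center to the mean of its class
        # (an empty class keeps its previous center).
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(classes)]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return classes, centers


# Made-up rated occupancies for ten rooms.
occupancies = [4, 6, 8, 10, 12, 20, 25, 30, 150, 200]
classes, centers = kmeans_1d(occupancies, 3)
print(classes)  # prints [[4, 6, 8, 10, 12], [20, 25, 30], [150, 200]]
print(centers)  # prints [8.0, 25.0, 175.0]
```

Because the result depends on the starting centers, a sturdier version would probably try several seedings and keep the one with the smallest within-class sum of squares. With heavily repeated values a class can also come back empty.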
P.S. For an implementation reference see Milligan's. I cannot attest to the quality of the Fortran, but the README offers some interesting insights as well.
* There is in fact a wrapper for a C implementation; however, it lacks documentation, seems to require lots of unusual extras, and is oriented towards clustering 2-D data.
In reply to Making sense of data: Clustering OR A coding challenge by belg4mit