PerlMonks  

Often it is more helpful or instructive to examine aggregated or otherwise summarized data in lieu of the raw data set. However, the best means of doing so is not always evident, and the choice can strongly influence the outcome. For instance, given the rated maximum occupancies for a bunch of rooms, what would be the best way to divide the range of values into classes? Quantiles (equal number of members in each class)? Nice round or culturally meaningful numbers (12, 25, 50, 75, 100)? There are in fact several algorithmic means of addressing this problem, known collectively as clustering. One of the more common and robust is K-means; in one dimension it pursues the same goal as Jenks natural breaks (the name familiar to cartographers), namely classes whose members lie as close as possible to their class mean. Outside of select circles K-means seems to be rather unheard of, which is surprising since it is so powerful and general.
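
For contrast with the clustering approach, here is a minimal sketch of the quantile option mentioned above. The name `quantile_classes` and the sample occupancies are made up for illustration; nothing here comes from an existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: split the values into $k classes holding an
# (as near as possible) equal number of members each.
sub quantile_classes {
    my ($values, $k) = @_;
    my @sorted = sort { $a <=> $b } @$values;
    my @classes;
    for my $c (0 .. $k - 1) {
        # Rank range covered by class $c in the sorted list.
        my $from = int( $c * @sorted / $k );
        my $to   = int( ($c + 1) * @sorted / $k ) - 1;
        push @classes, [ @sorted[ $from .. $to ] ];
    }
    return @classes;
}

# Made-up occupancies: four small rooms and two halls.
my @classes = quantile_classes([ 10, 12, 14, 16, 90, 100 ], 2);
print join(' ', @$_), "\n" for @classes;
```

With two classes the split falls between 14 and 16, so a 16-person room is lumped in with the halls. That is the kind of artifact that motivates looking for natural breaks instead.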

For the math monks, a formula and description of the algorithm are available over there. Alas, I'm not able to fully grok the description and have been unable to tackle implementing it in perl *. I've come across a couple of Fortran and VB implementations, but neither language is very perl-like, so they would not be well suited for translation. Would anyone be interested in taking up the challenge of writing an N-D or 1-D implementation in perl with a simple interface? That is, accept a list of (or a reference to) the values to classify and the number of desired classes**, and spit back the classified values or the class divisions.
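
As a starting point for the 1-D case, the requested interface might look something like the sketch below. It uses the plain iterative K-means heuristic (assign each value to its nearest center, then recompute the centers), not the linked description or Milligan's code. The name `kmeans_1d`, the evenly spaced seeding, and the iteration cap are all assumptions made for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical interface: kmeans_1d(\@values, $k) returns a list of array
# refs, one per non-empty class, ordered from lowest to highest values.
sub kmeans_1d {
    my ($values, $k) = @_;
    my @sorted = sort { $a <=> $b } @$values;
    die "need 1 <= k <= number of values\n" if $k < 1 || $k > @sorted;

    # Seed the centers from evenly spaced ranks so the result is repeatable
    # (an assumption; random seeding is the more usual choice).
    my @centers = map { $sorted[ int( ($_ + 0.5) * @sorted / $k ) ] } 0 .. $k - 1;

    my @assign = (-1) x @sorted;
    for my $pass (1 .. 100) {    # arbitrary cap on the number of passes
        my $changed = 0;

        # Assignment step: each value joins its nearest center.
        for my $i (0 .. $#sorted) {
            my ($best, $best_d) = (0, abs($sorted[$i] - $centers[0]));
            for my $c (1 .. $#centers) {
                my $d = abs($sorted[$i] - $centers[$c]);
                ($best, $best_d) = ($c, $d) if $d < $best_d;
            }
            $changed++ if $assign[$i] != $best;
            $assign[$i] = $best;
        }
        last unless $changed;    # stable assignments: converged

        # Update step: move each center to the mean of its members.
        for my $c (0 .. $#centers) {
            my @members = map { $sorted[$_] } grep { $assign[$_] == $c } 0 .. $#sorted;
            $centers[$c] = sum(@members) / @members if @members;
        }
    }

    # Collect the members of each class, drop empty ones, order low to high.
    my @classes = map { [] } @centers;
    push @{ $classes[ $assign[$_] ] }, $sorted[$_] for 0 .. $#sorted;
    return sort { $a->[0] <=> $b->[0] } grep { @$_ } @classes;
}

# Made-up occupancies, three classes requested.
my @classes = kmeans_1d([ 12, 98, 50, 15, 110, 48, 18, 55, 100 ], 3);
print join(' ', @$_), "\n" for @classes;
```

The class divisions can be read off as the largest member of each class. Note that this heuristic only finds a local optimum that depends on the seeding, so it is not guaranteed to reproduce the breaks an exact method would give.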

happy hacking!

P.S. For an implementation reference see Milligan's. I cannot attest to the quality of the Fortran, but the README offers some interesting insights as well.

P.P.S. I inquired about this in the cb and discussed it with theorbtwo and atcroft. When I mentioned it in passing today, Limbic~Region urged me to post it as a potentially interesting diversion for some.

* There is in fact a wrapper for a C implementation; however, it lacks documentation, seems to require lots of unusual extras, and is oriented towards clustering 2-D data.

** The number of classes can influence the interpretation of the resulting analysis; however, at least in 1-D, there are relatively few meaningful values, so it is easy enough to test them by hand for bias. Typical values are 3-6, with many implementations defaulting to 5. There are a couple of reasons for this:

  1. for 2 classes it'd be easier to use the mean
  2. larger numbers of classes are difficult to handle visually. If you insist on 8+ classes you are probably better off with an even gradient of divisions.

--
In Bob We Trust, All Others Bring Data.


In reply to Making sense of data: Clustering OR A coding challenge by belg4mit
