Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Hi zli034,

It is possible to do cluster analysis in Perl. You could use the functions in the Algorithm::Cluster module that was mentioned before or you could write your own clustering functions using the PDL module that was also mentioned before. However, before getting deep into those modules I recommend you to have a look at the Wikipedia entry on Data Clustering.

You said:

I want to calculate standard deviation and mean for data array, then use the standard deviation and mean to perform cluster analysis. The aim is to group similar arrays in to one cluster.

There are two assumptions in your comment:

  1. your data follow a normal distribution. Are you sure of that? Are you taking into account the presence of outliers. Outliers will affect negatively your computation of the mean. So you have to think about that before using mean and standard deviation as the basis for clustering. There are also other options for reducing the dimensionality of your data such as principal component analysis and independent component analysis. You might want to have a look at those too
  2. you want to have only one or two clusters. Actually, I am not sure if that is what you meant but that is what I understood from your comment. In any case, if your data is highly dimensional, you could try principal component analysis to reduce your data (or a random sample of your data) to two or three dimensions and plot these dimensions to visualize the number of possible clusters your data might have. Then you can use that number in the clustering algorithm you choose (either on the original data or the reduced data). Note that if you do clustering on the reduced data (after doing principal component analysis) the dimensions of your data are no longer equal to the dimensions from your original data (they are a combination of the original dimensions). Therefore, the analysis is somewhat tricky. But we can discuss that in another post. Other options for determining an appropriate number of clusters to look for is using some validity index (here, you would have to do a Google search on "cluster validity index" or get your textbook on cluster analysis).

For us to help you better, it would be good to have a small sample of your data or at least a better description of their structure.

Finally, my recommendation is to try the PDL. You can have a look at the online version of the book to better understand the potential of PDL. By the way, in PDL you can find the mean and standard deviation (together with the median,min, and maximum) with the functions "stats" (over the whole data) or "statsover" (by rows on your matrices). You can find a list of resources related to PDL in the tutorial on The Perl Data Language (PDL): A Quick Reference Guide. And just to give you an idea of how a PDL implementation of a clustering algorithm looks like, below is my implementation of the fuzzy c-means using PDL.

Please, let us know if you have another question

Cheers,

lin0
#!/usr/bin/perl use warnings; use strict; use PDL; # fcm: fuzzy c-means implementation in Perl # usage: $fcm [number_of_clusters] [fuzzification_factor] # [max_iter] [tolerace] # returns: prototypes, partition_matrix # # # reading data # my ( @data, @tmp, $number_of_patterns, $max_row_number, $max_column_nu +mber ); while (defined(my $line = <DATA>)) { chomp ($line); @tmp = split /\s+/, $line; push @data, [ @tmp ]; } $number_of_patterns = @data; my $patterns = pdl(@data); # # assigning other variables # my $number_of_clusters = shift @ARGV; my $fuzzification_factor = shift @ARGV; my $max_iter = shift @ARGV; my $tolerance = shift @ARGV; unless (defined($number_of_clusters)) { $number_of_clusters ||= 2; } unless (defined($fuzzification_factor)) { $fuzzification_factor ||= 2.0; } unless (defined($max_iter)) { $max_iter ||= 40; } unless (defined($tolerance)) { $tolerance ||= 0.00001; } $number_of_clusters = abs($number_of_clusters); $fuzzification_factor = abs($fuzzification_factor); $max_iter = abs($max_iter); $tolerance = abs($tolerance); # # initializing partition matrices # my $previous_partition_matrix; my $current_partition_matrix = initialize_partition_matrix($number_of_clusters, $number_of_patte +rns); # # output variables # my $prototypes; my $performance_index; # # fuzzy c means implementation # $max_row_number = $number_of_patterns - 1; $max_column_number = $number_of_clusters - 1; my $iter = 0; while (1) { # computing each prototype my $temporal_partition_matrix = $current_partition_matrix ** $fuzz +ification_factor; my $temp_prototypes = mv( $temporal_partition_matrix x $patterns, +1,0) / sumover($temporal_partition_matrix); $prototypes = mv($temp_prototypes,1,0); # copying partition matrix $previous_partition_matrix = $current_partition_matrix->copy; # updating the partition matrix my $dist = zeroes $number_of_patterns, $number_of_clusters; for my $i (0..$max_row_number){ for my $j (0..$max_column_number){ my $temp_distance = distance($patterns->slice(":,$i"), $pr +ototypes->slice(":,$j"), \&euclidean ); $dist->set($i, $j, $temp_distance); } } my $temp_variable = $dist ** (-2/($fuzzification_factor - 1)); $current_partition_matrix = $temp_variable / sumover(mv($temp_vari +able,1,0)); # # Performance Index calculation # $temporal_partition_matrix = $current_partition_matrix ** $fuzzifi +cation_factor; $performance_index = sum($temporal_partition_matrix * ( $dist ** 2 + )); # checking stop conditions my $diff_partition_matrix = $current_partition_matrix - $previous_ +partition_matrix; $iter++; if ( ($diff_partition_matrix->max < $tolerance) || ($iter > $max_i +ter) ) { last; } print "iter = $iter\n"; } print "=======================================\n"; print "clustering completed\n"; print "performance index = $performance_index\n"; print "prototypes = \n"; print $prototypes; print "current partition matrix = \n"; print $current_partition_matrix; # ================================ # initialize_partition_matrix # partition_matrix = # initialize_partition_matrix( # num_clusters, num_patterns) # ================================ sub initialize_partition_matrix { my ($partition_matrix, $column_sum); $partition_matrix = random($_[1],$_[0]); $column_sum = sumover (mv($partition_matrix, 1, 0));#sum over column +s $partition_matrix /= $column_sum; return $partition_matrix; } # ==================================== # compute distance between two vectors # dist = distance( vector1, vector2, /&type_of_distance ) # ==================================== sub distance{ my ($vector1, $vector2, $type_of_distance) = @_; my ($r) = $vector1 - $vector2; $type_of_distance->($r); } sub manhattan{ sum(abs($_[0]));} sub euclidean{ sqrt(sum($_[0] ** 2) );} sub tschebyschev{ max(abs($_[0])); } __DATA__ 4.0 4.0 4.0 5.0 5.0 4.0 5.5 6.0 5.0 5.0 4.5 4.5 5.0 5.5 5.5 5.0 5.0 4.5 4.5 5.0 9.5 9.0 9.0 9.5 8.0 8.0 7.0 8.0 8.0 7.0 8.5 7.0 7.0 8.5 7.0 7.0 7.5 7.0 6.5 8.0 8.0 6.5 6.5 7.0 10.0 10.0 10.0 9.0 10.0 9.0 9.5 10.0 8.0 10.0 9.5 9.5 9.0 9.0 9.0 10.0

Update:

fixed copying of partition matrix


In reply to Re: module for cluster analysis by lin0
in thread module for cluster analysis by zli034

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others browsing the Monastery: (6)
    As of 2014-12-25 06:53 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      Is guessing a good strategy for surviving in the IT business?





      Results (159 votes), past polls