http://www.perlmonks.org?node_id=581967

Greetings Fellow Monks,

I wrote a script to perform Fuzzy Clustering with Perl. What clustering does is to search for structure in a dataset. I talked about the script before but I wanted to get some feedback on how to improve the code (in terms of Best Practices).


Updates:

Update 1:

I added a description for the input data. Thanks to zby for pointing that out.

Update 2:

I added a description for the distance function. Thanks to zby for pointing that out.


The algorithm is described in the Wikipedia. In my implementation, the inputs are:

  1. DATA: it is the dataset being clustered. It is expected to be organized in a matrix in which the rows represent patterns and columns represent the features of those patterns. The sample data provided is two dimensional (only two features) but it could certainly have had many more dimensions. Now, you might ask: is there another, more real, data I could cluster? The short answer is yes. You could try the UCI Repository of Machine Learning Databases for different types of datasets
  2. number_of_clusters: it is the number of groups to search for in the dataset
  3. fuzzification_factor: it is a number that affects the shape of the membership functions (which indicate the degree of belongingness of a pattern to a particular cluster) being produced. Typical choices for the fuzzification_factor are in the interval (1.5, 2.5).
  4. max_iter: it is the maximum number of times we would repeat the algorithm in case the convergence is not achieved
  5. tolerance: it is a measure of convergence. If two successive partition matrices (see the description when talking about the outputs) differ by less than the tolerance, then stop.

The resulting outputs are:

  1. prototypes: they are representative patterns for each group
  2. partition_matrix: it is a matrix in which the number of rows is equal to the number_of_clusters and the number of columns is equal to the number of patterns in the dataset. Each element in the matrix represents the degree to which the corresponding pattern belongs to a particular group.

A key element in any clustering method is the distance function. A distance function indicates how far things are. The most common used distance functions are:

For other distance functions, please have a look at the Wikipedia

It is important to note that the type of distance affects the shape of the clusters that are produced. For the Manhattan distance, the clusters will have a diamond shape. For the Euclidean distance, the clusters will be circular. For the Tschebyschev distance, the clusters will be squares. The distance function has also a significant impact in the calculation of the prototypes. For the Euclidean distance, the resulting prototypes can be interpreted as the weighted means of the clusters. In the case of the Manhattan distance, the resulting prototypes can be interpreted as the weighted medians of the clusters.

Note: In this script, we are using the Euclidean distance.

Now, to use the program you can:

  1. Run the code available below without any argument (without arguments it will search for two groups). One of the groups will have a prototype around (8.7, 8.8) while the second one will have a prototype around (5.0, 5.1)
  2. If you run the code providing, for example, a number_of_clusters equal to 3, one of the groups will have a prototype around (9.4, 9.5), a second group will have a prototype around (7.4, 7.4), while the third one will have a prototype around (4.8, 4.8). Please, note that the dataset contains three groups. To verify this, you can plot the dataset using your favourite program. I suggest you use an XY chart. On this chart, you might notice that there are three groups of patterns: one around (5, 5), another around (7.5, 7.5) and the other around (9.5, 9.5).

The source code is available below.

Any comment will be greatly appreciated.

Cheers,

lin0


#!/usr/bin/perl use warnings; use strict; # fcm: fuzzy c-means implementation in Perl # usage: $fcm [number_of_clusters] [fuzzification_factor] # [max_iter] [tolerace] # returns: prototypes, partition_matrix # # # reading data # my ( @data, @tmp, $rows, $columns ); while (defined(my $line = <DATA>)) { chomp ($line); @tmp = split /\s+/, $line; push @data, [ @tmp ]; } $columns = @tmp; $rows = @data; # # assigning other variables # my $number_of_clusters = shift @ARGV; my $fuzzification_factor = shift @ARGV; my $max_iter = shift @ARGV; my $tolerance = shift @ARGV; unless (defined($number_of_clusters)) { $number_of_clusters = 2; } unless (defined($fuzzification_factor)) { $fuzzification_factor = 2.0; } unless (defined($max_iter)) { $max_iter = 4000; } unless (defined($tolerance)) { $tolerance = 0.00001; } # # initializing partition matrices # my @previous_partition_matrix; my @current_partition_matrix = initialize_partition_matrix($number_of_clusters, $rows); # # fuzzy c means implementation # my ($iter, @dist, $sum_numerator, $sum_denominator, @prototypes); $iter = 0; while ( 1 ){ # computing each prototype for (my $i = 0; $i < $number_of_clusters; $i++) { for (my $k = 0; $k < $columns; $k++) { $sum_numerator = 0; $sum_denominator = 0; for (my $j = 0; $j < $rows; $j++) { $sum_numerator += $data[$j][$k] * ($current_partition_matrix[$i][$j] ** $fuzzification_factor); $sum_denominator += $current_partition_matrix[$i][$j] ** $fuzzification_factor; } $prototypes[$i][$k] = $sum_numerator / $sum_denominator; } } # copying partition matrix for (my $i = 0; $i < $number_of_clusters; $i++) { for (my $j = 0; $j < $rows; $j++) { $previous_partition_matrix[$i][$j] = $current_partition_matrix[$i][$j]; } } # updating the partition matrix my ($sum, @pattern_is_prototype); for (my $i = 0; $i < $number_of_clusters; $i++) { for (my $j = 0; $j < $rows; $j++) { $dist[$i][$j] = distance( $prototypes[$i], $data[$j] ); } } for (my $i = 0; $i < $number_of_clusters; $i++) { for (my $j = 0; $j < $rows; $j++) { $sum = 0.0; if ( $dist[$i][$j] == 0 ) { $current_partition_matrix[$i][$j] = 1.0; } else { for (my $l = 0; $l < $number_of_clusters; $l++) { $sum += ( $dist[$i][$j]/$dist[$l][$j] ) ** ( 2.0 / ($fuzzification_factor - 1.0) ) unless ($dist[$l][$j] == 0); } $current_partition_matrix[$i][$j] = 1.0 / $sum unless ($sum == 0); } } } # checking stop conditions last if ( ++$iter == $max_iter ); my ($dif, $max_dif); $max_dif = 0; CLUSTER: for (my $i = 0; $i < $number_of_clusters; $i++){ for (my $j = 0; $j < $rows; $j++){ $dif = abs( $current_partition_matrix[$i][$j] - $previous_partition_matrix[$i][$j] ); $max_dif = $dif if ($dif > $max_dif); last CLUSTER if ( $max_dif > $tolerance ); } } # print "max dif= $max_dif\n"; last if ($max_dif < $tolerance); } # # Performance Index calculation # my $performance_index; for (my $i = 0; $i < $number_of_clusters; $i++){ for (my $j = 0; $j < $rows; $j++){ my $dij = distance( $prototypes[$i], $data[$j] ); $performance_index += ($current_partition_matrix[$i][$j] ** $fuzzification_factor) * $dij; } } print "Clustering completed ...\n"; for (my $i = 0; $i < $number_of_clusters; $i++) { for (my $k = 0; $k < $columns; $k++) { print "Prototype[$i][$k]: $prototypes[$i][$k]\n"; } } print "performance index: $performance_index\n"; print "number of iterations: $iter\n"; for (my $i = 0; $i < $number_of_clusters; $i++) { for (my $j = 0; $j < $rows; $j++){ print "U[$i][$j] = $current_partition_matrix[$i][$j]\n"; } } # ================================ # initialize_partition_matrix # partition_matrix = # initialize_partition_matrix( # num_clusters, num_patterns) # ================================ sub initialize_partition_matrix { srand; my (@partition_matrix, @column_sum); for (my $i = 0; $i < $_[0]; $i++){ for (my $j = 0; $j < $_[1]; $j++){ $partition_matrix[$i][$j] = rand; $column_sum[$j] += $partition_matrix[$i][$j]; } } for (my $i = 0; $i < $_[0]; $i++){ for (my $j = 0; $j < $_[1]; $j++){ die "column [$j] sum is equal to zero\n" unless $column_sum[$j]; $partition_matrix[$i][$j] /= $column_sum[$j]; } } return @partition_matrix; } # ==================================== # compute distance between two vectors # dist = distance( vector1, vector2 ) # ==================================== sub distance { my $vector1 = shift; my $vector2 = shift; my $sum = 0; for ( my $i = 0; $i < scalar @{$vector1}; $i++ ){ my $difference = ${ $vector1 }[$i] - ${ $vector2 }[$i]; $sum += $difference * $difference; } return sqrt( $sum ); } __DATA__ 4.0 4.0 4.0 5.0 5.0 4.0 5.5 6.0 5.0 5.0 4.5 4.5 5.0 5.5 5.5 5.0 5.0 4.5 4.5 5.0 9.5 9.0 9.0 9.5 8.0 8.0 7.0 8.0 8.0 7.0 8.5 7.0 7.0 8.5 7.0 7.0 7.5 7.0 6.5 8.0 8.0 6.5 6.5 7.0 10.0 10.0 10.0 9.0 10.0 9.0 9.5 10.0 8.0 10.0 9.5 9.5 9.0 9.0 9.0 10.0