Greetings Fellow Monks,

I wrote a script to perform Fuzzy Clustering with Perl. What clustering does is to search for structure in a dataset. I talked about the script before but I wanted to get some feedback on how to improve the code (in terms of Best Practices).

Update 1:

I added a description for the input data. Thanks to zby for pointing that out.

Update 2:

I added a description for the distance function. Thanks to zby for pointing that out.

The algorithm is described in the Wikipedia. In my implementation, the inputs are:

1. DATA: it is the dataset being clustered. It is expected to be organized in a matrix in which the rows represent patterns and columns represent the features of those patterns. The sample data provided is two dimensional (only two features) but it could certainly have had many more dimensions. Now, you might ask: is there another, more real, data I could cluster? The short answer is yes. You could try the UCI Repository of Machine Learning Databases for different types of datasets
2. number_of_clusters: it is the number of groups to search for in the dataset
3. fuzzification_factor: it is a number that affects the shape of the membership functions (which indicate the degree of belongingness of a pattern to a particular cluster) being produced. Typical choices for the fuzzification_factor are in the interval (1.5, 2.5).
4. max_iter: it is the maximum number of times we would repeat the algorithm in case the convergence is not achieved
5. tolerance: it is a measure of convergence. If two successive partition matrices (see the description when talking about the outputs) differ by less than the tolerance, then stop.

The resulting outputs are:

1. prototypes: they are representative patterns for each group
2. partition_matrix: it is a matrix in which the number of rows is equal to the number_of_clusters and the number of columns is equal to the number of patterns in the dataset. Each element in the matrix represents the degree to which the corresponding pattern belongs to a particular group.

A key element in any clustering method is the distance function. A distance function indicates how far things are. The most common used distance functions are:

For other distance functions, please have a look at the Wikipedia

It is important to note that the type of distance affects the shape of the clusters that are produced. For the Manhattan distance, the clusters will have a diamond shape. For the Euclidean distance, the clusters will be circular. For the Tschebyschev distance, the clusters will be squares. The distance function has also a significant impact in the calculation of the prototypes. For the Euclidean distance, the resulting prototypes can be interpreted as the weighted means of the clusters. In the case of the Manhattan distance, the resulting prototypes can be interpreted as the weighted medians of the clusters.

Note: In this script, we are using the Euclidean distance.

Now, to use the program you can:

1. Run the code available below without any argument (without arguments it will search for two groups). One of the groups will have a prototype around (8.7, 8.8) while the second one will have a prototype around (5.0, 5.1)
2. If you run the code providing, for example, a number_of_clusters equal to 3, one of the groups will have a prototype around (9.4, 9.5), a second group will have a prototype around (7.4, 7.4), while the third one will have a prototype around (4.8, 4.8). Please, note that the dataset contains three groups. To verify this, you can plot the dataset using your favourite program. I suggest you use an XY chart. On this chart, you might notice that there are three groups of patterns: one around (5, 5), another around (7.5, 7.5) and the other around (9.5, 9.5).

The source code is available below.

Any comment will be greatly appreciated.

Cheers,

```#!/usr/bin/perl
use warnings;
use strict;

# fcm: fuzzy c-means implementation in Perl
# usage: \$fcm [number_of_clusters] [fuzzification_factor]
#        [max_iter] [tolerace]
# returns: prototypes, partition_matrix
#

#
#
my ( @data, @tmp, \$rows, \$columns );

while (defined(my \$line = <DATA>)) {
chomp (\$line);
@tmp = split /\s+/, \$line;
push @data, [ @tmp ];
}
\$columns = @tmp;
\$rows    = @data;

#
# assigning other variables
#
my \$number_of_clusters   = shift @ARGV;
my \$fuzzification_factor = shift @ARGV;
my \$max_iter             = shift @ARGV;
my \$tolerance            = shift @ARGV;

unless (defined(\$number_of_clusters)) {
\$number_of_clusters = 2;
}

unless (defined(\$fuzzification_factor)) {
\$fuzzification_factor = 2.0;
}
unless (defined(\$max_iter)) { \$max_iter = 4000; }
unless (defined(\$tolerance)) { \$tolerance = 0.00001; }

#
# initializing partition matrices
#
my @previous_partition_matrix;
my @current_partition_matrix =
initialize_partition_matrix(\$number_of_clusters, \$rows);

#
# fuzzy c means implementation
#
my (\$iter, @dist, \$sum_numerator, \$sum_denominator, @prototypes);
\$iter = 0;

while ( 1 ){
# computing each prototype
for (my \$i = 0; \$i < \$number_of_clusters; \$i++) {
for (my \$k = 0; \$k < \$columns; \$k++) {
\$sum_numerator = 0;
\$sum_denominator = 0;
for (my \$j = 0; \$j < \$rows; \$j++) {
\$sum_numerator +=
\$data[\$j][\$k] * (\$current_partition_matrix[\$i][\$j] **
\$fuzzification_factor);
\$sum_denominator +=
\$current_partition_matrix[\$i][\$j] ** \$fuzzification_factor;
}
\$prototypes[\$i][\$k] = \$sum_numerator / \$sum_denominator;
}
}
# copying partition matrix
for (my \$i = 0; \$i < \$number_of_clusters; \$i++) {
for (my \$j = 0; \$j < \$rows; \$j++) {
\$previous_partition_matrix[\$i][\$j] =
\$current_partition_matrix[\$i][\$j];
}
}
# updating the partition matrix
my (\$sum, @pattern_is_prototype);
for (my \$i = 0; \$i < \$number_of_clusters; \$i++) {
for (my \$j = 0; \$j < \$rows; \$j++) {
\$dist[\$i][\$j] = distance( \$prototypes[\$i], \$data[\$j] );
}
}
for (my \$i = 0; \$i < \$number_of_clusters; \$i++) {
for (my \$j = 0; \$j < \$rows; \$j++) {
\$sum = 0.0;
if ( \$dist[\$i][\$j] == 0 ) {
\$current_partition_matrix[\$i][\$j] = 1.0;
} else {
for (my \$l = 0; \$l < \$number_of_clusters; \$l++) {
\$sum += ( \$dist[\$i][\$j]/\$dist[\$l][\$j] )
** ( 2.0 / (\$fuzzification_factor - 1.0) )
unless (\$dist[\$l][\$j] == 0);
}
\$current_partition_matrix[\$i][\$j] = 1.0 / \$sum
unless (\$sum == 0);
}
}
}
# checking stop conditions
last if ( ++\$iter == \$max_iter );
my (\$dif, \$max_dif);
\$max_dif = 0;
CLUSTER:    for (my \$i = 0; \$i < \$number_of_clusters; \$i++){

for (my \$j = 0; \$j < \$rows; \$j++){

\$dif = abs( \$current_partition_matrix[\$i][\$j] -
\$previous_partition_matrix[\$i][\$j] );
\$max_dif = \$dif if (\$dif > \$max_dif);
last CLUSTER if ( \$max_dif > \$tolerance );

}

}
#    print "max dif= \$max_dif\n";
last if (\$max_dif < \$tolerance);

}

#

# Performance Index calculation

#
my \$performance_index;

for (my \$i = 0; \$i < \$number_of_clusters; \$i++){

for (my \$j = 0; \$j < \$rows; \$j++){

my \$dij = distance( \$prototypes[\$i], \$data[\$j] );

\$performance_index +=
(\$current_partition_matrix[\$i][\$j] **
\$fuzzification_factor) * \$dij;

}

}

print "Clustering completed ...\n";
for (my \$i = 0; \$i < \$number_of_clusters; \$i++) {
for (my \$k = 0; \$k < \$columns; \$k++) {
print "Prototype[\$i][\$k]: \$prototypes[\$i][\$k]\n";
}
}
print "performance index: \$performance_index\n";
print "number of iterations: \$iter\n";

for (my \$i = 0; \$i < \$number_of_clusters; \$i++) {
for (my \$j = 0; \$j < \$rows; \$j++){
print "U[\$i][\$j] = \$current_partition_matrix[\$i][\$j]\n";
}
}

# ================================
# initialize_partition_matrix
# partition_matrix =
#   initialize_partition_matrix(
#     num_clusters, num_patterns)
# ================================
sub initialize_partition_matrix {
srand;
my (@partition_matrix, @column_sum);

for (my \$i = 0; \$i < \$_[0]; \$i++){
for (my \$j = 0; \$j < \$_[1]; \$j++){
\$partition_matrix[\$i][\$j] = rand;
\$column_sum[\$j] += \$partition_matrix[\$i][\$j];
}
}

for (my \$i = 0; \$i < \$_[0]; \$i++){
for (my \$j = 0; \$j < \$_[1]; \$j++){
die "column [\$j] sum is equal to zero\n"
unless \$column_sum[\$j];
\$partition_matrix[\$i][\$j] /= \$column_sum[\$j];
}
}
return @partition_matrix;
}

# ====================================
# compute distance between two vectors
# dist = distance( vector1, vector2 )
# ====================================
sub distance {
my \$vector1 = shift;
my \$vector2 = shift;

my \$sum = 0;
for ( my \$i = 0; \$i < scalar @{\$vector1}; \$i++ ){
my \$difference = \${ \$vector1 }[\$i] - \${ \$vector2 }[\$i];
\$sum += \$difference * \$difference;
}
return sqrt( \$sum );
}

__DATA__
4.0    4.0
4.0    5.0
5.0    4.0
5.5    6.0
5.0    5.0
4.5    4.5
5.0    5.5
5.5    5.0
5.0    4.5
4.5    5.0
9.5    9.0
9.0    9.5
8.0    8.0
7.0    8.0
8.0    7.0
8.5    7.0
7.0    8.5
7.0    7.0
7.5    7.0
6.5    8.0
8.0    6.5
6.5    7.0
10.0    10.0
10.0    9.0
10.0    9.0
9.5    10.0
8.0    10.0
9.5    9.5
9.0    9.0
9.0    10.0

Replies are listed 'Best First'.
Re: RFC: Fuzzy Clustering with Perl
by jhourcle (Prior) on Nov 03, 2006 at 15:59 UTC

I'm going to guess that this is a straight transfer of a C program to Perl --

First, you'd using a whole lot of for loops for tracking indexes to arrays:

for (my \$i = 0; \$i < \$number_of_clusters; \$i++) { ... }

In Perl, if you're just trying to iterate over a range, you can use the 'foreach' style loop, with the range operator:

for my \$i ( 0 .. \$number_of_clusters-1 ) { ... }

Even if we were doing this in C, for the type of loops you're dealing with (starting at 0, order of operations doesn't matter), I'd still change the code, to reduce the number of comparisons against non-0 values:

for (i = number_of_clusters; i--; ) { ... }

In some cases, we can use the 'foreach' style loops to eliminate the need for index values, but because you're using them to index multiple arrays within the loop, that's not possible in your case.

...

Another change I might make is in how you deal with undefined values -- if the value must be defined, and can't be 0, (eg, \$number_of_clusters), you can use the '||=' operator:

\$number_of_clusters ||= 2;

...

The only other thing is in how it's called -- if it were OO, you could inherit from it, and then replace the 'distance' function (or you could have it accept a coderef in for the distance function, if you didn't want to support inheritance), as some people prefer the manhatten distance when they're dealing with clusters:

```sub distance {
my (\$vec1,\$vec2) = @_;
my \$distance = 0;
for my \$i ( 0 .. (scalar @\$vec1)-1 ) {
\$difference += abs( \$vec1->[\$i] - \$vec2->[\$i] );
}
return \$difference;
}

jhourcle

Thank you very much for your comments. I will work on a new version of the script, following your suggestions, and I will post it some time next week.

"I'm going to guess that this is a straight transfer of a C program to Perl --"

This is a very good guess. In fact, it is true! I have been programming in Perl for three months now and I recognize that I have a long way to go. This is one of the reasons I am asking for comments. So, I can improve my Perl coding skills in the least amount of time possible.

"First, you'd using a whole lot of for loops for tracking indexes to arrays:"

that is true. I am trying to overcome that habit

"In Perl, if you're just trying to iterate over a range, you can use the 'foreach' style loop, with the range operator:
```for my \$i ( 0 .. \$number_of_clusters-1 ) { ... }
"Even if we were doing this in C, for the type of loops you're dealing with (starting at 0, order of operations doesn't matter), I'd still change the code, to reduce the number of comparisons against non-0 values:"
```for (i = number_of_clusters; i--; ) { ... }

I like the two options. However, maybe the first one is easier to understand for someone who is new to Perl. For the second one, you must have clear that in Perl the evaluation of \$i is done first allowing the loop to continue and then the variable is decreased. This might be hard to see for someone new to the language (I had to try it to see what it did)

"Another change I might make is in how you deal with undefined values -- if the value must be defined, and can't be 0, (eg, \$number_of_clusters), you can use the '||=' operator:"
```\$number_of_clusters ||= 2;

thank you for the pointer. Trying the ||= operator made me realize that the \$number_of_cluster cannot be negative either. So maybe I should do

```my \$number_of_clusters   = abs(shift @ARGV);

followed by the line you suggested. Is there another way around that?

"The only other thing is in how it's called -- if it were OO, you could inherit from it, and then replace the 'distance' function (or you could have it accept a coderef in for the distance function, if you didn't want to support inheritance), as some people prefer the manhatten distance when they're dealing with clusters:"

Thanks again.

lin0
Just out of curiosity, why are you implementing this in Perl? If the number of features goes beyond, say 5, for any reasonable dataset, this will be too slow to be of much use. I'd think this is the sort of thing you'd implement in C and then provide Perl bindings for...
Just a nit:
for my \$i ( 0 .. \$number_of_clusters-1 ) { ... }

Should be:
for my \$i ( 0 .. \$#number_of_clusters ) { ... }

Update: Hrm, whups, I read that first line as "0..@number_of_clusters", apparently multiple times. It should not be "0 .. \$#number_of_clusters" since @number_of_clusters doesn't exist. It might be more clearly written "1 .. \$number_of_clusters" though, =]

Hi BUU,

I think that the first option is the correct one because I want the loop to run for \$number_of_clusters times. Because the loop starts in 0, it should go up to \$number_of_clusters-1.

Cheers!

lin0
OT Re: RFC: Fuzzy Clustering with Perl
by zby (Vicar) on Nov 03, 2006 at 09:14 UTC
This is the kind of thing that keeps me tied to Perl. You know - you hear about some theoretical method of solving some kind of problems, you search CPAN and there it is - ready for use.

Hi zby,

Thank you for the comment. I am working on bringing more code I had written in other languages to Perl. I hope they are of use to Fellow Monks.

Cheers!

lin0
Re: RFC: Fuzzy Clustering with Perl
by lin0 (Curate) on Nov 04, 2006 at 20:07 UTC

I just added a description for the distance function. The next step will be to modify the script in such a way that the user can select between different distance functions. This might seem trivial at first but it is not. As I mentioned in the description of the distance function, different distance functions will result in different ways of computing the prototypes. If you have any suggestion on how I could generalize the computation of the prototypes based on the selected distance function, please, drop me a line

Thank you,

lin0
You might want to encapsulate the code under a sub, where one of parameters will be the distance function name. Other parameters could also be named parameter, as well. This is a very widely spread perl technique. For example, this sub would be called then as
```my \$result = cluster(
num_clusters => 5,
data         => [1,2,3,4,5],
distance     => 'procrustes'
);
This way, if the programmer does not specify all parameters, there will be subsituted sensible defaults ( possibly different for each distance ). Also the sub can die if there's not enough or invalid parameters given.

Hi dk

Thanks for the pointer. Now, I am working on a new version of the script. I will have it ready this weekend. I hope you could have a look at it and make some comments

Cheers!

lin0
Re: RFC: Fuzzy Clustering with Perl
by MadraghRua (Vicar) on Nov 07, 2006 at 19:47 UTC
I see that you have gone with a distance measurement. When doing this a while back, we included correlations as well. So you could look at the data either as a function of a measurement (distance) or a statistical (correlation coeficient for parametric and Spearman Rank Correlation for non parpametric analysis). Perhaps you could think about allowing the algorithm to use either metric or statistical methods to increase the module's utility at some point.

If you look on CPAN you'll find Statistics::RankCorrelation. If you would like to calculate your distances outside of your module, there is Math-NumberCruncher. Mind you I could find nothing on Tschebyschev in CPAN though.

yet another biologist hacking perl....

Thank you very much for the comments. I did not know about the Math::NumberCruncher module. I will see how I can use it.

About the part where you said

“I see that you have gone with a distance measurement. When doing this a while back, we included correlations as well. So you could look at the data either as a function of a measurement (distance) or a statistical (correlation coeficient for parametric and Spearman Rank Correlation for non parpametric analysis). Perhaps you could think about allowing the algorithm to use either metric or statistical methods to increase the module's utility at some point.”

Could you please give me some references to your work? So, I could have a look at it and learn from it. I am new to Perl (only three months into this beautiful language) but I am really eager to learn as fast as I can. Specially, I am interested in pointers to the use of Perl in Scientific applications (sort of BioPerl but in other areas too)

Cheers!

lin0
Unfortunately I don't have a refernece to our program - it was an expression analysis software called Xpression by a company called InforMax - now a part of Invitrogen Corp. We built a few software systems to handle biological data for arrays, sequence analysis and pathway analysis. If you're a student or in an academic or govenment lab you can get Vector NTI for free and it has an API that you might find interesting to play with. (Shameless product placement :-). We no longer distribute Xpression but we do use it for internal analysis and development.

So the usual trend in expression analysis softwares is to normalize the data, filter and sort the data based on various criteria and finally analyse these by a variety of techniques, including clustering algorithms, neural networks, population based statistical approaches and so on. What we were doing was very similar to Rosetta or any of the other commercial softwares that still exist. Check them out. You should also have a gander at BioConductor - its a site of alorithms for analyzing different types of biological data and they are all written in R. You might find it amusing to learn that and then port the algorithm to Perl. Or simply learn to send data to a BioConductor app and get it back.

If you're interested in the references for expression analysis, try out Microarray Bioinformatics by Dov Stekel and follow the references in there. Or for a more mathematical approach, try Giovanni Parmigiani et al in The analysis of gene expression data: methods and software by Springer. I would also have a look through the Quantiative Applications in the Social Sciences series by Sage Publications. They have a nice way of taking a mathematical or statistical approach and framing it simply for us biology types and they have a nice little pamphlet on clustering.

For scientific applications in Perl, I would try out Mastering Algorithms with Perl by Orwant et al from O'Reilly or Advanced Perl Programming (first edition not second) by Srinivasan also in O'Reilly. Both those will teah you good programming practices - for instance you could gain efficiencies in your code by passing references to your arrays and dereferencing them elsewhere, rather than swapping your arrays back and forth. Something to think about in future.

Its also good to look at Pavel Pevzner's books - he's a very well respected bioinformaticist and his books explain the process of algorithm development at a very accessible level. I usually buy one of his books for each of my employees at Christmas when he puts out a new one.