snippet
lin0
<p>Many [http://en.wikipedia.org/wiki/Machine_learning|machine learning] and [http://en.wikipedia.org/wiki/Data_Analysis|data analysis] tasks involve calculating [http://en.wikipedia.org/wiki/Distance|distances] between items. The [http://en.wikipedia.org/wiki/Mahalanobis_distance|Mahalanobis distance] is a popular choice because it is [http://en.wikipedia.org/wiki/Scale_invariant|scale invariant]: it takes the variance of each variable, and the correlations between variables, into account.</p>
<p>In this snippet, I show how to compute the [http://en.wikipedia.org/wiki/Mahalanobis_distance|Mahalanobis distance] using the [http://pdl.perl.org/|Perl Data Language]. The function takes two or three piddles (see the definition below). The first piddle is a p-dimensional vector. The second piddle is either another p-dimensional vector or a matrix with N rows of p-dimensional vectors. If the second piddle is a matrix, the distance is computed between the first piddle and the center (the mean of the rows) of the second piddle. The third piddle, which is optional, is the [http://en.wikipedia.org/wiki/Covariance_matrix|covariance matrix] of the distribution from which the other two piddles were drawn. If only two inputs are provided, the covariance is estimated from the second piddle, so in that case the second piddle has to be a matrix. Note: to compute the [http://en.wikipedia.org/wiki/Covariance_matrix|covariance matrix], I use the snippet presented in [id://625532].</p>
<p>What are Piddles?</p>
<p>Piddles are the data structure provided by the [http://pdl.perl.org/|Perl Data Language]. As indicated in [id://598007]:</p>
<blockquote><i>Piddles are numerical arrays stored in column major order (meaning that the fastest varying dimension represent the columns following computational convention rather than the rows as mathematicians prefer). Even though, piddles look like Perl arrays, they are not. Unlike Perl arrays, piddles are stored in consecutive memory locations facilitating the passing of piddles to the C and FORTRAN code that handles the element by element arithmetic. One more thing to note about piddles is that they are referenced with a leading $</i></blockquote>
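<p>To make the dimension ordering concrete, here is a minimal sketch (the variable names are only illustrative) that builds a vector and a matrix with N = 2 rows of p = 3 values, inspects their dimensions, and computes the center of the rows the same way the snippet below does:</p>

```perl
use strict;
use warnings;
use PDL;

# a 1-D piddle: one p-dimensional vector (p = 3)
my $v = pdl( 1, 2, 3 );

# a 2-D piddle: N = 2 rows, each a p-dimensional vector
my $m = pdl( [ [ 1, 2, 3 ], [ 4, 5, 6 ] ] );

# dimension 0 is the fastest varying one (the columns),
# so the matrix reports its dims as (p, N)
print join( 'x', $m->dims ), "\n";    # 3x2

# asking for a dimension beyond the last one gives 1,
# so a single vector behaves like a matrix with one row
print $v->getdim(1), "\n";            # 1

# center of the rows: swap the two dims, then average over dim 0
my $centre = average( $m->xchg( 0, 1 ) );
print $centre, "\n";                  # [2.5 3.5 4.5]
```

<p>This is why a test such as getdim(1) > 1 can be used to tell a single vector from a group of vectors.</p>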
<p>Cheers,</p>
<p>[lin0]</p>
<CODE>
#!/usr/bin/perl
use warnings;
use strict;
use PDL;
# ================================
# mahalanobis:
#
# $distance = mahalanobis( $x, $y, $cov )
#
# computes the Mahalanobis distance from a point
# $x to another point $y (from the same
# distribution) or from a point $x to
# the centre of a group of values $y
#
# $cov is optional: when it is omitted, it is
# estimated from $y with covariance(), which is
# not defined here (see node 625532)
#
# ================================
sub mahalanobis {
    my ( $x, $y, $cov ) = @_;

    # no covariance supplied: estimate it from the group $y
    $cov = covariance( $y ) if @_ < 3;

    # if $y holds several rows, measure from their mean;
    # otherwise measure from the single point $y
    my $diff = $y->getdim(1) > 1
        ? $x - average( $y->xchg(0,1) )
        : $x - $y;
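    # quadratic form diff * inv(cov) * diff^T, which is a 1x1 piddle
    my @dist = list( $diff x inv( $cov ) x transpose( $diff ) );

    # the Mahalanobis distance is the square root of the quadratic form
    return sqrt( $dist[0] );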
}
</CODE>
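<p>As a quick sanity check, here is a minimal sketch of the two-point case with an explicit covariance matrix, so that it does not depend on the covariance() snippet. The data are made up for illustration. It evaluates the same quadratic form that appears in mahalanobis() and takes its square root, which is the Mahalanobis distance as it is usually defined:</p>

```perl
use strict;
use warnings;
use PDL;

# two points that differ only along the first axis (illustrative values)
my $x = pdl( 2, 0 );
my $y = pdl( 0, 0 );

# assumed covariance: variance 4 on the first axis, 1 on the second,
# and no correlation between the two
my $cov = pdl( [ [ 4, 0 ], [ 0, 1 ] ] );

my $diff = $x - $y;

# quadratic form diff * inv(cov) * diff^T, reduced to a Perl scalar
my $quad = ( $diff x inv( $cov ) x transpose( $diff ) )->sclr;

my $mahalanobis = sqrt( $quad );
my $euclidean   = sqrt( sum( $diff**2 ) );

printf "Mahalanobis: %.3f, Euclidean: %.3f\n", $mahalanobis, $euclidean;
```

<p>Working it out by hand, the quadratic form is 2*2/4 = 1, so the Mahalanobis distance is 1 while the Euclidean distance is 2: the first axis has a standard deviation of 2, so a gap of 2 units along it counts as a single standard deviation.</p>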