Computing the Mahalanobis distance with the Perl Data Language

Many machine learning and data analysis tasks involve calculating distances between items. The Mahalanobis distance is a popular choice because it is scale invariant: it accounts for the variance of each variable and for the correlations between them.

In this snippet, I show how to compute the Mahalanobis distance using the Perl Data Language (PDL). The function takes two or three piddles (see "What are Piddles?" below for a definition). The first piddle is a p-dimensional vector. The second piddle is either a single p-dimensional vector (in which case the third input must be provided) or a matrix with N rows of p-dimensional vectors. If the second piddle is a matrix, the distance is computed between the first piddle and the center of the second. The third piddle, which is optional, is the covariance matrix of the distribution from which the other two piddles were drawn; if it is omitted, the covariance is computed from the second piddle. Note: to compute the covariance matrix, I use the snippet presented in Computing Covariance Matrices with PDL.
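
For reference, the textbook definition of the distance can be written as below. The symbols are my own notation rather than anything from the snippet: x is the first piddle, \mu is the second point (or the center of the group), and S is the covariance matrix.

```latex
D_M(x, \mu) = \sqrt{(x - \mu)^{\top} S^{-1} (x - \mu)}
```

When S is the identity matrix, this reduces to the ordinary Euclidean distance.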

What are Piddles?

They are a new data structure defined in the Perl Data Language. As indicated in RFC: Getting Started with PDL (the Perl Data Language):

Piddles are numerical arrays stored in column major order (meaning that the fastest varying dimension represents the columns, following computational convention, rather than the rows as mathematicians prefer). Even though piddles look like Perl arrays, they are not. Unlike Perl arrays, piddles are stored in consecutive memory locations, facilitating the passing of piddles to the C and FORTRAN code that handles the element by element arithmetic. One more thing to note about piddles is that they are referenced with a leading $.

Cheers,

lin0

#!/usr/bin/perl
use warnings;
use strict;
use PDL;

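# ================================
# mahalanobis:
#
#   $distance = mahalanobis( $x, $y, $cov )
#
#   computes the Mahalanobis distance from a point
#   $x to another point $y (from the same
#   distribution) or from a point $x to
#   the centre of a group of values $y
#   (N rows of p-dimensional vectors).
#
#   $cov is optional; if it is omitted, the
#   covariance is computed from $y, which must
#   then be a group of values.
#
# ================================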
sub mahalanobis {
    my ( $x, $y, $cov, $diff );
    if ( @_ < 3 ) {
        ( $x, $y ) = @_;
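        # covariance() is not defined here; it comes from the
        # "Computing Covariance Matrices with PDL" snippet
        $cov = covariance( $y );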
    } else {
        ( $x, $y, $cov ) = @_;
    }
    
    # a $y with more than one row is a group: measure from its centre
    if ( $y->getdim(1) > 1 ) {
        $diff = $x - average( $y->xchg(0,1) );
    } else {
        $diff = $x - $y;
    }
    
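    # quadratic form (x - y) S^-1 (x - y)^T, which PDL returns as a 1x1 piddle
    my @dist = list( $diff x inv( $cov ) x transpose( $diff ) );
    
    # the Mahalanobis distance is the square root of the quadratic form
    return sqrt( $dist[0] );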
}
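
The two-argument form relies on a covariance() helper that is not part of this post. As a rough stand-in, here is a minimal sketch of a sample covariance in PDL. It is my own assumption of what such a helper could look like, not the code from the referenced snippet, and the sample data is made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use PDL;

# Hypothetical stand-in for covariance(); NOT the referenced snippet.
# Expects N rows of p-dimensional observations, i.e. a piddle with
# dims (p, N), and returns the p x p sample covariance matrix.
sub covariance {
    my ($data) = @_;
    my $n        = $data->getdim(1);                # number of observations
    my $mean     = average( $data->xchg( 0, 1 ) );  # per-column mean, dims (p)
    my $centered = $data - $mean;                   # mean is threaded over the rows
    return ( transpose($centered) x $centered ) / ( $n - 1 );
}

# Four made-up 2-dimensional observations, one per row.
my $group = pdl( [ [ 1, 2 ], [ 3, 6 ], [ 5, 4 ], [ 7, 8 ] ] );
my $cov   = covariance($group);
print $cov;
```

For these four rows, both variances come out as 20/3 and the covariance between the two columns as 16/3. With a helper of this shape in the same file, a call such as mahalanobis( $x, $group ) has a covariance matrix to work with; note that the sketch divides by N - 1, which may differ from the convention used in the referenced snippet.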