Computing the Mahalanobis distance with the Perl Data Language

by lin0 (Curate)
on Jul 09, 2007 at 22:03 UTC (snippet #625719)

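Many machine learning and data analysis tasks involve calculating distances between items. The Mahalanobis distance is a popular choice because it is scale invariant: it normalizes by the covariance of the data, so variables measured in different units or with very different variances contribute comparably, and correlations between variables are taken into account.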
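For reference, the usual textbook definition can be written as follows. The symbols are my own notation, not something taken from the post: x is the point of interest, \mu is either the second point or the mean of the group, and \Sigma is the covariance matrix.

```latex
d_M(x, \mu) = \sqrt{(x - \mu)^{\top} \, \Sigma^{-1} \, (x - \mu)}
```

When \Sigma is the identity matrix this reduces to the ordinary Euclidean distance.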

In this snippet, I show how to compute the Mahalanobis distance using the Perl Data Language (PDL). The inputs are two or three piddles (see the note below for a definition). The first piddle is a p-dimensional vector. The second piddle is either another p-dimensional vector or a matrix with N rows of p-dimensional vectors; when it is a matrix, the distance is computed between the first piddle and the center (mean) of the second. The third piddle, which is optional, is the covariance matrix of the distribution from which the other two piddles were drawn. When only two inputs are provided, the covariance is computed from the second piddle, so a single vector as second input only makes sense together with a third input. Note: to compute the covariance matrix, I use the snippet presented in Computing Covariance Matrices with PDL.

What are Piddles?

They are a new data structure defined in the Perl Data Language. As indicated in RFC: Getting Started with PDL (the Perl Data Language):

Piddles are numerical arrays stored in column major order (meaning that the fastest varying dimension represents the columns, following computational convention, rather than the rows as mathematicians prefer). Even though piddles look like Perl arrays, they are not. Unlike Perl arrays, piddles are stored in consecutive memory locations, which facilitates passing them to the C and FORTRAN code that handles the element-by-element arithmetic. One more thing to note about piddles is that they are referenced with a leading $.



use warnings;
use strict;
use PDL;

# ================================
# mahalanobis: 
#   $distance = mahalanobis( $x, $y, $cov )
#   computes the mahalanobis distance from a point
#   $x to another point $y (from the same 
#   distribution) or from a point $x to
#   the centre of a group of values $y
# ================================
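sub mahalanobis {
    my ( $x, $y, $cov, $diff );
    if ( @_ < 3 ) {
        # No covariance given: estimate it from $y.
        # covariance() is not part of PDL; it comes from the snippet
        # "Computing Covariance Matrices with PDL".
        ( $x, $y ) = @_;
        $cov = covariance( $y );
    } else {
        ( $x, $y, $cov ) = @_;
    }
    if ( $y->getdim(1) > 1 ) {
        # $y holds several rows: measure from $x to their mean
        $diff = $x - average( $y->xchg(0,1) );
    } else {
        # $y is a single point
        $diff = $x - $y;
    }
    # The quadratic form below is the squared distance;
    # its square root is the Mahalanobis distance.
    my @dist = list( $diff x inv( $cov ) x transpose( $diff ) );
    return sqrt( $dist[0] );
}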
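
As a sanity check on the matrix expression used in the snippet, here is a minimal sketch that evaluates the same quadratic form by hand. The point, the center, and the diagonal covariance matrix are made-up values chosen so the result is easy to verify; nothing here comes from the original post apart from the `$diff x inv( $cov ) x transpose( $diff )` expression. Note that this expression on its own is the squared distance, and the square root gives the Mahalanobis distance as it is usually defined.

```perl
use strict;
use warnings;
use PDL;

# Illustrative values (assumptions, not from the original post)
my $x   = pdl( [ 6, 4 ] );                # point of interest, 1-D piddle
my $mu  = pdl( [ 0, 0 ] );                # reference point / group center
my $cov = pdl( [ [ 4, 0 ], [ 0, 1 ] ] );  # diagonal covariance: variances 4 and 1

# PDL treats a 1-D piddle as a row vector in matrix products,
# so this is diff * inv(cov) * diff' and yields a 1x1 piddle.
my $diff = $x - $mu;
my $d2   = ( $diff x inv( $cov ) x transpose( $diff ) )->sclr;
my $d    = sqrt( $d2 );

# By hand: 6*6/4 + 4*4/1 = 9 + 16 = 25
printf "squared distance: %.3f\n", $d2;   # prints 25.000
printf "distance:         %.3f\n", $d;    # prints 5.000
```

With a diagonal covariance each coordinate is simply divided by its standard deviation before the Euclidean distance is taken, which is why the first coordinate (variance 4) contributes less than its raw size suggests.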
Replies are listed 'Best First'.
Re: Computing the Mahalanobis distance with the Perl Data Language
by dmorgo (Pilgrim) on Aug 13, 2007 at 09:13 UTC
    Fantastic! Thanks for sharing this!