 No such thing as a small change PerlMonks

### Faster Statistics for Discrete Data

by RhetTbull (Curate)
 on Feb 21, 2002 at 03:17 UTC ( #146691=CUFP: print w/replies, xml ) Need Help??

### Background

I use Statistics::Descriptive a fair bit. It's a great module that's been around a long time and is well tested. However, if you need to use the functionality in Statistics::Descriptive::Full, it stores your entire data set in an array. If you have, oh, 2 million plus data points, then that array gets rather large. On a recent real-world data set that I've been analyzing, I had several sets of data with 2.6 million data points. Statistics::Descriptive crunched the data ok but it took 10 minutes and more than 400MB of memory! Since I needed to analyze a lot of data I wanted to find a better way.

### There's got to be a better way

Although my data set is large, I happen to know an interesting thing about the data: it is satellite telemetry that's discretized (is that a word?) into a 4 bit word. That means there's at most 16 possible values for each data point. All of the statistics I'm interested in can be calculated if I know what values I saw and how many times I saw each value. Aha! Sounds like a job for a hash.

So, instead of storing every data point in an array, I only store the values I've seen and the number of times I've seen them in a hash.

### Implementation

Using the hash idea, I implemented this module (named Statistics::Descriptive::Discretized for now). The data is stored in a hash instead of an array. This works very well if you have a limited number of discrete values in your data set. I've tested this with simulated 16 bit output (meaning 2^16 possible values) and it scales quite well, even with 1 million+ data points with 65,536 possible values. If your input data is not limited to discrete values then this will probably perform worse than the array method used by Statistics::Descriptive.

I've tried to keep the interface as close as possible to the Statistics::Descriptive interface. This is a rough draft and all of the routines in Statistics::Descriptive are not fully implemented yet. (Indeed, any that depend on the original order of the data can't be implemented with this method). For many purposes, this module should be a drop in replacement for Statistics::Descriptive.

### Results

I tested this module (using Statistics::Descriptive as a baseline) against several large real world data sets. Statistics::Descriptive::Discretized scales linearly and blows the socks off of Statistics::Descriptive. (I'm not knocking the excellent Statistics::Descriptive -- it's a great module! I just present an alternative that works better for certain data sets). Here are some results:
 Data points Run Time (sec)Statistics::Descriptive Run Time (sec)Discretized 100000 12 1.5 200000 24 3 300000 35 4 500000 59 7 700000 87 10 1000000 119 14 1500000 215 21 2000000 456 29 2600000 561 40

As you can see, after a million points, Statistics::Descriptive starts to scale somewhat exponentially but the Discretized version stays linear. For the test case with 2.6 million data points, this module is 14 times faster than the baseline (and it uses only a few MB of RAM while Statistics::Descriptive uses more than 400MB for this data set!)

### The Code

Here's a sample program that shows how to use it. If you have any suggestions, critiques, etc. please fire away. If this seems like a useful thing, I'll clean it up for the CPAN.
```#!/usr/bin/perl

use strict;
use warnings;

my \$stats = new StatisticsDescriptiveDiscretized;

#some random test data
my @data = qw(2 7 5 1 13 1 10 6 4 1 4 7 11 6 10 15 0 6 7 8);

print "count = ",\$stats->count(),"\n";
print "uniq  = ",\$stats->uniq(),"\n";
print "sum = ",\$stats->sum(),"\n";
print "min = ",\$stats->min(),"\n";
print "max = ",\$stats->max(),"\n";
print "mean = ",\$stats->mean(),"\n";
print "standard_deviation = ",\$stats->standard_deviation(),"\n";
print "variance = ",\$stats->variance(),"\n";
print "sample_range = ",\$stats->sample_range(),"\n";
print "mode = ",\$stats->mode(),"\n";
print "median = ",\$stats->median(),"\n";

#print out the frequency distribution
print "\nvalue\t\tfrequency\n";
print "-"x20,"\n";
my %histogram = \$stats->frequency_distribution(5);
print "\$_\t\t\$histogram{\$_}\n" foreach (sort {\$a <=> \$b} keys %histogr
+am);

BEGIN {

package StatisticsDescriptiveDiscretized;

### This module draws heavily from Statistics::Descriptive

use strict;
use warnings;
use Carp;
use vars qw(\$VERSION \$REVISION \$AUTOLOAD \$DEBUG %autosubs);

\$VERSION = '0.1';
\$REVISION = '\$Revision: 1.17 \$';
\$DEBUG = 0;

%autosubs = (
count                 => undef,
mean                  => undef,
sum                   => undef,
uniq                  => undef,
mode                  => undef,
median                => undef,
min                   => undef,
max                   => undef,
standard_deviation    => undef,
sample_range          => undef,
variance              => undef,
);

sub new
{
my \$proto = shift;
my \$class = ref(\$proto) || \$proto;
my \$self = {};
\$self->{permitted} = \%autosubs;
\$self->{data} = ();
\$self->{dirty} = 1; #is the data dirty?

bless (\$self,\$class);
print __PACKAGE__,"->new(",join(',',@_),")\n" if \$DEBUG;
return \$self;
}

{
#add data but don't compute ANY statistics yet
my \$self = shift;

#get each element and add 0 to force it be a number
#that way, 0.000 and 0 are treated the same
my \$val = shift;
while (defined \$val)
{
\$val += 0;
\$self->{data}{\$val}++;
#set dirty flag so we know cached stats are invalid
\$self->{dirty}++;
\$val = shift; #get next element
}
}

sub _all_stats
{
#compute all the stats in one sub to save overhead of sub calls
#a little wasteful to do this if all we want is count or sum for e
+xample but
#I want to keep add_data as lean as possible since it gets called
+a lot
my \$self = shift;
print __PACKAGE__,"->_all_stats(",join(',',@_),")\n" if \$DEBUG;

#count = total number of data values we have
my \$count = 0;
\$count += \$_ foreach (values %{\$self->{data}});

#uniq = number of unique data values
my \$uniq = keys %{\$self->{data}};

#initialize min, max, mode to an arbitrary value that's in the has
+h
my \$default = (keys %{\$self->{data}});
my \$max  = \$default;
my \$min  = \$default;
my \$mode = \$default;
my \$moden = 0;
my \$sum = 0;

#find min, max, sum, and mode
foreach (keys %{\$self->{data}})
{
my \$n = \$self->{data}{\$_};
\$sum += \$_ * \$n;
\$min = \$_ if \$_ < \$min;
\$max = \$_ if \$_ > \$max;

#only finds one mode but there could be more than one
#also, there might not be any mode (all the same frequency)
#todo: need to make this more robust
if (\$n > \$moden)
{
\$mode = \$_;
\$moden = \$n;
}
}
my \$mean = \$sum/\$count;

my \$stddev = 0;
my \$variance = 0;

if (\$count > 1)
{
foreach my \$val (keys %{\$self->{data}})
{
#how many times the square of the value
\$stddev += \$self->{data}{\$val} * \$val * \$val;
}
\$variance = (\$stddev - \$count*\$mean*\$mean)/(\$count - 1);
\$stddev = sqrt(\$variance);
}
else {\$stddev = undef}

#find median, and do it without creating a list of the all the dat
+a points
#if n=count is odd and n=2k+1 then median = data(k+1)
#if n=count is even and n=2k, then median = (data(k) + data(k+1))/
+2
my \$odd = \$count % 2; #odd or even number of points?
my \$even = !\$odd;
my \$k = \$odd ? (\$count-1)/2 : \$count/2;
my \$median = undef;
my \$temp = 0;
MEDIAN: foreach my \$val (sort {\$a <=> \$b} (keys %{\$self->{data}}))
{
foreach (1..\$self->{data}{\$val})
{
\$temp++;
if ((\$temp == \$k) && \$even)
{
\$median += \$val;
}
elsif (\$temp == \$k+1)
{
\$median += \$val;
\$median /= 2 if \$even;
last MEDIAN;
}
}
}

\$self->{count}  = \$count;
\$self->{uniq}   = \$uniq;
\$self->{sum}    = \$sum;
\$self->{standard_deviation} = \$stddev;
\$self->{variance} = \$variance;
\$self->{min}    = \$min;
\$self->{max}    = \$max;
\$self->{sample_range} = \$max - \$min;
\$self->{mean}    = \$mean;
\$self->{median} = \$median;
\$self->{mode}   = \$mode;

\$self->{dirty} = 0;
}

sub get_data
{
#returns a list of the data in sorted order
#the list could be very big an this defeat the purpose of using th
+is module
#use this only if you really need it
my \$self = shift;
print __PACKAGE__,"->get_data(",join(',',@_),")\n" if \$DEBUG;

my @data;
foreach my \$val (sort {\$a <=> \$b} (keys %{\$self->{data}}))
{
push @data, \$val foreach (1..\$self->{data}{\$val});
}
return @data;
}

sub frequency_distribution
{
#Compute frequency distribution (histogram), borrowed heavily from
+ Statistics::Descriptive
#Behavior is slightly different than Statistics::Descriptive
#e.g. if partition is not specified, we use uniq to set the number
+ of partitions
#     if partition = 0, then we return the data hash WITHOUT binni
+ng it into equal bins
#Why? because I like it this way -- I often want to just see how m
+any of each value I saw
#Also, you can manually pass in the bin info (min bin, bin size, a
+nd number of partitions)
#I don't cache the frequency data like Statistics::Descriptive doe
+s since it's not as expensive to compute
#but I might add that later
#todo: the minbin/binsize stuff is funky and not intuitive -- fix
+it
my \$self = shift;
print __PACKAGE__,"->frequency_distribution(",join(',',@_),")\n" i
+f \$DEBUG;

my \$partitions = shift; #how many partitions (bins)?
my \$minbin = shift; #upper bound of first bin
my \$binsize = shift; #how wide is each bin?

#if partition == 0, then just give 'em the data hash
if (defined \$partitions && (\$partitions == 0))
{
\$self->{frequency_partitions} = 0;
%{\$self->{frequency}} = %{\$self->{data}};
return %{\$self->{frequency}};
}

#otherwise, partition better be >= 1
return undef unless \$partitions >= 1;

\$self->_all_stats() if \$self->{dirty}; #recompute stats if dirty,
+(so we have count)
return undef if \$self->{count} < 2; #must have at least 2 values

#set up the bins
my (\$interval, \$iter, \$max);
if (defined \$minbin && defined \$binsize)
{
\$iter = \$minbin;
\$max = \$minbin+\$partitions*\$binsize - \$binsize;
\$interval = \$binsize;
\$iter -= \$interval; #so that loop that sets up bins works corr
+ectly
}
else
{
\$iter = \$self->{min};
\$max = \$self->{max};
\$interval = \$self->{sample_range}/\$partitions;
}
my @k;
my %bins;
while ((\$iter += \$interval) < \$max)
{
\$bins{\$iter} = 0;
push @k, \$iter;
}
\$bins{\$max} = 0;
push @k, \$max;

VALUE: foreach my \$val (keys %{\$self->{data}})
{
foreach my \$k (@k)
{
if (\$val <= \$k)
{
\$bins{\$k} += \$self->{data}{\$val};  #how many of this v
+alue do we have?
next VALUE;
}
}
}

%{\$self->{frequency}} = %bins;
\$self->{frequency_partitions} = \$partitions; #in case I add cachin
+g in the future
return %{\$self->{frequency}};
}

my \$self = shift;
my \$type = ref(\$self)
or croak "\$self is not an object";
\$name =~ s/.*://;     ##Strip fully qualified-package portion
return if \$name eq "DESTROY";
unless (exists \$self->{permitted}{\$name} ) {
croak "Can't access `\$name' field in class \$type";
}

#compute stats if necessary
\$self->_all_stats() if \$self->{dirty};
return \$self->{\$name};
}

1;

} #BEGIN

__END__

StatisticsDescriptiveDiscretized -- any ideas for a better name?

use StatisticsDescriptiveDiscretized;

my \$stats = new StatisticsDescriptiveDiscretized;
print "count = ",\$stats->count(),"\n";
print "uniq  = ",\$stats->uniq(),"\n";
print "sum = ",\$stats->sum(),"\n";
print "min = ",\$stats->min(),"\n";
print "max = ",\$stats->max(),"\n";
print "mean = ",\$stats->mean(),"\n";
print "standard_deviation = ",\$stats->standard_deviation(),"\n";
print "variance = ",\$stats->variance(),"\n";
print "sample_range = ",\$stats->sample_range(),"\n";
print "mode = ",\$stats->mode(),"\n";
print "median = ",\$stats->median(),"\n";

This module provides basic functions used in descriptive statistics.
It borrows very heavily from Statistics::Descriptive::Full with one ma
+jor
difference.  This module is optimized for discretized data (anyone
know a better word for that?) e.g. data from an A/D converter that
has a discrete set of possible values.  E.g. if your data is produced
by an 8 bit A/D then you'd have only 256 possible values in your data
set.  Even though you might have a million data points, you'd only hav
+e
256 different values in those million points.  Instead of storing the
entire data set as Statistics::Descriptive does, this module only stor
+es
the values it's seen and the number of times it's seen each value.

For very large data sets, this storage method results in significant s
+peed
and memory improvements.  In a test case with 2.6 million data points
+from
a real world application, StatisticsDescriptiveDiscretized took 40 sec
+onds
to calculate a set of statistics instead of the 561 seconds required b
+y
Statistics::Descriptive::Full.  It also required only 6MB of RAM inste
the 400MB used by Statistics::Descriptive::Full.

This module is incomplete and not fully tested.  It's currently only a
+lpha
code so use at your own risk.

Rhet Turnbull, RhetTbull on perlmonks.org

program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

Portions of this code is from Statistics::Descriptive which is under

program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.

Copyright (c) 1994,1995 Jason Kastner. All rights
reserved.  This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

Replies are listed 'Best First'.
Re: Faster Statistics for Discrete Data
by Dog and Pony (Priest) on Feb 21, 2002 at 17:24 UTC
Indeed, any that depend on the original order of the data can't be implemented with this method

Maybe Tie::DxHash or Tie::IxHash could be of any help with this, in case it is an issue that should be dealt with?

You have moved into a dark place.
It is pitch black. You are likely to be eaten by a grue.
Thanks for the suggestions. Unfortunately, even something like Tie::IxHash would defeat the purpose of this module. If you have to preserve order, you might as well use an array since you'd need to know where every data point came in. I suppose you could do something like run-length encoding if you had long runs of the same value but the hash overhead would probably eat up the savings for all but very limited data sets. Fortunately, there's very few statistical things (at least that I'm aware of) that depend on the order of the data (The least_squares_fit method of Statistics::Descriptive is the only one I can think of off the top of my head). There are some things that require the data to be in sorted order, and for that my method works quite well since all I have to sort is the hash keys not all the values.
Re: Faster Statistics for Discrete Data
by Everlasting God (Beadle) on Feb 21, 2002 at 04:08 UTC
My Perl skills are too rusty to comment on the code, but the proper word for 'discretized' is quantized.

'The fickle fascination of and Everlasting God' - Billy Corgan, The Smashing Pumpkins

Nope, discretized is the usual statistics terminology. Quantized has additional meanings which are not universally appropriate,

-- Anthony Staines
Really? Never heard of discretized before. Learn something every day I guess. Which additional meanings of quantized aren't appropriate here?

'The fickle fascination of and Everlasting God' - Billy Corgan, The Smashing Pumpkins
Re: Faster Statistics for Discrete Data
by rinceWind (Monsignor) on Feb 22, 2002 at 10:24 UTC
Couldn't you just tie the array and use the original package? Something like Tie::MmapArray could be used to map your array to a disk file.
Actually, tying to a file has some drawbacks. Foremost of which is that you have to have a file to tie to. I often pipe in data from another program or read in the data from a database. A tied file approach would require the overhead of writing to disk and reading back. Also, some statistics (such as median) require the data to be in sorted order -- that means you'd have to sort the file in memory or sort and write to disk again which defeats the purpose. The approach I used gets around all of those limitations -- you don't need an intermediate file and you don't have to sort the original data points.
(RhetTbull) Re: Faster Statistics for Discrete Data
by RhetTbull (Curate) on Mar 07, 2002 at 16:44 UTC
Well, after a bit of discussion here and on the module-authors list I settled on Statistics::Descriptive::Discrete. This module is now available on CPAN as Statistics::Descriptive::Discrete. I'd appreciate hearing if you find it useful (or if you find it buggy!)
--Rhet

Create A New User
Node Status?
node history
Node Type: CUFP [id://146691]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (8)
As of 2019-12-11 15:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?

No recent polls found

Notices?