<?xml version="1.0" encoding="windows-1252"?>
<node id="146691" title="Faster Statistics for Discrete Data" created="2002-02-20 22:17:42" updated="2005-07-31 20:40:05">
<type id="1042">
CUFP</type>
<author id="31063">
RhetTbull</author>
<data>
<field name="doctext">
&lt;h3&gt;Background&lt;/h3&gt;
I use [cpan://Statistics::Descriptive] a fair bit.  It's a great module that's 
been around a long time and is well tested.  However, if you need to
use the functionality in Statistics::Descriptive::Full, it stores
your entire data set in an array.  If you have, oh, 2 million plus 
data points, then that array gets rather large.  On a recent real-world
data set that I've been analyzing, I had several sets of data with 
2.6 million data points.  Statistics::Descriptive crunched the data ok
but it took 10 minutes and more than 400MB of memory!  Since I needed to
analyze a lot of data I wanted to find a better way.
&lt;P&gt;
&lt;h3&gt;There's got to be a better way&lt;/h3&gt;
Although my data set is large, I happen to know an interesting thing 
about the data:  it is satellite telemetry that's discretized (is that a word?) into a 
4 bit word.  That means there's at most 16 possible values for each data
point.  All of the statistics I'm interested in can be calculated if I 
know what values I saw and how many times I saw each value.  Aha!  Sounds
like a job for a hash.  
&lt;P&gt;
So, instead of storing every data point in an array, I only store the values 
I've seen and the number of times I've seen them in a hash.
&lt;h3&gt;Implementation&lt;/h3&gt;
Using the hash idea, I implemented this module (named 
Statistics::Descriptive::Discretized for now).  The data is stored in a hash instead of an array.  
This works very well if you have a limited number of discrete values in your
data set.  I've tested this with simulated 16 bit output (meaning 2^16 possible 
values) and it scales quite well, even with 1 million+ data points with 65,536 possible values.  If your input data is not limited to discrete values
then this will probably perform worse than the array method used by 
Statistics::Descriptive.

&lt;P&gt;
I've tried to keep the interface as close as possible to the 
Statistics::Descriptive interface.  This is a rough draft and all of
the routines in Statistics::Descriptive are not fully implemented yet. 
(Indeed, any that depend on the original order of the data can't 
be implemented with this method).  For many purposes, this module should be 
a drop in replacement for Statistics::Descriptive.

&lt;h3&gt;Results&lt;/h3&gt;
I tested this module (using Statistics::Descriptive as a baseline) against several 
large real world data sets.  Statistics::Descriptive::Discretized scales linearly 
and blows the socks off of Statistics::Descriptive.  (I'm not knocking the excellent 
[cpan://Statistics::Descriptive] -- it's a great module!  I just present an 
alternative that works better for certain data sets).  Here are some results:
&lt;BR&gt;
&lt;table&gt;
&lt;tr&gt;&lt;td&gt;Data points&lt;/td&gt;&lt;td&gt;Run Time (sec)&lt;BR&gt;Statistics::Descriptive&lt;/td&gt;&lt;td&gt;Run Time (sec)&lt;BR&gt;Discretized&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;100000&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;td&gt;1.5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;200000&lt;/td&gt;&lt;td&gt;24&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;300000&lt;/td&gt;&lt;td&gt;35&lt;/td&gt;&lt;td&gt;4&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;500000&lt;/td&gt;&lt;td&gt;59&lt;/td&gt;&lt;td&gt;7&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;700000&lt;/td&gt;&lt;td&gt;87&lt;/td&gt;&lt;td&gt;10&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1000000&lt;/td&gt;&lt;td&gt;119&lt;/td&gt;&lt;td&gt;14&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1500000&lt;/td&gt;&lt;td&gt;215&lt;/td&gt;&lt;td&gt;21&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2000000&lt;/td&gt;&lt;td&gt;456&lt;/td&gt;&lt;td&gt;29&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2600000&lt;/td&gt;&lt;td&gt;561&lt;/td&gt;&lt;td&gt;40&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;P&gt;
As you can see, after a million points, Statistics::Descriptive starts to scale somewhat 
exponentially but the Discretized version stays linear.  For the test case with 2.6 million data points, this module is 14 times faster than the baseline (and it uses only a few MB of RAM while Statistics::Descriptive uses more than 400MB for this data set!)

&lt;h3&gt;The Code&lt;/h3&gt;
Here's a sample program that shows how to use it. If you have any suggestions, critiques, etc. 
please fire away.  If this seems like a useful thing, I'll clean it up for the CPAN.
&lt;readmore&gt;
&lt;code&gt;
#!/usr/bin/perl

use strict;
use warnings;

my $stats = new StatisticsDescriptiveDiscretized;

#some random test data
my @data = qw(2 7 5 1 13 1 10 6 4 1 4 7 11 6 10 15 0 6 7 8);

$stats-&gt;add_data(@data);

print "count = ",$stats-&gt;count(),"\n";
print "uniq  = ",$stats-&gt;uniq(),"\n";
print "sum = ",$stats-&gt;sum(),"\n";
print "min = ",$stats-&gt;min(),"\n";
print "max = ",$stats-&gt;max(),"\n";
print "mean = ",$stats-&gt;mean(),"\n";
print "standard_deviation = ",$stats-&gt;standard_deviation(),"\n";
print "variance = ",$stats-&gt;variance(),"\n";
print "sample_range = ",$stats-&gt;sample_range(),"\n";
print "mode = ",$stats-&gt;mode(),"\n";
print "median = ",$stats-&gt;median(),"\n";

#print out the frequency distribution
print "\nvalue\t\tfrequency\n";
print "-"x20,"\n";
my %histogram = $stats-&gt;frequency_distribution(5);
print "$_\t\t$histogram{$_}\n" foreach (sort {$a &lt;=&gt; $b} keys %histogram);


BEGIN {

package StatisticsDescriptiveDiscretized;

### This module draws heavily from Statistics::Descriptive

use strict;
use warnings;
use Carp;
use vars qw($VERSION $REVISION $AUTOLOAD $DEBUG %autosubs);

$VERSION = '0.1';
$REVISION = '$Revision: 1.17 $';
$DEBUG = 0;

#what subs can be autoloaded?
%autosubs = (
  count                 =&gt; undef,
  mean                  =&gt; undef,
  sum                   =&gt; undef,
  uniq                  =&gt; undef,
  mode                  =&gt; undef,
  median                =&gt; undef,
  min                   =&gt; undef,
  max                   =&gt; undef,
  standard_deviation    =&gt; undef,
  sample_range          =&gt; undef,
  variance              =&gt; undef,
);

    
sub new
{
    my $proto = shift;
    my $class = ref($proto) || $proto;
    my $self = {};
    $self-&gt;{permitted} = \%autosubs;
    $self-&gt;{data} = ();
    $self-&gt;{dirty} = 1; #is the data dirty?
    
    bless ($self,$class);
    print __PACKAGE__,"-&gt;new(",join(',',@_),")\n" if $DEBUG;
    return $self;
}

sub add_data
{
    #add data but don't compute ANY statistics yet
    my $self = shift;
    print __PACKAGE__,"-&gt;add_data(",join(',',@_),")\n" if $DEBUG;

    #get each element and add 0 to force it be a number
    #that way, 0.000 and 0 are treated the same
    my $val = shift;
    while (defined $val)
    {
        $val += 0; 
        $self-&gt;{data}{$val}++;
        #set dirty flag so we know cached stats are invalid
        $self-&gt;{dirty}++;
        $val = shift; #get next element
    }
}

sub _all_stats
{
    #compute all the stats in one sub to save overhead of sub calls
    #a little wasteful to do this if all we want is count or sum for example but
    #I want to keep add_data as lean as possible since it gets called a lot
    my $self = shift;
    print __PACKAGE__,"-&gt;_all_stats(",join(',',@_),")\n" if $DEBUG;

    #count = total number of data values we have
    my $count = 0;
    $count += $_ foreach (values %{$self-&gt;{data}});

    #uniq = number of unique data values
    my $uniq = keys %{$self-&gt;{data}};

    #initialize min, max, mode to an arbitrary value that's in the hash
    my $default = (keys %{$self-&gt;{data}})[0];
    my $max  = $default; 
    my $min  = $default;
    my $mode = $default;
    my $moden = 0;
    my $sum = 0;

    #find min, max, sum, and mode
    foreach (keys %{$self-&gt;{data}})
    {
        my $n = $self-&gt;{data}{$_};
        $sum += $_ * $n;
        $min = $_ if $_ &lt; $min;
        $max = $_ if $_ &gt; $max;
    
        #only finds one mode but there could be more than one
        #also, there might not be any mode (all the same frequency)
        #todo: need to make this more robust
        if ($n &gt; $moden)
        {
            $mode = $_;
            $moden = $n;
        }
    }
    my $mean = $sum/$count;
    
    my $stddev = 0;
    my $variance = 0;

    if ($count &gt; 1)
    {
        foreach my $val (keys %{$self-&gt;{data}})
        {
            #how many times the square of the value
            $stddev += $self-&gt;{data}{$val} * $val * $val;
        }
        $variance = ($stddev - $count*$mean*$mean)/($count - 1);
        $stddev = sqrt($variance);
    }
    else {$stddev = undef}
    
    #find median, and do it without creating a list of the all the data points 
    #if n=count is odd and n=2k+1 then median = data(k+1)
    #if n=count is even and n=2k, then median = (data(k) + data(k+1))/2
    my $odd = $count % 2; #odd or even number of points?
    my $even = !$odd;
    my $k = $odd ? ($count-1)/2 : $count/2;
    my $median = undef;
    my $temp = 0;
    MEDIAN: foreach my $val (sort {$a &lt;=&gt; $b} (keys %{$self-&gt;{data}}))
    {
        foreach (1..$self-&gt;{data}{$val})
        {
            $temp++;
            if (($temp == $k) &amp;&amp; $even)
            {
                $median += $val;
            }
            elsif ($temp == $k+1)
            {
                $median += $val;
                $median /= 2 if $even;
                last MEDIAN;
            }
        }
    }
    
    $self-&gt;{count}  = $count;
    $self-&gt;{uniq}   = $uniq;
    $self-&gt;{sum}    = $sum;
    $self-&gt;{standard_deviation} = $stddev;
    $self-&gt;{variance} = $variance;
    $self-&gt;{min}    = $min;
    $self-&gt;{max}    = $max;
    $self-&gt;{sample_range} = $max - $min;
    $self-&gt;{mean}    = $mean;
    $self-&gt;{median} = $median;
    $self-&gt;{mode}   = $mode;

    $self-&gt;{dirty} = 0;
}

sub get_data
{
    #returns a list of the data in sorted order
    #the list could be very big an this defeat the purpose of using this module
    #use this only if you really need it
    my $self = shift;
    print __PACKAGE__,"-&gt;get_data(",join(',',@_),")\n" if $DEBUG;

    my @data;
    foreach my $val (sort {$a &lt;=&gt; $b} (keys %{$self-&gt;{data}}))
    {
        push @data, $val foreach (1..$self-&gt;{data}{$val});
    }
    return @data;
}

sub frequency_distribution
{
    #Compute frequency distribution (histogram), borrowed heavily from Statistics::Descriptive
    #Behavior is slightly different than Statistics::Descriptive
    #e.g. if partition is not specified, we use uniq to set the number of partitions
    #     if partition = 0, then we return the data hash WITHOUT binning it into equal bins
    #Why? because I like it this way -- I often want to just see how many of each value I saw 
    #Also, you can manually pass in the bin info (min bin, bin size, and number of partitions)
    #I don't cache the frequency data like Statistics::Descriptive does since it's not as expensive to compute
    #but I might add that later
    #todo: the minbin/binsize stuff is funky and not intuitive -- fix it
    my $self = shift;
    print __PACKAGE__,"-&gt;frequency_distribution(",join(',',@_),")\n" if $DEBUG;

    my $partitions = shift; #how many partitions (bins)?
    my $minbin = shift; #upper bound of first bin
    my $binsize = shift; #how wide is each bin?
    
    #if partition == 0, then just give 'em the data hash
    if (defined $partitions &amp;&amp; ($partitions == 0))
    {
        $self-&gt;{frequency_partitions} = 0;
        %{$self-&gt;{frequency}} = %{$self-&gt;{data}};
        return %{$self-&gt;{frequency}};
    }

    #otherwise, partition better be &gt;= 1
    return undef unless $partitions &gt;= 1;

    $self-&gt;_all_stats() if $self-&gt;{dirty}; #recompute stats if dirty, (so we have count)
    return undef if $self-&gt;{count} &lt; 2; #must have at least 2 values 

    #set up the bins
    my ($interval, $iter, $max);
    if (defined $minbin &amp;&amp; defined $binsize)
    {
        $iter = $minbin;
        $max = $minbin+$partitions*$binsize - $binsize;
        $interval = $binsize;
        $iter -= $interval; #so that loop that sets up bins works correctly
    }
    else
    {
        $iter = $self-&gt;{min};
        $max = $self-&gt;{max};
        $interval = $self-&gt;{sample_range}/$partitions;
    }
    my @k;
    my %bins;
    while (($iter += $interval) &lt; $max)
    {
        $bins{$iter} = 0;
        push @k, $iter;
    }
    $bins{$max} = 0;
    push @k, $max;

    VALUE: foreach my $val (keys %{$self-&gt;{data}})
    {
        foreach my $k (@k)
        {
            if ($val &lt;= $k)
            {
                $bins{$k} += $self-&gt;{data}{$val};  #how many of this value do we have?
                next VALUE;
            }
        }
    }

    %{$self-&gt;{frequency}} = %bins;
    $self-&gt;{frequency_partitions} = $partitions; #in case I add caching in the future
    return %{$self-&gt;{frequency}};
}

sub AUTOLOAD {
    my $self = shift;
    my $type = ref($self)
        or croak "$self is not an object";
    my $name = $AUTOLOAD;
    $name =~ s/.*://;     ##Strip fully qualified-package portion
    return if $name eq "DESTROY";
    unless (exists $self-&gt;{permitted}{$name} ) {
        croak "Can't access `$name' field in class $type";
    }

    print __PACKAGE__,"-&gt;AUTOLOAD $name\n" if $DEBUG;

    #compute stats if necessary
    $self-&gt;_all_stats() if $self-&gt;{dirty};
    return $self-&gt;{$name};
}

1;

} #BEGIN

__END__

=head1 NAME

StatisticsDescriptiveDiscretized -- any ideas for a better name?

=head1 SYNOPSIS

  use StatisticsDescriptiveDiscretized;
  
  my $stats = new StatisticsDescriptiveDiscretized;
  $stats-&gt;add_data(1,10,2,0,1,4,5,1,10,8,7);
  print "count = ",$stats-&gt;count(),"\n";
  print "uniq  = ",$stats-&gt;uniq(),"\n";
  print "sum = ",$stats-&gt;sum(),"\n";
  print "min = ",$stats-&gt;min(),"\n";
  print "max = ",$stats-&gt;max(),"\n";
  print "mean = ",$stats-&gt;mean(),"\n";
  print "standard_deviation = ",$stats-&gt;standard_deviation(),"\n";
  print "variance = ",$stats-&gt;variance(),"\n";
  print "sample_range = ",$stats-&gt;sample_range(),"\n";
  print "mode = ",$stats-&gt;mode(),"\n";
  print "median = ",$stats-&gt;median(),"\n";

=head1 DESCRIPTION

This module provides basic functions used in descriptive statistics.
It borrows very heavily from Statistics::Descriptive::Full with one major
difference.  This module is optimized for discretized data (anyone 
know a better word for that?) e.g. data from an A/D converter that
has a discrete set of possible values.  E.g. if your data is produced
by an 8 bit A/D then you'd have only 256 possible values in your data 
set.  Even though you might have a million data points, you'd only have
256 different values in those million points.  Instead of storing the 
entire data set as Statistics::Descriptive does, this module only stores
the values it's seen and the number of times it's seen each value.

For very large data sets, this storage method results in significant speed
and memory improvements.  In a test case with 2.6 million data points from
a real world application, StatisticsDescriptiveDiscretized took 40 seconds 
to calculate a set of statistics instead of the 561 seconds required by
Statistics::Descriptive::Full.  It also required only 6MB of RAM instead of 
the 400MB used by Statistics::Descriptive::Full.

=head1 NOTE

This module is incomplete and not fully tested.  It's currently only alpha 
code so use at your own risk.

=head1 AUTHOR

Rhet Turnbull, RhetTbull on perlmonks.org

=head1 COPYRIGHT

Copyright (c) 2002 Rhet Turnbull. All rights reserved.  This
program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

Portions of this code is from Statistics::Descriptive which is under
the following copyrights:

Copyright (c) 1997,1998 Colin Kuskie. All rights reserved.  This
program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

Copyright (c) 1998 Andrea Spinelli. All rights reserved.  This program
is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.

Copyright (c) 1994,1995 Jason Kastner. All rights
reserved.  This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.

&lt;/code&gt;
</field>
</data>
</node>
