I use Statistics::Descriptive a fair bit. It's a great module that's been around a long time and is well tested. However, if you need to use the functionality in Statistics::Descriptive::Full, it stores your entire data set in an array. If you have, oh, 2 million plus data points, then that array gets rather large. On a recent real-world data set that I've been analyzing, I had several sets of data with 2.6 million data points. Statistics::Descriptive crunched the data ok but it took 10 minutes and more than 400MB of memory! Since I needed to analyze a lot of data I wanted to find a better way.

### There's got to be a better way

Although my data set is large, I happen to know an interesting thing about the data: it is satellite telemetry that's discretized (is that a word?) into a 4-bit word. That means there are at most 16 possible values for each data point. All of the statistics I'm interested in can be calculated if I know what values I saw and how many times I saw each value. Aha! Sounds like a job for a hash.

So, instead of storing every data point in an array, I only store the values I've seen and the number of times I've seen them in a hash.
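To make the idea concrete, here is a minimal sketch of the counting approach. It is not the module's actual code: `add_data` and `summarize` are names made up for this illustration, and the variance shown is the sample variance (n - 1 denominator), which is an assumption about what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: fold a batch of data points into the tally hash.
# Call it once per line or chunk so the raw data never has to be kept.
sub add_data {
    my ($count, @values) = @_;
    $count->{$_}++ for @values;
    return;
}

# Hypothetical helper: summary statistics from the tallies alone.
sub summarize {
    my ($count) = @_;
    my ($n, $sum, $min, $max) = (0, 0);
    for my $value (keys %$count) {
        my $times = $count->{$value};
        $n   += $times;
        $sum += $value * $times;
        $min = $value if !defined $min || $value < $min;
        $max = $value if !defined $max || $value > $max;
    }
    return unless $n;

    my $mean = $sum / $n;

    # Second pass over the (few) distinct values for the squared deviations.
    my $ss = 0;
    $ss += $count->{$_} * ($_ - $mean)**2 for keys %$count;

    # Sample variance (n - 1 denominator); undefined for a single point.
    my $variance = $n > 1 ? $ss / ($n - 1) : undef;

    return { count => $n, mean => $mean, min => $min, max => $max, variance => $variance };
}

my %count;
add_data(\%count, 3, 7, 7, 15);
add_data(\%count, 0, 3, 7);
my $stats = summarize(\%count);
printf "n=%d mean=%.3f min=%d max=%d variance=%.3f\n",
    @{$stats}{qw(count mean min max variance)};
```

However many points you feed in, the hash never holds more keys than there are distinct values, so for 4-bit telemetry it tops out at 16 entries.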

### Implementation

Using the hash idea, I implemented this module (named Statistics::Descriptive::Discretized for now). The data is stored in a hash instead of an array. This works very well if you have a limited number of discrete values in your data set. I've tested this with simulated 16 bit output (meaning 2^16 possible values) and it scales quite well, even with 1 million+ data points with 65,536 possible values. If your input data is not limited to discrete values then this will probably perform worse than the array method used by Statistics::Descriptive.

I've tried to keep the interface as close as possible to the Statistics::Descriptive interface. This is a rough draft, and not all of the routines in Statistics::Descriptive are implemented yet. (Indeed, any that depend on the original order of the data can't be implemented with this method.) For many purposes, though, this module should be a drop-in replacement for Statistics::Descriptive.
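Order-independent statistics that normally need the sorted data, such as the median, can still be recovered from the tallies by walking the distinct values in numeric order and accumulating their counts. The sketch below shows one way to do that. `median_from_counts` is a hypothetical name, not part of either module's interface, and averaging the two middle points for an even count is an assumed convention.

```perl
use strict;
use warnings;

# Hypothetical helper: median from a { value => occurrences } hash.
sub median_from_counts {
    my ($count) = @_;
    my $n = 0;
    $n += $_ for values %$count;
    return unless $n;

    # 1-based ranks of the middle point(s); they coincide when $n is odd.
    my $lo_rank = int(($n + 1) / 2);
    my $hi_rank = int($n / 2) + 1;

    my $seen = 0;
    my $lo;
    for my $value (sort { $a <=> $b } keys %$count) {
        $seen += $count->{$value};
        $lo = $value if !defined $lo && $seen >= $lo_rank;
        return ($lo + $value) / 2 if $seen >= $hi_rank;
    }
    return;
}

# Same data as 0, 3, 3, 7, 7, 7, 15 stored as tallies.
my %tally  = (0 => 1, 3 => 2, 7 => 3, 15 => 1);
my $median = median_from_counts(\%tally);
print "median=$median\n";
```

The sort touches only the distinct values, so its cost depends on how many different values there are, not on how many data points were seen.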

### Results

I tested this module (using Statistics::Descriptive as a baseline) against several large real world data sets. Statistics::Descriptive::Discretized scales linearly and blows the socks off of Statistics::Descriptive. (I'm not knocking the excellent Statistics::Descriptive -- it's a great module! I just present an alternative that works better for certain data sets). Here are some results:

| Data points | Run Time (sec), Statistics::Descriptive | Run Time (sec), Discretized |
|---|---|---|
| 100000 | 12 | 1.5 |
| 200000 | 24 | 3 |
| 300000 | 35 | 4 |
| 500000 | 59 | 7 |
| 700000 | 87 | 10 |
| 1000000 | 119 | 14 |
| 1500000 | 215 | 21 |
| 2000000 | 456 | 29 |
| 2600000 | 561 | 40 |

As you can see, past a million points Statistics::Descriptive's run time grows much faster than linearly (going from 1 million to 2 million points nearly quadruples it, from 119 to 456 seconds), while the Discretized version stays roughly linear. For the test case with 2.6 million data points, this module is 14 times faster than the baseline (and it uses only a few MB of RAM, while Statistics::Descriptive uses more than 400MB for this data set!).

### The Code

Here's a sample program that shows how to use it. If you have any suggestions, critiques, etc. please fire away. If this seems like a useful thing, I'll clean it up for the CPAN.

