|Pathologically Eclectic Rubbish Lister|
Memory efficient statistical distribution classby Dallaylaen (Hermit)
|on Jun 07, 2013 at 11:47 UTC||Need Help??|
Dallaylaen has asked for the
wisdom of the Perl Monks concerning the following question:
I'd like to analyse some data (say, web-service response times) and get various statistical info, mainly percentiles/quantiles and presence of outstanding values.
I know about Statistics::Descriptive, however, I don't want to store all the data in memory. On the other hand, having my results off by a few % would be fine, I only care about huge differences.
So I came up with the following idea: create an array of logarithmic buckets, and count data points landing in each bucket. Having the data spread across 6 orders of magnitude and guaranteed precision of 1% still leaves me with 6 * log 10 / log 1.01 =~ 1400 buckets which is perfectly fine (36 kb of memory, given current Perl's scalar size).
Counting percentiles is simple - just add up bucket counters until $sum exceeds $percentage * $total_count.
However, before I start writing actual code, I would like to ask which memory efficient statistical modules and algorithms already exist (for Perl, of maybe other languages).
Looks like there's a similar Stackoverflow question, and there's similar method proposed in one of the answers. Haven't found a ready-made Perl implementation, though.