A lesson in statistics

0xbeef has asked for the wisdom of the Perl Monks concerning the following question:

Dear Stat Monks,

I am a fool, and strictly speaking this problem is not about Perl but about my lack of statistical knowledge. I apologise for that... I present the following simplest of problems:

vmstat 1 10 extract:
po fr
0  0
0  0
150  10
0  0
0  0

I wish to calculate the ratio po:fr for this. If it exceeds 15:1, make some printf noise. My current solution is:


my $tlsamples = @series_po; # = @series_fr
return 0 if ($tlsamples == 0);
my $sum_po = sum(\@series_po); # = 150
my $sum_fr = sum(\@series_fr); # = 10
 
$sum_fr = 1 if ($sum_fr == 0);
my $avg_po = $sum_po / $tlsamples; # =150 / 5 = 30
my $avg_fr = $sum_fr / $tlsamples; # = 10 / 5 = 2

$avg_fr = 1 if ($avg_fr == 0); # avoid div/0
my $pofr = $avg_po / $avg_fr; # = 15

This result of 15:1 is the same as for the following single-sample series:

po  fr
150 10

The problem is, I need the zeroes to be significant in the first series, since they are. A single value spike should not be able to cause an alert, given many other zero values! (where 0 = no activity in vmstat context)

I have zero (pun intended) statistical background. I have thought of substituting each zero value with its nearest least-significant alternative, e.g.

po fr
150 10
1 1
1 1
1 1
1 1

In this case the ratio works out to (154/5) / (14/5) = 11. Is there a correct statistical perl-friendly approach that provides significance to the zeroes in the series?

Niel


Replies are listed 'Best First'.
Re: A lesson in statistics (no, specs) by tye (Sage) on Mar 20, 2007 at 01:47 UTC
For context:

po  Pages paged out
fr  Pages freed per second

Since po=1 and fr=0 is more than a million times "worse" than your "15 times" threshold and yet I really doubt it represents a situation that you want to be worried about, I think your "15 times" criterion is not enough.

Your problem sample data shows samples where every single sample has `po >= 15*fr` so, of course, it fires the "15 times" alarm. That problem is more with your choice of alarm criteria than with your arithmetic.

If your "15 times" does a good job even for quite large values (it certainly doesn't for very small values), then perhaps you just need to add a minimum criterion. Forcing fr=1 as a minimum is a fine way of saying that `po < 15` is never alarming.

So if po stays at 14 for many samples while fr stays at 0 for many samples, is that indicative of a problem? It goes off the scale for your stated "15 times" criterion. But it never reaches the criterion if you set a minimum of 1 for fr.

Is po=300,fr=15 really much more worrying than po=3000,fr=250?

So play with some more data and figure out criteria that better represent the situation you are worried about than just "15 times".

- tye
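A minimal sketch of the "minimum criterion" idea above, combining the ratio test with a floor on fr and a separate floor on po. The sub name `is_alarming` and both tuning values are hypothetical, chosen only for illustration; real thresholds would have to come from observed data.

```perl
use strict;
use warnings;

# Hypothetical tuning values, not taken from the thread.
my $RATIO_LIMIT = 15;   # alarm when po exceeds 15 times fr
my $MIN_PO      = 15;   # below this many page-outs, never alarm

# Returns 1 if a single (po, fr) sample should be considered alarming.
sub is_alarming {
    my ($po, $fr) = @_;
    return 0 if $po < $MIN_PO;          # low volume: ignore regardless of ratio
    my $fr_floor = $fr < 1 ? 1 : $fr;   # treat fr below 1 as 1 to avoid div/0
    return ($po / $fr_floor) > $RATIO_LIMIT ? 1 : 0;
}

for my $sample ([1, 0], [14, 0], [150, 10], [300, 15], [3000, 250]) {
    my ($po, $fr) = @$sample;
    printf "po=%-5d fr=%-4d alarming=%d\n", $po, $fr, is_alarming($po, $fr);
}
```

With these values only po=300,fr=15 is flagged. Keeping the po floor separate from the fr floor means it can be raised independently if low-volume spikes still cause false positives.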
Re^2: A lesson in statistics (no, specs) by 0xbeef (Hermit) on Mar 20, 2007 at 05:26 UTC
Sorry for misleading you, but my initial example is bogus - I merely tried to illustrate the problem I had in requiring the zero-values to be significant in the ratio.

The real-life alert is called the Thrashing Severity Ratio, and is for a po:fr ratio = 1/6 (17%). This is described by Tom Farwell in a writeup of paging spaces, and may be somewhat specific to IBM's AIX. My problem with that writeup is two-fold:

1. Periods of inactivity (0,0 values) are not given enough weight (this may lead to false positives)
2. The overall volume (po = 4k pages swapped to paging) is not considered, and low volume spikes may provide additional false positives (but NOT if sustained).

I should perhaps have mentioned the actual problem from the start, but I fear the downvote of Monks who feel that this discussion is not close enough to a pure perl problem!

Niel
Re: A lesson in statistics by eric256 (Parson) on Mar 19, 2007 at 23:45 UTC
I don't know much statistics, but it seemed like the average of the last x samples would do what you want. This code behaves how I interpreted what you want. Changing $sample will change how much data it holds to average over; that would be a matter of preference on your part. I would think if you don't restrict the samples you'll just end up with gibberish, assuming you are sampling some source for this data on a regular basis. If that is the case then this will tell you if at any point the last x samples averaged over 15:1.

use strict;
use warnings;
use Data::Dumper;

my @que    = ();   # sliding window of the most recent [po, fr] samples
my $sample = 5;    # window size

# Mean of the per-sample po/fr ratios; a sample with fr == 0 counts as ratio 0
sub average_ratio {
    my @data  = @_;
    my $ratio = 0;
    for (@data) {
        $ratio = $ratio + ($_->[1] != 0 ? ($_->[0] / $_->[1]) : 0);
    }
    return $ratio / scalar @data;
}

while (my $line = <DATA>) {
    chomp $line;
    my ($po, $fr) = split(m/\s/, $line);
    push @que, [$po, $fr];
    shift @que if @que > $sample;   # drop the oldest sample once the window is full
    my $avg = average_ratio(@que);
    print "Adding [$po,\t $fr]\t makes the avg_ratio: $avg\n";
    print "DANGER\n" if $avg > 15;
}

__DATA__
0 0
0 0
150 10
0 0
200 40
210 40
220 40
220 30
0 0
0 0
0 0
220 20
220 10
220 05
220 01
220 100
220 100
2200 100
2200 2
2200 2
0 0
2200 1
0 0
0 0
0 0
0 0
0 0
0 0

___________
Eric Hodges
Re^2: A lesson in statistics by 0xbeef (Hermit) on Mar 20, 2007 at 22:13 UTC
Hi Eric, Thanks for your nice example, it matches what I currently deem the best solution (with input from others here) - calculating the mean of the ratio. I'm not sure if there is a way to eliminate more false positives... but using this method a single spike will at least not cause an exception. Niel
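To make the difference concrete, here is a small sketch that runs the two series from the original question through both calculations. The sub names `ratio_of_means` and `mean_of_ratios` are made up for this illustration, and counting a sample with fr == 0 as ratio 0 follows the choice in the code above; it also means a sample with po > 0 and fr == 0 contributes nothing, which may not be what you want.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Ratio of the means: the sample count cancels, so idle samples have no effect.
sub ratio_of_means {
    my @samples = @_;
    return 0 unless @samples;
    my $sum_po = sum(map { $_->[0] } @samples);
    my $sum_fr = sum(map { $_->[1] } @samples);
    return 0 if $sum_fr == 0;
    return $sum_po / $sum_fr;
}

# Mean of the per-sample ratios: each idle (0, 0) sample contributes a 0.
sub mean_of_ratios {
    my @samples = @_;
    return 0 unless @samples;
    my $total = sum(map { $_->[1] != 0 ? $_->[0] / $_->[1] : 0 } @samples);
    return $total / scalar @samples;
}

my @spike_only = ([150, 10]);
my @with_idle  = ([0, 0], [0, 0], [150, 10], [0, 0], [0, 0]);

printf "ratio of means: %g vs %g\n",
    ratio_of_means(@spike_only), ratio_of_means(@with_idle);
printf "mean of ratios: %g vs %g\n",
    mean_of_ratios(@spike_only), mean_of_ratios(@with_idle);
```

The ratio of the means is 15 for both series, while the mean of the ratios drops from 15 to 3 once the four idle samples are included.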
Re: A lesson in statistics by osunderdog (Deacon) on Mar 19, 2007 at 21:39 UTC
Perhaps something like this would work?

use strict;
use Statistics::Descriptive;

my $uwlRatio = 15;
my $poStat = Statistics::Descriptive::Sparse->new();
my $frStat = Statistics::Descriptive::Sparse->new();

while (my $line = <DATA>) {
    chomp $line;
    my ($poData, $frData) = split(m/\s/, $line);
    $poStat->add_data($poData);
    $frStat->add_data($frData);

    # Also require a non-zero fr mean, otherwise the division below would die
    if ($poStat->mean() > 0 && $frStat->mean() > 0) {
        my $pofrRatioMean = $poStat->mean() / $frStat->mean();
        if ($pofrRatioMean > $uwlRatio) {
            print "DANGER WILL ROBINSON! PO/FR ratio out of spec!\n";
        }
        else {
            print "PO/Fr ratio within spec: $pofrRatioMean\n";
        }
    }
    else {
        print "not enough data to calculate ratio.\n";
    }
}

__DATA__
0 0
0 0
150 10
0 0
200 40
210 40
220 40
220 30
220 20
220 10
220 05
220 01
220 100
220 100
2200 100
2200 2
2200 2
2200 1

With output like this:

$perl example.pl
not enough data to calculate ratio.
not enough data to calculate ratio.
PO/Fr ratio within spec: 15
PO/Fr ratio within spec: 15
PO/Fr ratio within spec: 7
PO/Fr ratio within spec: 6.22222222222222
PO/Fr ratio within spec: 6
PO/Fr ratio within spec: 6.25
PO/Fr ratio within spec: 6.77777777777778
PO/Fr ratio within spec: 7.57894736842105
PO/Fr ratio within spec: 8.51282051282051
PO/Fr ratio within spec: 9.59183673469388
PO/Fr ratio within spec: 7.09459459459459
PO/Fr ratio within spec: 5.85858585858586
PO/Fr ratio within spec: 9.11290322580645
PO/Fr ratio within spec: 13.4939759036145
DANGER WILL ROBINSON! PO/FR ratio out of spec!
DANGER WILL ROBINSON! PO/FR ratio out of spec!

Hazah! I'm Employed!
Re^2: A lesson in statistics by 0xbeef (Hermit) on Mar 19, 2007 at 21:58 UTC
Well no, since it does not provide any regard for the zero value samples. The zeroes equate to idleness, and should negate any quick spikes/activity. Thanks for pointing out Statistics::Descriptive though! Niel
Re^3: A lesson in statistics by osunderdog (Deacon) on Mar 20, 2007 at 11:22 UTC
Umm, I'm pretty sure that's what average or mean does...

The arithmetic mean, or mean of a set of measurements is the sum of the measurements divided by the total number of measurements.

Further information can be found at: http://en.wikipedia.org/wiki/Arithmetic_mean

The samples that are zero are counted, thus affecting the denominator but not the numerator.

Hazah! I'm Employed!
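A short derivation may help show why the two posters above see this differently. Writing po_i and fr_i for the i-th of N samples (notation assumed here, not taken from the thread), the ratio of the two arithmetic means is:

```latex
\frac{\overline{po}}{\overline{fr}}
  = \frac{\frac{1}{N}\sum_{i=1}^{N} po_i}{\frac{1}{N}\sum_{i=1}^{N} fr_i}
  = \frac{\sum_{i=1}^{N} po_i}{\sum_{i=1}^{N} fr_i}
```

Each zero sample does lower both means, as described above, but it lowers them by the same factor, so N cancels in the ratio. For the original five-sample series this gives (150/5) / (10/5) = 150/10 = 15, the same as for the single sample 150:10.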
Re: A lesson in statistics by kyle (Abbot) on Mar 19, 2007 at 20:55 UTC
Would it help to remove every outlier from the original data set and compute after that?
Re^2: A lesson in statistics by 0xbeef (Hermit) on Mar 19, 2007 at 21:40 UTC
I'd consider high PO:FR a statistical certainty if the majority of samples (over a predetermined fixed period) show a 15:1 or higher ratio. I have only been looking at this extreme case, but I actually don't think you could, since the outlier is significant (it proves a real thing - that momentary thrashing is occurring) - and calculating the ratio of the average factors that in. Niel
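One way to read the "majority of samples" criterion above is as a simple vote over a fixed window. The sketch below assumes that reading; the sub name `majority_exceeds` and the sample windows are invented for illustration.

```perl
use strict;
use warnings;

# Returns 1 when more than half of the samples in the window have
# po at least $limit times fr. Idle samples (po == 0) never count as
# exceeding, even though 0 >= $limit * 0 is technically true.
sub majority_exceeds {
    my ($limit, @window) = @_;
    return 0 unless @window;
    my $hits = grep { $_->[0] > 0 && $_->[0] >= $limit * $_->[1] } @window;
    return $hits > @window / 2 ? 1 : 0;
}

my @single_spike = ([0, 0], [0, 0], [150, 10], [0, 0], [0, 0]);
my @sustained    = ([150, 10], [300, 10], [0, 0], [400, 20], [0, 0]);

print "single spike: ", majority_exceeds(15, @single_spike), "\n";
print "sustained:    ", majority_exceeds(15, @sustained), "\n";
```

The single spike is outvoted by the four idle samples, while the sustained window has three of five samples at or above 15:1 and is flagged.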
Re: A lesson in statistics by hangon (Deacon) on Mar 20, 2007 at 04:27 UTC
My statistics is a bit rusty, and tye has a better handle on what you're actually doing, but this might help. When working with statistics, you generally define a range of acceptable sample values, and any values outside of this range are discarded. These are called outliers. For example, below is your code modified to ignore the samples where po == 0, so they will not skew your results.

my $tlsamples = @series_po; # = @series_fr
my $ok_samples = 0;
my $sum_po = 0;
my $sum_fr = 0;

for (my $i = 0; $i < $tlsamples; $i++) {
    # set up any conditions to skip outliers here
    if ($series_po[$i] == 0) {
        next;
    }
    # count & sum only good samples
    $ok_samples++;
    $sum_po += $series_po[$i];
    $sum_fr += $series_fr[$i];
}

return 0 if ($ok_samples == 0);
$sum_fr = 1 if ($sum_fr == 0);
my $avg_po = $sum_po / $ok_samples; # = 150 / 1 = 150 (only one sample has po != 0)
my $avg_fr = $sum_fr / $ok_samples; # = 10 / 1 = 10
$avg_fr = 1 if ($avg_fr == 0); # avoid div/0
my $pofr = $avg_po / $avg_fr; # = 15

