http://www.perlmonks.org?node_id=562015

tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I need to run statistics on a large body of time series data that tracks different types of widgets purchased over time. I want to identify widget types that are becoming more popular.

I know a little statistics, and I think what I want, at least for starters, is the "constant, slope, and error" regression coefficients for each of my distributions.

In other words, snipping from code below:

# want $constant, $slope, and $error coefficients for regression equation
# fitting this data, where the distribution line is approximated by
# Y = $constant + $slope * x + $error
# Y = Dependent Variable (eg, widgets purchased at point in time)
# $constant = Y-axis Intercept
# $slope = Slope of the regression line
# x is Independent Variable (eg, time)
# $error = error factor, should be large for random distributions, small for
#          strongly correlated distributions
# See http://www.tufts.edu/~gdallal/slr.htm
#dummy for now -- what's the best way to do this?

The error factor tells me which distributions I can throw out. (Error factor will be large for random distributions.)

The other two factors will tell me how popular the widget is in comparison with other widgets, and how quickly it is increasing (or decreasing) in popularity.

I did a little test script with distributions for "random", "increasing slowly", and "increasing quickly." (Tests fail, but concretize what I want.)

Current output is:

$ perl trend.t
slow_increase distribution, constant 0, slope 0, error 0
random distribution, constant 0, slope 0, error 0
fast_increase distribution, constant 0, slope 0, error 0
not ok 1 - $slow_increase_error < $random_error
#   Failed test '$slow_increase_error < $random_error'
#   in trend.t at line 96.
not ok 2 - $fast_increase_error < $random_error
#   Failed test '$fast_increase_error < $random_error'
#   in trend.t at line 97.
not ok 3 - $slow_increase_slope < $fast_increase_slope
#   Failed test '$slow_increase_slope < $fast_increase_slope'
#   in trend.t at line 102.
1..3
# Looks like you failed 3 tests of 3.
$

The bit that I need help with is sub calculate_regression_coefficients, which is just dummy code right now.

Now, this is in a way a question about statistics as well as about perl. With statistics, like with perl, there's more than one way to do it: in this case, more than one method to get regression coefficients that fit a distribution. I just want the simplest, most vanilla, least computationally intensive way to do this, whatever that is.
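For reference, the most vanilla method is probably ordinary least squares with a single predictor. The formulas below are a sketch under that assumption, not something taken from the test script: $n$ is the number of observations $(x_i, y_i)$, and $b_0$ and $b_1$ are labels chosen here for $constant and $slope.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
            {\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
b_0 &= \bar{y} - b_1 \bar{x} \\
r^2 &= \frac{\left(\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\right)^2}
            {\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}
\end{align*}
```

What to report as the error factor is a judgment call. Two common candidates are $1 - r^2$, which does not depend on the scale of $y$, and the residual standard error $\sqrt{\sum_i (y_i - b_0 - b_1 x_i)^2 / (n - 2)}$, which is in the same units as $y$.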

There are a lot of statistics modules on the CPAN, and I assume there's something out there that covers what I need. Can someone point me in the right direction?

Thanks in advance!

#!/usr/bin/perl
use strict;
use warnings;
use Test::More qw(no_plan);

my $distributions = {
    random => {
        distribution => {
            1  => 3, 2  => 5, 3  => 2, 4  => 7, 5  => 1,
            6  => 3, 7  => 2, 8  => 6, 9  => 1, 10 => 1,
            11 => 3, 12 => 5, 13 => 6, 14 => 2, 15 => 8,
            16 => 9, 17 => 1, 18 => 4, 19 => 5, 20 => 6
        }
    },
    slow_increase => {
        distribution => {
            1  => 1, 2  => 1, 3  => 3, 4  => 2, 5  => 3,
            6  => 2, 7  => 3, 8  => 4, 9  => 3, 10 => 2,
            11 => 5, 12 => 4, 13 => 6, 14 => 5, 15 => 7,
            16 => 4, 17 => 8, 18 => 6, 19 => 9, 20 => 8
        }
    },
    fast_increase => {
        distribution => {
            1  => 2,  2  => 2,  3  => 6,  4  => 4,  5  => 6,
            6  => 4,  7  => 6,  8  => 8,  9  => 6,  10 => 4,
            11 => 10, 12 => 8,  13 => 12, 14 => 10, 15 => 14,
            16 => 8,  17 => 16, 18 => 12, 19 => 18, 20 => 16
        }
    }
};

for my $distribution_name ( keys %$distributions ) {
    my $distribution = $distributions->{$distribution_name};
    my $regression_coefficients = calculate_regression_coefficients($distribution);
    my ($constant, $slope, $error) = map { $regression_coefficients->{$_} } qw(constant slope error);
    print "$distribution_name distribution, constant $constant, slope $slope, error $error\n";
    $distributions->{$distribution_name}->{constant} = $constant;
    $distributions->{$distribution_name}->{slope}    = $slope;
    $distributions->{$distribution_name}->{error}    = $error;
}

# error of random distribution should be greater than either of the other two distributions
my $random_error        = $distributions->{random}->{error};
my $slow_increase_error = $distributions->{slow_increase}->{error};
my $fast_increase_error = $distributions->{fast_increase}->{error};
ok( $slow_increase_error < $random_error, '$slow_increase_error < $random_error' );
ok( $fast_increase_error < $random_error, '$fast_increase_error < $random_error' );

# fast increase slope should be greater than slow increase slope
my $slow_increase_slope = $distributions->{slow_increase}->{slope};
my $fast_increase_slope = $distributions->{fast_increase}->{slope};
ok( $slow_increase_slope < $fast_increase_slope, '$slow_increase_slope < $fast_increase_slope' );

# want $constant, $slope, and $error coefficients for regression equation
# fitting this data, where the distribution line is approximated by
# Y = $constant + $slope * x + $error
# Y = Dependent Variable (eg, widgets purchased at point in time)
# $constant = Y-axis Intercept
# $slope = Slope of the regression line
# x is Independent Variable (eg, time)
# $error = error factor, should be large for random distributions, small for
#          strongly correlated distributions
# See http://www.tufts.edu/~gdallal/slr.htm
# dummy for now -- what's the best way to do this?
sub calculate_regression_coefficients {
    my $distribution = shift or die "no distribution";
    return { constant => 0, slope => 0, error => 0 };
}
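One way the dummy sub could be filled in, as a minimal sketch rather than a definitive answer: a pure-Perl ordinary least-squares fit with no CPAN dependency. It assumes the data sits under the `distribution` key as x => y pairs, as in the test script. Reporting `error` as 1 - r^2 is an assumption made here, chosen because it is scale-free; the residual standard error would be another reasonable choice.

```perl
use strict;
use warnings;

# Ordinary least-squares fit of y = constant + slope * x.
# ASSUMPTION: "error" is reported as 1 - r^2 (0 = perfect line, 1 = no linear
# trend). The original post does not pin down a definition for it.
sub calculate_regression_coefficients {
    my $distribution = shift or die "no distribution";
    my $points = $distribution->{distribution} or die "no data points";

    my @xs = sort { $a <=> $b } keys %$points;
    my $n  = @xs;
    die "need at least two points" if $n < 2;

    # Means of x and y.
    my ( $sum_x, $sum_y ) = ( 0, 0 );
    for my $x (@xs) {
        $sum_x += $x;
        $sum_y += $points->{$x};
    }
    my ( $mean_x, $mean_y ) = ( $sum_x / $n, $sum_y / $n );

    # Sums of squares and cross-products around the means.
    my ( $sxx, $sxy, $syy ) = ( 0, 0, 0 );
    for my $x (@xs) {
        my $dx = $x - $mean_x;
        my $dy = $points->{$x} - $mean_y;
        $sxx += $dx * $dx;
        $sxy += $dx * $dy;
        $syy += $dy * $dy;
    }
    die "all x values are identical" if $sxx == 0;

    my $slope    = $sxy / $sxx;
    my $constant = $mean_y - $slope * $mean_x;

    # If every y is identical, a flat line fits exactly, so treat r^2 as 1.
    my $r_squared = $syy == 0 ? 1 : ( $sxy * $sxy ) / ( $sxx * $syy );

    return { constant => $constant, slope => $slope, error => 1 - $r_squared };
}
```

Worked by hand on the three sample data sets, this gives slopes of roughly 0.12 (random), 0.36 (slow_increase) and 0.72 (fast_increase), and errors of roughly 0.92, 0.21 and 0.21, so all three tests should pass. Because fast_increase is exactly twice slow_increase, the two share the same 1 - r^2; a measure in the units of y, such as residual standard error, would separate them.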