"be consistent" PerlMonks

### Statistics::LineFit versus gnuplot, results differ

by neilwatson (Priest)
 on Apr 25, 2014 at 18:06 UTC Need Help??
neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I've created a utility using Statistics::LineFit and another using Gnuplot and fed both the same sample data. The results differ, so I must have made a mistake, but I can't see where.

## Perl code

```#!/usr/bin/perl

use strict;
use warnings;
use Statistics::LineFit;
use Time::Local;
use Data::Dumper;

my @x_axis;
my @y_axes;

sub date_to_epoch
{
my \$date = shift;
my ( \$y, \$m, \$d ) = split /-/, \$date;
return timelocal( '59', '59', '23', \$d, \$m, \$y );
#return timelocal( '0', '0', '0', \$d, \$m, \$y );
}

sub max_value
{
my @array = @_;
my \$max = \$array[0];

for ( my \$i = 0; \$i <= \$#array; \$i++ )
{
\$max = \$array[\$i] if ( \$array[\$i] > \$max );
}
return \$max;
}

my @epochs;

while (<DATA>)
{
next if ( m/^#/ );
chomp;

if ( my @line = split /\s+/ )
{
my \$epoch = date_to_epoch( \$line[0] );

# factor down epoch or slope is too shallow.
push @x_axis, \$epoch;
shift @line;

for ( my \$y = 0; \$y <= \$#line; \$y++ )
{
push @{\$y_axes[\$y]}, \$line[\$y] ;
}
}
}

print Dumper ( \@x_axis );
print Dumper ( \@y_axes );

my \$lineFit = Statistics::LineFit->new( 0, 0 ); # TODO change 2nd to 1
\$lineFit->setData( \@x_axis, \@{\$y_axes[0]} ) or die "Invalid regressi
+on data\n";
my ( \$intercept, \$slope ) = \$lineFit->coefficients();
print "Slope(m): \$slope  Y-intercept(b): \$intercept\n";

my %fitline;
\$fitline{y1} = \$intercept;
\$fitline{x1} = 0;
\$fitline{y2} = max_value( @{\$y_axes[0]} );
\$fitline{x2} = ( \$fitline{y2} - \$fitline{y1} ) / \$slope + \$fitline{x1}
+;

print Dumper ( \%fitline  );

__DATA__
# date     notkept hosts
2014-04-01 50      10
2014-04-02 63      11
2014-04-03 120     12
2014-04-04 55      20
2014-04-05 60      22
2014-04-06 63      25
2014-04-07 52      24

## Gnuplot

```#!/usr/bin/gnuplot

#set output "test.png"
set title "Promises not kept"
set xlabel "Date"
set ylabel "Count"
set rmargin 7

set border linewidth 2
set style line 1 linecolor rgb 'blue' linetype 1 linewidth 2
set style line 2 linecolor rgb 'black' linetype 1 linewidth 2
set style fill solid

set xdata time
set timefmt "%Y-%m-%d"
set format x "%Y-%m-%d"
set grid front
set grid
set autoscale

# 1e8 reduces the epoch seconds for a less flat line.
h(x) = m2 * x + b2
fit h(x) 'test.dat' using 1:3 via m2,b2
p(x) = m1 * x + b1
fit p(x) 'test.dat' using 1:2 via m1,b1

#set terminal png enhanced size 1024,768
plot 'test.dat' using 1:2 title 'Promises not kept' with boxes lc rgb
+"orange", \
p(x) title 'Promise Trend' with lines linestyle 1, \
h(x) title 'Host Trend' with lines linestyle 2

## test.dat

```# date     notkept hosts
2014-04-01 50      10
2014-04-02 63      11
2014-04-03 120     12
2014-04-04 55      20
2014-04-05 60      22
2014-04-06 63      25
2014-04-07 52      24

## Perl results

```\$VAR1 = [
1399003199,
1399089599,
1399175999,
1399262399,
1399348799,
1399435199,
1399521599
];
\$VAR1 = [
[
'50',
'63',
'120',
'55',
'60',
'63',
'52'
],
[
'10',
'11',
'12',
'20',
'22',
'25',
'24'
]
];
Slope(m): -2.23214285714286e-05  Y-intercept(b): 31299.6785491071
\$VAR1 = {
'y1' => '31299.6785491071',
'x2' => 1396849599,
'y2' => 120,
'x1' => 0
};

## Gnuplot results

```Final set of parameters            Asymptotic Standard Error
=======================            ==========================

m1              = 1.44796e-07      +/- 5.823e-05    (4.022e+04%)
b1              = 1                +/- 2.62e+04     (2.62e+06%)

correlation matrix of the fit parameters:
m1     b1
m1              1.000
b1             -1.000  1.000

Note that m1 and b1 from gnuplot are not the same as Slope and Y-intercept from Perl. Why?

Neil Watson
watson-wilson.ca

Replies are listed 'Best First'.
Re: Statistics::LineFit versus gnuplot, results differ
by sn1987a (Chaplain) on Apr 25, 2014 at 19:04 UTC

Using LibreOffice to graph the data as reported by your perl program and fit a line, I get the same result as the perl module.

I don't know enough Gnuplot to know what is going on there, but notice that the Asymptotic Standard Error is many times the values of the parameters. Event the perl data has a correlation coefficient of 0.02

You also have an error in the time conversion in the perl code.
return timelocal( '59', '59', '23', \$d, \$m, \$y );
should be
return timelocal( '59', '59', '23', \$d, \$m-1, \$y );.
This will not effect the slope, just the y-intercept.

Re: Statistics::LineFit versus gnuplot, results differ
by kevbot (Priest) on Apr 26, 2014 at 05:45 UTC

In order to more directly compare your perl and gnuplot result, I took the epoch time values an placed them in the test.dat file.

test.dat

```# date     notkept hosts
1399003199 50      10
1399089599 63      11
1399175999 120     12
1399262399 55      20
1399348799 60      22
1399435199 63      25
1399521599 52      24
Then I performed the fit with this simpler gnuplot script.
```!/usr/bin/gnuplot

set xlabel "Date"
set ylabel "Count"

set autoscale
p(x) = m1*x + b1
fit p(x) 'test.dat' using 1:2 via m1, b1

plot 'test.dat' using 1:2, p(x)
I get the following:
```Final set of parameters            Asymptotic Standard Error
=======================            ==========================

m1              = 4.65548e-08      +/- 5.823e-05    (1.251e+05%)
b1              = 1                +/- 8.148e+04    (8.148e+06%)
Adding initial guesses to the gnuplot script results in a slope value that is closer to the perl result:
```#!/usr/bin/gnuplot

set xlabel "Date"
set ylabel "Count"

set autoscale

p(x) = m1*x + b1
m1 = -1e-5
b1 = 30000
fit p(x) 'test.dat' using 1:2 via m1, b1

plot 'test.dat' using 1:2, p(x)
```Final set of parameters            Asymptotic Standard Error
=======================            ==========================

m1              = -2.13926e-05     +/- 5.736e-05    (268.1%)
b1              = 30000            +/- 8.027e+04    (267.6%)
All curve fitting algorithms are not equal, and this suggests that the gnuplot algorithm is more sensitive to the initial guesses compared to the perl algorithm...or the perl algorithm may generate better initial guesses from the input data compared to gnuplot. This is just an educated guess, I have not looked at the code for either algorithm. If I normalize the x-axis values by dividing them by the first x value, thus giving x values that are closer in magnitude compared to the y values, I get a result that is closer to that of your perl code.
```#!/usr/bin/gnuplot

set xlabel "Date"
set ylabel "Count"

set autoscale

p(x) = m1 * (x / 1399003199)  + b1
fit p(x) 'test.dat' using 1:2 via m1, b1

plot 'test.dat' using 1:2, p(x)
```Final set of parameters            Asymptotic Standard Error
=======================            ==========================

m1              = -31227.8         +/- 8.025e+04    (257%)
b1              = 31299.7          +/- 8.026e+04    (256.4%)

Note that you need to divide the m1 value by 1399003199 to compare it to your perl results, giving -2.23214643271162e-05. So, I suspect there is a weakness in the gnuplot algorithm that shows up when the magnitudes of the x and y values are very different.

Fitting the same data using the lm function in R gives the following output (without giving initial guesses, and without normalizing the x values):

```> x
[1] 1399003199 1399089599 1399175999 1399262399 1399348799
[6] 1399435199 1399521599
> y
[1]  50  63 120  55  60  63  52
> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
3.130e+04   -2.232e-05

> m <- lm(y ~ x)
> summary(m)

Call:
lm(formula = y ~ x)

Residuals:
1        2        3        4        5        6        7
-21.9286  -7.0000  51.9286 -11.1429  -4.2143   0.7143  -8.3571

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.130e+04  8.026e+04   0.390    0.713
x           -2.232e-05  5.736e-05  -0.389    0.713

Residual standard error: 26.22 on 5 degrees of freedom
Multiple R-squared:  0.0294,    Adjusted R-squared:  -0.1647
F-statistic: 0.1514 on 1 and 5 DF,  p-value: 0.7132

which is similar to your perl result. For this dataset, the standard errors are pretty high compared to the fit values. They are about 2 to 3 times the parameter values, this means that you could change the fit slope and intercept by 2 to 3 times and get a similar quality of fit. Ideally, standard errors should be smaller than your fit parameter values.
Re: Statistics::LineFit versus gnuplot, results differ
by sn1987a (Chaplain) on Apr 27, 2014 at 02:11 UTC

kevbot is on to something here. The two systems use rather different methods for fitting the data.

Statistics::LineFit and LibreOffice are doing a standard linear least squares fit. This is not an iterative process and is not dependent on any initial guesses.

The Gnuplot documentation for fit says that it uses a nonlinear least-squares (NLLS) Marquardt-Levenberg algorithm.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://1083824]
Approved by kevbot
help
Chatterbox?
 [Discipulus]: thanks all [LanX]: ? [LanX]: disc+ you can also modify this  approach if you don't want recursions

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (10)
As of 2018-03-19 11:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
When I think of a mole I think of:

Results (239 votes). Check out past polls.

Notices?