Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Statistics::LineFit versus gnuplot, results differ

by neilwatson (Curate)
on Apr 25, 2014 at 18:06 UTC ( #1083824=perlquestion: print w/ replies, xml ) Need Help??
neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I've created a utility using Statistics::LineFit and another using Gnuplot and fed both the same sample data. The results differ, so I must have made a mistake, but I can't see where.

Perl code

#!/usr/bin/perl use strict; use warnings; use Statistics::LineFit; use Time::Local; use Data::Dumper; my @x_axis; my @y_axes; sub date_to_epoch { my $date = shift; my ( $y, $m, $d ) = split /-/, $date; return timelocal( '59', '59', '23', $d, $m, $y ); #return timelocal( '0', '0', '0', $d, $m, $y ); } sub max_value { my @array = @_; my $max = $array[0]; for ( my $i = 0; $i <= $#array; $i++ ) { $max = $array[$i] if ( $array[$i] > $max ); } return $max; } my @epochs; while (<DATA>) { next if ( m/^#/ ); chomp; if ( my @line = split /\s+/ ) { my $epoch = date_to_epoch( $line[0] ); # factor down epoch or slope is too shallow. push @x_axis, $epoch; shift @line; for ( my $y = 0; $y <= $#line; $y++ ) { push @{$y_axes[$y]}, $line[$y] ; } } } print Dumper ( \@x_axis ); print Dumper ( \@y_axes ); my $lineFit = Statistics::LineFit->new( 0, 0 ); # TODO change 2nd to 1 $lineFit->setData( \@x_axis, \@{$y_axes[0]} ) or die "Invalid regressi +on data\n"; my ( $intercept, $slope ) = $lineFit->coefficients(); print "Slope(m): $slope Y-intercept(b): $intercept\n"; my %fitline; $fitline{y1} = $intercept; $fitline{x1} = 0; $fitline{y2} = max_value( @{$y_axes[0]} ); $fitline{x2} = ( $fitline{y2} - $fitline{y1} ) / $slope + $fitline{x1} +; print Dumper ( \%fitline ); __DATA__ # date notkept hosts 2014-04-01 50 10 2014-04-02 63 11 2014-04-03 120 12 2014-04-04 55 20 2014-04-05 60 22 2014-04-06 63 25 2014-04-07 52 24

Gnuplot

#!/usr/bin/gnuplot #set output "test.png" set title "Promises not kept" set xlabel "Date" set ylabel "Count" set rmargin 7 set border linewidth 2 set style line 1 linecolor rgb 'blue' linetype 1 linewidth 2 set style line 2 linecolor rgb 'black' linetype 1 linewidth 2 set style fill solid set xdata time set timefmt "%Y-%m-%d" set format x "%Y-%m-%d" set grid front set grid set autoscale # 1e8 reduces the epoch seconds for a less flat line. h(x) = m2 * x + b2 fit h(x) 'test.dat' using 1:3 via m2,b2 p(x) = m1 * x + b1 fit p(x) 'test.dat' using 1:2 via m1,b1 #set terminal png enhanced size 1024,768 plot 'test.dat' using 1:2 title 'Promises not kept' with boxes lc rgb +"orange", \ p(x) title 'Promise Trend' with lines linestyle 1, \ h(x) title 'Host Trend' with lines linestyle 2

test.dat

# date notkept hosts 2014-04-01 50 10 2014-04-02 63 11 2014-04-03 120 12 2014-04-04 55 20 2014-04-05 60 22 2014-04-06 63 25 2014-04-07 52 24

Perl results

$VAR1 = [ 1399003199, 1399089599, 1399175999, 1399262399, 1399348799, 1399435199, 1399521599 ]; $VAR1 = [ [ '50', '63', '120', '55', '60', '63', '52' ], [ '10', '11', '12', '20', '22', '25', '24' ] ]; Slope(m): -2.23214285714286e-05 Y-intercept(b): 31299.6785491071 $VAR1 = { 'y1' => '31299.6785491071', 'x2' => 1396849599, 'y2' => 120, 'x1' => 0 };

Gnuplot results

Final set of parameters Asymptotic Standard Error ======================= ========================== m1 = 1.44796e-07 +/- 5.823e-05 (4.022e+04%) b1 = 1 +/- 2.62e+04 (2.62e+06%) correlation matrix of the fit parameters: m1 b1 m1 1.000 b1 -1.000 1.000

Note that m1 and b1 from gnuplot are not the same as Slope and Y-intercept from Perl. Why?

Neil Watson
watson-wilson.ca

Comment on Statistics::LineFit versus gnuplot, results differ
Select or Download Code
Re: Statistics::LineFit versus gnuplot, results differ
by sn1987a (Scribe) on Apr 25, 2014 at 19:04 UTC

    Using LibreOffice to graph the data as reported by your perl program and fit a line, I get the same result as the perl module.

    I don't know enough Gnuplot to know what is going on there, but notice that the Asymptotic Standard Error is many times the values of the parameters. Event the perl data has a correlation coefficient of 0.02

    You also have an error in the time conversion in the perl code.
    return timelocal( '59', '59', '23', $d, $m, $y );
    should be
    return timelocal( '59', '59', '23', $d, $m-1, $y );.
    This will not effect the slope, just the y-intercept.

Re: Statistics::LineFit versus gnuplot, results differ
by kevbot (Hermit) on Apr 26, 2014 at 05:45 UTC

    In order to more directly compare your perl and gnuplot result, I took the epoch time values an placed them in the test.dat file.

    test.dat

    # date notkept hosts 1399003199 50 10 1399089599 63 11 1399175999 120 12 1399262399 55 20 1399348799 60 22 1399435199 63 25 1399521599 52 24
    Then I performed the fit with this simpler gnuplot script.
    !/usr/bin/gnuplot set xlabel "Date" set ylabel "Count" set autoscale p(x) = m1*x + b1 fit p(x) 'test.dat' using 1:2 via m1, b1 plot 'test.dat' using 1:2, p(x)
    I get the following:
    Final set of parameters Asymptotic Standard Error ======================= ========================== m1 = 4.65548e-08 +/- 5.823e-05 (1.251e+05%) b1 = 1 +/- 8.148e+04 (8.148e+06%)
    Adding initial guesses to the gnuplot script results in a slope value that is closer to the perl result:
    #!/usr/bin/gnuplot set xlabel "Date" set ylabel "Count" set autoscale p(x) = m1*x + b1 m1 = -1e-5 b1 = 30000 fit p(x) 'test.dat' using 1:2 via m1, b1 plot 'test.dat' using 1:2, p(x)
    Final set of parameters Asymptotic Standard Error ======================= ========================== m1 = -2.13926e-05 +/- 5.736e-05 (268.1%) b1 = 30000 +/- 8.027e+04 (267.6%)
    All curve fitting algorithms are not equal, and this suggests that the gnuplot algorithm is more sensitive to the initial guesses compared to the perl algorithm...or the perl algorithm may generate better initial guesses from the input data compared to gnuplot. This is just an educated guess, I have not looked at the code for either algorithm. If I normalize the x-axis values by dividing them by the first x value, thus giving x values that are closer in magnitude compared to the y values, I get a result that is closer to that of your perl code.
    #!/usr/bin/gnuplot set xlabel "Date" set ylabel "Count" set autoscale p(x) = m1 * (x / 1399003199) + b1 fit p(x) 'test.dat' using 1:2 via m1, b1 plot 'test.dat' using 1:2, p(x)
    Final set of parameters Asymptotic Standard Error ======================= ========================== m1 = -31227.8 +/- 8.025e+04 (257%) b1 = 31299.7 +/- 8.026e+04 (256.4%)

    Note that you need to divide the m1 value by 1399003199 to compare it to your perl results, giving -2.23214643271162e-05. So, I suspect there is a weakness in the gnuplot algorithm that shows up when the magnitudes of the x and y values are very different.

    Fitting the same data using the lm function in R gives the following output (without giving initial guesses, and without normalizing the x values):

    > x [1] 1399003199 1399089599 1399175999 1399262399 1399348799 [6] 1399435199 1399521599 > y [1] 50 63 120 55 60 63 52 > lm(y ~ x) Call: lm(formula = y ~ x) Coefficients: (Intercept) x 3.130e+04 -2.232e-05 > m <- lm(y ~ x) > summary(m) Call: lm(formula = y ~ x) Residuals: 1 2 3 4 5 6 7 -21.9286 -7.0000 51.9286 -11.1429 -4.2143 0.7143 -8.3571 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.130e+04 8.026e+04 0.390 0.713 x -2.232e-05 5.736e-05 -0.389 0.713 Residual standard error: 26.22 on 5 degrees of freedom Multiple R-squared: 0.0294, Adjusted R-squared: -0.1647 F-statistic: 0.1514 on 1 and 5 DF, p-value: 0.7132
    which is similar to your perl result. For this dataset, the standard errors are pretty high compared to the fit values. They are about 2 to 3 times the parameter values, this means that you could change the fit slope and intercept by 2 to 3 times and get a similar quality of fit. Ideally, standard errors should be smaller than your fit parameter values.
Re: Statistics::LineFit versus gnuplot, results differ
by sn1987a (Scribe) on Apr 27, 2014 at 02:11 UTC

    kevbot is on to something here. The two systems use rather different methods for fitting the data.

    Statistics::LineFit and LibreOffice are doing a standard linear least squares fit. This is not an iterative process and is not dependent on any initial guesses.

    The Gnuplot documentation for fit says that it uses a nonlinear least-squares (NLLS) Marquardt-Levenberg algorithm.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1083824]
Approved by kevbot
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (6)
As of 2014-07-29 23:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (229 votes), past polls