ehdonhon has asked for the wisdom of the Perl Monks concerning the following question:
Holiday Greetings Monks!
I have an interesting situation, and I'm hoping I
can find an intuitive solution by reusing code rather
than writing my own hacked-up code.
I have a situation where I need to analyze about
40,000 unique sets of data on a daily basis. My job is
to take each set of data (comprised of many (time, value)
pairs) and look for inconsistencies within that data set. The
data might be linear or exponential (if exponential, it
should have an always-increasing or always-decreasing slope), and the magnitude
of the values is irrelevant, unless there is a drastic
change in magnitude at some point in the data. I analyze
each data set separately, so the only relevance of
having 40,000 sets to look at is that the analysis cannot be too slow.
I guess what I'm looking for is something that can
take a whole bunch of (x, y) pairs, try to fit that data
to some sort of line or constantly /(in|de)creasing/
curve, and then let me know if there were any points that
fell outside of a given margin of error.
That probably sounds like a very specific problem, but
as I recall from my statistics classes (long, long ago),
it comes up quite frequently, so I'm hoping that somebody
knows about something that might come close to doing
something like this for me.
Thanks in advance!
Re: Seeking abnormalities in data sets.
by clintp (Curate) on Dec 27, 2001 at 01:37 UTC

In Orwant's book, Mastering Algorithms with Perl, at the end of Chapter 15 (Statistics) he talks about finding a "best-fit" straight line (linear least squares, regression line) for a set of data points: a standard y = bx + a kind of thing from HS Algebra.
Once you have that for a given range of points, I'd think it would be a small matter to find the rogues (correlation coefficient, r-to-t transformation), using the distance from that line to a given point in the set to see if any single point was really whacked out.
Since I'm not willing to retype the subroutines here, use the parenthetical terms above in a search engine to find a good algorithm you can transcribe to Perl. Or the MAP examples may be online somewhere at ORA (as they are for the Cookbook and other ORA publications).
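These aren't the book's subroutines either, but a minimal from-scratch sketch of the idea above: fit the regression line, then flag points whose residual is large relative to the residuals' standard deviation. The 2-sigma cutoff here is an arbitrary illustration; pick whatever margin suits your data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Least-squares fit of y = b*x + a over (x, y) pairs.
sub fit_line {
    my ($x, $y) = @_;                 # array refs of equal length
    my $n = @$x;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }
    my $b = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx ** 2);
    my $a = ($sy - $b * $sx) / $n;
    return ($a, $b);
}

# Return the indices of points more than $nsigma standard
# deviations away from the fitted line.
sub find_rogues {
    my ($x, $y, $nsigma) = @_;
    my ($a, $b) = fit_line($x, $y);
    my @resid = map { $y->[$_] - ($a + $b * $x->[$_]) } 0 .. $#$x;
    my $var = 0;
    $var += $_ ** 2 for @resid;
    my $sigma = sqrt($var / @resid);
    return grep { abs($resid[$_]) > $nsigma * $sigma } 0 .. $#resid;
}

my @x = (1 .. 10);
my @y = map { 2 * $_ + 1 } @x;        # clean line y = 2x + 1
$y[5] = 100;                          # plant one whacked-out point
my @bad = find_rogues(\@x, \@y, 2);
print "rogue indices: @bad\n";        # prints "rogue indices: 5"
```

Note that a gross outlier drags the fitted line toward itself, so for badly contaminated data you may want to refit after discarding the worst point and repeat.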
update: Fixed attribution.  [reply] 
Re: Seeking abnormalities in data sets.
by termix (Beadle) on Dec 27, 2001 at 01:24 UTC

If I understand your problem correctly, you wish to do curve fitting in Perl using large amounts of data and then detect the exceptions that are identified. Yes, there are a number of statistical methods to accomplish that (which I know very little about).
 The statistical modules for Perl might help. Check them out here. (or try the Math modules here).
 I know there is a book that talks about specific curve-fitting examples. Ah, here it is. I believe if you know the math behind the subject, you can create your own algorithm with the help of this book (and contribute to CPAN!).
 Maybe you don't have to do the curve fitting in Perl. You could use a statistics package instead, if there is one that you use and have code/scripts already written for. Perl can be your data-parsing and results-presentation system, and coordinate the work of the curve-fitting program.

termix
 [reply] 
Re: Seeking abnormalities in data sets.
by scain (Curate) on Dec 27, 2001 at 02:47 UTC

Do you know that your data sets will always be either (a) linear
(i.e., y = mx + b) or (b) exponential (y = A*exp(B*x))? If that's the
case, then you should be able to use linear least squares as
suggested above. Since (a) and (b) are separate cases, you would have
to try both, and decide on a case-by-case basis which fits better.
Also, in case (b), you can convert it to a linear problem
by taking the log of y and plotting that against x. (At least
that feels right at the moment... log(y) = log(A) + B*x... yeah,
that's it).
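A quick sketch of that log trick, assuming all y values are positive (the transform is undefined otherwise); the A and B here are made-up demo values:

```perl
use strict;
use warnings;

# Linearize y = A * exp(B*x) by fitting log(y) = log(A) + B*x
# with ordinary least squares on (x, log y).
my @x = (0 .. 5);
my @y = map { 2.0 * exp(0.5 * $_) } @x;   # demo data: A = 2, B = 0.5

my @logy = map { log($_) } @y;            # requires all y > 0

my $n = @x;
my ($sx, $sy, $sxx, $sxy) = (0) x 4;
for my $i (0 .. $n - 1) {
    $sx  += $x[$i];
    $sy  += $logy[$i];
    $sxx += $x[$i] ** 2;
    $sxy += $x[$i] * $logy[$i];
}
my $B = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx ** 2);
my $A = exp(($sy - $B * $sx) / $n);       # undo the log on the intercept
printf "A = %.3f, B = %.3f\n", $A, $B;    # prints "A = 2.000, B = 0.500"
```

One caveat: least squares on log(y) weights the small-y points more heavily than a direct exponential fit would, which is usually fine for spotting rogues but worth knowing.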
If your data could be of other forms, like higher-order polynomials,
then you would have to try all the options, and it would turn into
a slow mess, since you would have to try each of them for any
given set.
Good luck,
Scott  [reply] 
Re: Seeking abnormalities in data sets.
by toma (Vicar) on Dec 27, 2001 at 09:10 UTC

For your task PDL will be enormously
valuable. It will be worth every bit of effort to obtain and
learn it.
There is the nice PDL::Fit::Linfit
module, which does a general curve fit to a linear
combination of specified functions.
PDL also has functions for selecting the inconsistent data,
creating plots, and generating statistical summaries.
The PDL module is a perl extension written in C and
FORTRAN. In my experience it is many times faster
than the equivalent routines written in pure perl.
It should be quick with 40,000-point datasets.
It should work perfectly the first time!  toma  [reply] 
Re: Seeking abnormalities in data sets.
by newbie00 (Beadle) on Dec 27, 2001 at 03:36 UTC

Hello.
First, in order to properly analyze your data, you must know, within an acceptable level of confidence, that the model you are using is the appropriate one, be it linear, exponential, or other.
For example, you can use the correlation coefficient for the linear model to determine whether enough of the error can be explained by that model to give you confidence that the correct model is being used (see a statistics book that covers linear and nonlinear regression techniques).
Without getting into too much detail, here is a crude method for when you don't have the background to analyze the data to the necessary degree. If you have a 'target' value for each point (e.g. in time, or other), and you don't want to accept data more than, say, +/- 3% off, you can calculate a 'band' around that 'target' data. Then you can plot your actual data along with these bands (you will have 3 curves, using point-to-point values rather than fitting a regression, especially if you don't have the tools or background to determine the actual regression model each time you collect the 40,000 data points) and visually inspect the data. If the actual data falls outside of this band, you may want to look at that particular data point a little closer. That does not mean automatically excluding it, unless you have enough info to support the exclusion. This method is, again, considered 'crude'.
You can use e.g. Microsoft Excel to import your data (e.g. using a comma-delimited format, which your Perl program can create for you); you can calculate your bands within Excel very easily (preferred, to keep the imported file size to a minimum) for plotting and/or analysis. This software has statistical routines built in. Plus, there is a book called "Microsoft Excel 2000 Formulas" by John Walkenbach (ISBN 0764546090) that may provide you with more info for that software. Of course, there are other stats books you can use with this software.
Be cautious in using crude methods: what I mean is, don't try to read too much into the results. These types of methods are often used to provide you with a 'direction', not conclusions.
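The band check itself is a few lines of Perl; the targets and tolerance below are made-up illustration values:

```perl
use strict;
use warnings;

# Crude band check: flag points that stray more than a given
# fraction (e.g. 0.03 for +/- 3%) from their target values.
sub outside_band {
    my ($actual, $target, $frac) = @_;   # array refs + tolerance
    return grep {
        abs($actual->[$_] - $target->[$_]) > $frac * abs($target->[$_])
    } 0 .. $#$actual;
}

my @target  = (100, 200, 300, 400);
my @actual  = (101, 198, 320, 399);      # third point is ~6.7% high
my @flagged = outside_band(\@actual, \@target, 0.03);
print "outside band: @flagged\n";        # prints "outside band: 2"
```

As the post says, a flagged index is a point to look at more closely, not one to discard automatically.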
Hope some of this helps.
Regards. newbie00
 [reply] 

Hello!
Well, IMHO what newbie00 said about calculating an acceptable band for analysed values is used rather in situations where you have an exact 'middle' value, which is the standard and required one. BTW, this is used in quality management.
If the model is linear this is OK, but otherwise the only way to decide if everything went OK is to prepare the curve computed by an exact (expected) model, estimate the acceptable difference, and compare those two curves ;)
IMHO Excel is only a workaround to visualize data, and AFAIK Excel _will_not_ cooperate so easily with anything other than Excel itself :( Besides, if you are using Excel, you have to decide yourself from a graph whether there are any abnormalities... Ha! So why do I need a Perl program?! ;> And if you don't need a graph, why use Excel? Excel is a solution of sorts; everything depends on what you _really_ need.
Best regards to everyone here. tmiklas
 [reply] 

I would caution against using Excel and a correlation coefficient.
The correlation coefficient (often referred to as r^2)
is relatively insensitive to variations in the data. A better
measure of goodness of fit is chi-squared, which you can
calculate when you do a least squares fit. Check out
Numerical Recipes,
chapter 15, for details on doing the calculations.
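For what it's worth, chi-squared is cheap to compute once you have a fit; the catch is that it needs an uncertainty estimate sigma_i for each measurement. A sketch, with made-up numbers:

```perl
use strict;
use warnings;

# Chi-squared goodness of fit: sum of squared residuals, each
# weighted by the measurement uncertainty sigma_i. Assumes you
# already have fitted values for each point.
sub chi_squared {
    my ($y, $yfit, $sigma) = @_;      # array refs of equal length
    my $chi2 = 0;
    for my $i (0 .. $#$y) {
        $chi2 += (($y->[$i] - $yfit->[$i]) / $sigma->[$i]) ** 2;
    }
    return $chi2;
}

my @y     = (1.0, 2.1, 2.9, 4.2);     # measured values
my @yfit  = (1.0, 2.0, 3.0, 4.0);     # values from the fitted model
my @sigma = (0.1, 0.1, 0.1, 0.1);     # per-point uncertainties
printf "chi2 = %.2f\n", chi_squared(\@y, \@yfit, \@sigma);
# prints "chi2 = 6.00"
```

As a rough rule of thumb, a decent fit gives chi-squared on the order of the number of degrees of freedom; a hugely larger value means the model (or a data point) is suspect.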
Scott
 [reply] 