I'm presently working on a programming project (of which the programming language shall remain nameless :-), and I've reached a point where my education and/or my creativity does not suffice to come to a satisfactory solution.

Dear friends! Say you are given a table with 4 columns. Each column represents a variable in a continuous mathematical function. So, say the function takes 3 independent variables and its result is the dependent variable. The function is one-to-one-to-one-to-one, or three-to-one, or something like that, so if you are given any three of the variables, you should be able to solve for a specific value for the 4th variable. Unfortunately, this function is not known to us. Instead, we are given a table containing a limited set of input and results of the function.

In case that description is not clear, hopefully this example will clarify. Say the table looks something like this:

```
x | y
-----
3 | 4
4 | 5
```

We do not know the function, but we do know that for those values of x, we get those values of y. However, we do not know what y is when x equals 3.5. A not-necessarily-accurate, but surely not unreasonable, estimate of the resulting value is 4.5: since 3.5 lies directly between 3 and 4, we take the value midway between 4 and 5 for y. Unfortunately, as we add more independent variables, the relationships among them become increasingly obscured by such a data set.
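The midpoint reasoning above is ordinary linear interpolation. A minimal sketch in Perl (the table values here are just the hypothetical x/y pairs from the example):

```perl
# Linear interpolation over a sorted table of [x, y] pairs.
sub interpolate {
    my ($x, @table) = @_;    # @table: [x, y] pairs sorted by x
    for my $i (0 .. $#table - 1) {
        my ($x0, $y0) = @{ $table[$i] };
        my ($x1, $y1) = @{ $table[$i + 1] };
        if ($x >= $x0 && $x <= $x1) {
            # Blend linearly between the two bracketing points.
            my $t = ($x - $x0) / ($x1 - $x0);
            return $y0 + $t * ($y1 - $y0);
        }
    }
    return undef;            # outside the table's range
}

my $y = interpolate(3.5, [3, 4], [4, 5]);
print "$y\n";    # 4.5
```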

So I think you can guess where I am going with this ...

Does anyone know of a method for coming up with such reasonable estimates for more complex functions and data tables, assuming such a method can reasonably be implemented in a language like Perl? The value 1 could be an estimate for any set of values, but something that tends to approach reality is certainly preferable. This problem busted my brain for a few hours before I decided to ask someone else.

One thing I tried: if none of the given points (each given by a set of input variables) from the table is close enough to the required point, the midpoints between the points are calculated and treated as new "given" points. This is repeated until something close enough arises. This wouldn't be bad if the number of midpoints didn't multiply so quickly. I tried limiting the number of "given" midpoints in each loop to those that are closest, but then issues with the domain arose: the desired point was no longer within the "given" set.

Basically, this comes down to drawing or estimating the curve of a continuous function given a set of points. The example I gave is a reasonable solution to the problem in two dimensions. Is there any good or decent way to expand this to more dimensions? 3, 4, 5, and so on? Visually, it doesn't seem like too difficult a problem, but mathematically, especially in the realm of programming, it may be.

Thanks for any thoughts, ideas, or insights!

Zenon Zabinski | zdog | zdog@perlmonk.org

Replies are listed 'Best First'.
Re: Estimating continuous functions
by Itatsumaki (Friar) on Mar 29, 2004 at 07:33 UTC

There are two general approaches to this kind of problem. You can either try to estimate the underlying function from the data that you are given, or you can simply try to generate a smooth curve between the data-points. There are (naturally) advantages and disadvantages to each approach. :)

Predicting the function is a more complex business, and usually requires some simplifying assumptions about the type of relationship between the dependent variable (the output) and the various independent variables (inputs). For instance, you can assume that the relationship is linear, or quadratic, or logarithmic, or... just about anything else. However that assumption generally has to be made up front.

You then perform a statistical operation called a "regression" on the data, using the relationships that you defined above. The regression attempts to estimate the "importance" of each parameter (as measured by a set of constant coefficients) by minimizing the deviations between the predicted and actual values. Actually you typically minimize the squared-deviations. To get an estimate of the "goodness-of-fit" you would take a look at the "residuals": how much of the actual data does your model explain?
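As a concrete illustration of minimizing squared deviations, here is a sketch of the closed-form least-squares fit for the simplest case, a single independent variable. (The two sample points are just the hypothetical x/y pairs from the original question; a real multiple regression needs matrix machinery beyond this sketch.)

```perl
# Ordinary least squares for one independent variable: the closed-form
# slope and intercept that minimize the sum of squared deviations.
sub linear_fit {
    my @points = @_;    # list of [x, y] pairs
    my $n = @points;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $p (@points) {
        my ($x, $y) = @$p;
        $sx  += $x;      $sy  += $y;
        $sxx += $x * $x; $sxy += $x * $y;
    }
    my $slope     = ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx ** 2);
    my $intercept = ($sy - $slope * $sx) / $n;
    return ($slope, $intercept);
}

my ($m, $b) = linear_fit([3, 4], [4, 5]);    # the data lie on y = x + 1
print "y = ${m}x + $b\n";
```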

This is really only a rough overview of regression. You have multiple independent variables, so you are going to want a "multiple regression". The biggest problem with regression is that it takes a lot of data to accurately estimate a sophisticated model. So, if you have n independent variables (dimensions), and you assume that (i) the variables are completely orthogonal (i.e. the independent variables are totally unrelated) and (ii) there is a linear relationship between each independent variable and the dependent variable, then you will need at minimum n independent observations to be able to estimate the model.

Overall, regression thus works really well when you have a lot of data, and can make reasonable estimates about the relationships between the dependent and independent variables. One nice thing about regression techniques is that they can (relatively) seamlessly handle categorical (non-numeric) data like {TRUE/FALSE} or {RED/GREEN/YELLOW}.

Your other option is to use an interpolating method. These methods do not attempt to directly model the interactions between dependent and independent variables. Instead, they simply attempt to find the best smooth-fitting curve for the data. The two most important methods in use today are cubic-spline interpolation and Fourier approximation.

Cubic splines essentially fit cubic functions to subsets of your data. That is, a separate cubic function is fitted (regressed, actually) between each pair of data-points. This regression uses some (but not all!) of the surrounding data points. So if you have a set of, say, 100 observations, you might estimate a cubic function between points 50 and 51 using only the data-points from 45 to 55. It turns out that this local interpolation handles discontinuities and large abrupt changes much better than global interpolations. Of course this can come at the price of much more computational overhead.
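To make the "local subset" idea concrete, here is a sketch of one local cubic scheme, the Catmull-Rom form. It is not a full natural cubic spline (there is no global tridiagonal solve), but each segment uses only its two neighbouring points, and it assumes sorted, roughly evenly spaced x-values:

```perl
# Local cubic interpolation (Catmull-Rom form): the segment between
# points $i and $i+1 uses only the neighbours $i-1 and $i+2, clamped
# at the ends of the table. Assumes [x, y] pairs sorted by x.
sub catmull_rom {
    my ($x, @pts) = @_;
    for my $i (0 .. $#pts - 1) {
        next unless $x >= $pts[$i][0] && $x <= $pts[$i + 1][0];
        my $p0 = $pts[ $i > 0 ? $i - 1 : $i ];
        my ($p1, $p2) = ($pts[$i], $pts[$i + 1]);
        my $p3 = $pts[ $i + 2 <= $#pts ? $i + 2 : $i + 1 ];
        my $t  = ($x - $p1->[0]) / ($p2->[0] - $p1->[0]);
        my @y  = map { $_->[1] } ($p0, $p1, $p2, $p3);
        return 0.5 * ( 2 * $y[1]
            + (-$y[0] + $y[2]) * $t
            + (2 * $y[0] - 5 * $y[1] + 4 * $y[2] - $y[3]) * $t ** 2
            + (-$y[0] + 3 * $y[1] - 3 * $y[2] + $y[3]) * $t ** 3 );
    }
    return undef;    # outside the data range
}

# On collinear data the cubic reduces to the straight line:
print catmull_rom(1.5, [0, 0], [1, 1], [2, 2], [3, 3]), "\n";    # 1.5
```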

Fourier approximation attempts to estimate your data with a set of sinusoidal functions (usually just sine and cosine). You can essentially approximate or fit any function (even whacked-out engineering things like the Heaviside step or weirdo physics stuff like the Dirac delta) with an infinite series of sinusoidal functions. The theory behind Fourier approximation is a bit complex, but at some level you can simply think of it as regressing a large set of sine/cosine functions.
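A hedged sketch of that idea, assuming evenly spaced samples over one period: compute the usual discrete sine/cosine coefficient sums, then evaluate the truncated series as a closure:

```perl
use constant PI => 4 * atan2(1, 1);

# Fit a truncated Fourier series to N evenly spaced samples y_0..y_{N-1}
# covering one period, via the standard discrete coefficient sums.
sub fourier_fit {
    my ($order, @y) = @_;
    my $n = @y;
    my @a = (0) x ($order + 1);
    my @b = (0) x ($order + 1);
    for my $k (0 .. $order) {
        for my $j (0 .. $n - 1) {
            my $ang = 2 * PI * $k * $j / $n;
            $a[$k] += $y[$j] * cos($ang);
            $b[$k] += $y[$j] * sin($ang);
        }
        $a[$k] *= 2 / $n;
        $b[$k] *= 2 / $n;
    }
    # Return a closure evaluating the series at t in [0, 1).
    return sub {
        my $t = shift;
        my $s = $a[0] / 2;
        for my $k (1 .. $order) {
            $s += $a[$k] * cos(2 * PI * $k * $t)
                + $b[$k] * sin(2 * PI * $k * $t);
        }
        return $s;
    };
}

# A pure sine wave is reproduced exactly by a first-order fit:
my @samples = map { sin(2 * PI * $_ / 8) } 0 .. 7;
my $f = fourier_fit(1, @samples);
printf "%.4f\n", $f->(0.25);    # ~1.0, the peak of the sine
```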

So... how are you supposed to choose between these alternatives? The first two questions you should ask yourself are:

1. Do you need to understand the dependency structure between your dependent and independent variables? Do you already have some understanding of it that you would like to incorporate into your estimation?
2. Do you simply need to interpolate amongst the data-points that you already have, or will you need to extrapolate to new and untested values of the dependent and/or independent variables?

If you wish to understand or exploit the dependency structure amongst variables, then you will want to use a regression approach. On the other hand, if you are primarily interested in interpolation, you will almost certainly just need a curve-fitting method. Your choice gets a little trickier if you have no desire/ability to understand the relationships amongst variables, but need to extrapolate: both approaches can be used in that case, but with serious concerns about accuracy.

If you are using a curve-fitting approach, cubic splines are much easier to implement yourself and much more intuitive than Fourier transforms. If your data is inherently oscillatory, however, you might get a lot of value out of learning the details behind Fourier approximation.

If that sounds like a whole bunch of "maybes" and "possiblies" that's what it is. There are tons of numerical methods, and there are lots of application-specific trade-offs that can be made on complexity, computational efficiency, generality, and so forth.

Finally, here are a bunch of links on this stuff you might find helpful. When I first learned numerical methods I used Chapra and Canale, which was a good general book that suited people w/o a lot of mathematical background. You might want to check that out.

Good luck!

-Tats

Update: s/durve/curve/; thanks to PodMaster

Re: Estimating continuous functions
by hossman (Prior) on Mar 29, 2004 at 07:38 UTC

As I understand it, this is generally considered a "doozy" of a problem. You have to accept that, for any set of points, there are an infinite number of curves that fit them, and except in simple cases, there's usually no way to determine that one curve is "more correct" than another -- and in the really simple cases, it's even worse (consider the points where y=cos(x) and y=sin(x) intersect, then consider how many even simpler curves all pass through those same points).

There do, however, seem to be a few papers online that discuss how you can approximate it if you're willing to make certain assumptions. The math is way over my head, but knock yourself out...

Re: Estimating continuous functions
by dragonchild (Archbishop) on Mar 29, 2004 at 12:33 UTC
It is a doozy, but it's also a very interesting doozy. There are many, many programs that have been written to do this. One of the best, imho, is Mathematica. (At least, it was in '93 when I was an aspiring math major.) If you can use it, I'd suggest looking at Math::ematica. (If you don't have it and you're being paid for this work, indicate to your paycheck source that they can shell out $ for Mathematica or $$$ for you to replicate it.)

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

Re: Estimating continuous functions
by tilly (Archbishop) on Mar 30, 2004 at 03:16 UTC
Several people have pointed out that this is a very hard problem to get right. There is no perfect answer, and tremendous amounts of energy have been spent on finding pretty good ones.

But nobody has given you a simple-to-implement "OK" answer. Not great. Just something that can readily be implemented and gives an answer you can defend as somewhat reasonable. Depending on the application, this is often enough.

Here is an outline of one.

Let's say that the last variable is a function of the rest; call it f. Make f(x_1, x_2, ..., x_(n-1)) the weighted average of the known values of f at the points that you have. A reasonable weighting is to weight each point by 1/(square of the distance from that point to the spot you're estimating). Then you can do something like this (with some error checking added, of course...):

```
sub make_estimator {
    return sub { 0 } unless @_;    # You might want to die?
    my @points = @_;
    return sub {
        my $weight_total;
        my $weighted_sum;
        foreach my $point (@points) {
            my $dist_squared = 0;
            foreach my $coord (0 .. $#_) {
                $dist_squared += ($point->[$coord] - $_[$coord]) ** 2;
            }
            if (0 == $dist_squared) {
                return $point->[-1];
            }
            else {
                my $weight = 1 / $dist_squared;
                $weight_total += $weight;
                $weighted_sum += $weight * $point->[-1];
            }
        }
        return $weighted_sum / $weight_total;
    };
}

my $estimator = make_estimator(
    [3, 4],
    [4, 5],
);

print $estimator->(3.5);
```
This works perfectly well. The resulting function is trivially smooth everywhere except possibly at the estimating points, and I'm positive that it is smooth there as well. It isn't linear or particularly simple, though.

Feel free to play with the weighting function.

Update: I verified it. The function is also smooth at the estimating points. Its derivative at all estimating points is 0 (ie each is a maxima, minima, or inflection point).

Cool.

I guess that's the essence of the two-dimensional example that I gave. The difference is that the program does not have to determine which points it takes into account: it just uses all of the data points, and the weighting allows it to do so. That's also its weakness in some ways, correct? That it considers all data points to an extent, no matter how removed they are... or is it for some other reason?

The fact that it isn't linear isn't necessarily a bad thing, though, is it? Depends on the type of function, I guess. I suppose another aspect that adds to its simplicity is that it doesn't require you to select whether the function is linear, quadratic, logarithmic, etc, before approximating it.

When (if?) I get a chance, I'll generate some graphs using the data set (which I actually don't have yet :-), varying the weighting, to see how the estimation looks visually. Perhaps I'll post a sample of them up here later on.

Thanks!

Zenon Zabinski | zdog | zdog@perlmonk.org

I'd say that its biggest weakness is probably the opposite. It tends to overuse the nearest available data point, resulting in flat sections and fairly sharp transitions from one point dominating to the next. This results in functions that don't look very reasonable.

Playing around with the weighting should let you achieve an acceptable trade-off between one point dominating and distant points having too much of an impact. The choice is going to be empirical, however; there isn't a "best answer" to this problem.

Furthermore, if you know anything about what your underlying function looks like, you would probably do a lot better to use more traditional estimation techniques. In particular, if you can get samples on some kind of useful grid, one of the usual curve-fitting algorithms will be easy to calculate and should give excellent results. Two standard kinds of often-used curve-fitting algorithms are polynomial (e.g. cubic splines) and wavelets.
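One way to "play with the weighting" is to generalize the inverse-squared-distance weights into Shepard's method with a tunable exponent. A sketch (exponent 2 reproduces the weighting used in the estimator above; larger exponents make the nearest point dominate more strongly):

```perl
# Inverse-distance (Shepard) weighting with a tunable exponent $p.
sub shepard {
    my ($p, $query, @points) = @_;    # points: [x_1, ..., x_n, value]
    my ($wsum, $vsum) = (0, 0);
    for my $pt (@points) {
        my $d2 = 0;
        $d2 += ($pt->[$_] - $query->[$_]) ** 2 for 0 .. $#$query;
        return $pt->[-1] if $d2 == 0;    # exact hit on a data point
        my $w = 1 / $d2 ** ($p / 2);     # i.e. 1 / distance**$p
        $wsum += $w;
        $vsum += $w * $pt->[-1];
    }
    return $vsum / $wsum;
}

print shepard(2, [3.5], [3, 4], [4, 5]), "\n";    # 4.5, as in the thread
```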

That's pretty cool. That seems almost like a loess smoother, doesn't it?

Re: Estimating continuous functions
by zentara (Archbishop) on Mar 29, 2004 at 16:18 UTC
I'm trying to "pull a rabbit out of my hat" here, so read this with that in mind.

My first thoughts were Runge-Kutta methods. Do a google search for Runge-Kutta. I don't know how advanced your math skills are, (and mine are a bit rusty).

But what I would do is try to develop a set of differential equations that describe the differences between adjacent table entries. These can be as long as you need, and can involve "higher-order differentials" with linear constants. Then try to combine the set of equations and maybe integrate the whole thing to get a function out of it.

It may only give you a segmented function, like "this if between 3 and 4" and "that if between 4 and 5", etc.

I'm just guessing from your limited description, but you could setup equations with elements like

```
w1 = w0 + a(dx/dw) + b(dy/dw) + c(dz/dw)
        + d(dx2/dw2) + e(dy2/dw2) + f(dz2/dw2)
w2 = w1 + a(dx/dw) + b(dy/dw) + c(dz/dw)
        + d(dx2/dw2) + e(dy2/dw2) + f(dz2/dw2)
..etc..
```
where w1, w2, etc. are known values from the table. Then try to solve for a, b, c, d, e, f and integrate the whole mess. I've only shown up to the second-order differential, but you can go as high as you need for accuracy. (I think the space shots use a "4th-order Runge-Kutta".) That is, they evaluate the velocity, the rate of change of velocity (acceleration), the rate of change of acceleration, and the rate of rate of change of acceleration. They do this for all of x, y, z. They have to account for time, and the rotation of the planet, too.

The Runge-Kutta method assumes you know the function to get the results. What I'm suggesting is to "reverse engineer" the equation. But that's what makes computers so powerful... you can make a big sloppy list of equations to run through to calculate a value, and the computer doesn't care.

Not that it will solve your problem, but there is a Perl module for Runge-Kutta, Math::RungeKutta; here is the README. The examples in the module are very cool; one is an ASCII animation of planetary orbits.

The problem with solving these types of problems is the "time factor": if you think long enough about it, you will eventually come to the best answer. The trouble is it may take "years of contemplation", which doesn't cut it for homework assignments. :-)

I'm not really a human, but I play one on earth. flash japh
Your math is definitely rusty.

Runge-Kutta techniques are useful for solving a known set of differential equations numerically. This does not give you a convenient way to figure out which set of differential equations will give you a known set of points. Worse yet, many sets of equations will give you those points, and Runge-Kutta offers you no way to decide between them.

Thus while the similarity of certain key-words might make you think that there is a useful connection, there really isn't one.

Sorry...

Isn't Runge-Kutta a method for numeric integration of known curves? I think essentially you're suggesting regressing a high-order Runge-Kutta and then minimizing the differential area. That's an interesting approach, but I suspect it's pretty computationally slow compared to directly fitting splines or something like that. I suspect the OP would be better off doing direct interpolation instead, especially because Runge-Kutta won't allow for extrapolation beyond predicted values.

One other note: remember that if you try to estimate a high-order equation without a lot of input data, you can run into over-fitting problems. Most numerical techniques get much more accurate (and slower) with lots of data.

Runge-Kutta is a method for solving differential equations numerically. However, there is a mapping from integrals to differential equations, so in that respect Runge-Kutta is also a method for solving integrals numerically.
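To make that distinction concrete, here is a sketch of the classic 4th-order Runge-Kutta step: it advances a *known* equation dy/dx = f(x, y) numerically, rather than inferring an equation from data points.

```perl
# One classic 4th-order Runge-Kutta step for dy/dx = f(x, y).
sub rk4_step {
    my ($f, $x, $y, $h) = @_;
    my $k1 = $f->($x,          $y);
    my $k2 = $f->($x + $h / 2, $y + $h * $k1 / 2);
    my $k3 = $f->($x + $h / 2, $y + $h * $k2 / 2);
    my $k4 = $f->($x + $h,     $y + $h * $k3);
    return $y + $h * ($k1 + 2 * $k2 + 2 * $k3 + $k4) / 6;
}

# dy/dx = y with y(0) = 1 has the exact solution e^x; integrate to x = 1:
my ($x, $y, $h) = (0, 1, 0.1);
for (1 .. 10) {
    $y = rk4_step(sub { $_[1] }, $x, $y, $h);
    $x += $h;
}
printf "%.5f\n", $y;    # ~2.71828
```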

thor

Re: Estimating continuous functions
by hardburn (Abbot) on Mar 29, 2004 at 14:22 UTC

Not having as strong a math background as I would like, I jump straight to a (possibly) practical solution: Neural Nets. It'll probably either get close to the real function, or it'll generate a confused mass of code much like the one you say you have now :) If nothing else, it might be a fun diversion.

----
: () { :|:& };:

Note: All code is untested, unless otherwise stated

Or write a Genetic Algorithm (using AI::Genetic for example). Use a variety of functions as genes from which to build the genome of each 'individual'. The fitness test could use least squares.
The problem with neural nets is that they tend to be black boxes: though they might approximate your function, they can't easily tell you what it is. Unless you implement a *really* big one that can parse and generate natural language as well as your function ;)
Re: Estimating continuous functions
by husker (Chaplain) on Mar 29, 2004 at 17:13 UTC
Two "lectures" on curve-fitting:


Note that both of these use matrix algebra, in which case Math::Matrix may ride to your rescue.

Re: Estimating continuous functions
by flyingmoose (Priest) on Mar 29, 2004 at 16:06 UTC
What kind of function (or curve, whatever) you fit depends on your data. Can you give us a sample data set so that we can suggest appropriate regressions? If it does not follow a set order, you might have to fit a very large polynomial to it, and even then it may not be accurate outside the range of the data set. Since a piecewise function is still technically a function, fitting cubic splines may be enough. It depends on the data model and what you are trying to model. Either way, there are a lot of good web references on the requisite math, so with a better understanding of the problem space, we can give you better tips.
Re: Estimating continuous functions
by johndageek (Hermit) on Mar 30, 2004 at 02:14 UTC
Love the problem!

Just a couple of twisted thoughts.

Given a set of data, our first move would be to apply a set of operators to the independent variables, such that the result equals the dependent variable for one line of input data.

Then we apply the formula to the following lines of input, one after another until we get a set of data that fails.

Having successfully negotiated all of the given data, do we assume (yes, yes, I know) the formula is the correct one? Or do we need to attempt to find all (sigh) possible formulas that will work when applied to the data set?

Either way, we are left with the sticky problem of trying to know all possible formulae that will solve a given set of data for a given set of answers. This begins to smack of cryptography.

Some series may not play fair, such as:

```
1 1 2 3 5 8 13 21
o t t f f s s e
```

Would the following fit in as a possible data set?

```
x y z | D
------+--
1 4 1 | 3
4 1 5 | 4
1 5 9 | 5
5 9 2 | 6
```

The "formula" to get x, y, and z needs to be blown out for each value:

```
int( ($pi*10**($d-3) - int($pi*10**($d-3))) * 10**3 )
```
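For what it's worth, the formula really does pull three consecutive digits of pi out of $pi for each row (assuming $pi holds enough precision for the positions used). A quick sketch:

```perl
# Evaluate the digit-extraction formula for D = 3..6: each iteration
# shifts pi left by ($d - 3) decimal places, keeps the fraction, and
# reads off the next three digits.
my $pi = 4 * atan2(1, 1);
for my $d (3 .. 6) {
    my $shifted = $pi * 10 ** ($d - 3);
    my $frac    = $shifted - int($shifted);
    print int($frac * 10 ** 3), "\n";    # 141, 415, 159, 592
}
```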

Thanks for a fun problem and headache! Good luck.

Enjoy!
Dageek
Re: Estimating continuous functions
by zentara (Archbishop) on Mar 30, 2004 at 13:54 UTC
"Hey Rockie, watch me pull a rabbit out of my hat...oops wrong hat". :-)

If the simultaneous differential equations approach didn't fly well, how about using "Fourier transformations"? Since no one has mentioned it yet.

I had a teacher once who claimed you could use fourier analysis to get an "equation of your handwritten signature". He never demonstrated it, but it would seem that it would be done from taking datapoints of the signature.

I just did a groups.google.com search for "fourier equation from data points" and it seems to be a well discussed problem. One of the solutions said to use "Legendre polynomials."

I'm not really a human, but I play one on earth. flash japh
Hmm, that's kind of interesting. Pretty cool.
Re: Estimating continuous functions
by tsee (Curate) on Apr 02, 2004 at 17:31 UTC

If you're trying to approximate a function f: R -> R with a polynomial, which is a pretty limited approach but usually a good start, you could have a look at Math::Approx and Math::Approx::Symbolic.

Steffen

Re: Estimating continuous functions
by Dr.Altaica (Scribe) on Mar 31, 2004 at 22:02 UTC
This sounds perfect for Bayesian networks. While it is fairly easy to find implementations of Bayesian networks, most are discrete, not continuous like you want.

As hossman said: "...for any set of points, there are an infinite number of curves that fit them, and except in simple cases, there's usually no way to determine that one curve is "more correct" than another..."

With Bayesian dependence modeling you can compare the probabilities of two models. If you know the probability of your data, you could get the actual probability of the model. One of the best places to read more is the B-Course library at http://b-course.hiit.fi/library.html

```
|\_/|
/o o\
(>_<)
 `-'
```