Beefy Boxes and Bandwidth Generously Provided by pair Networks Frank
Perl Monk, Perl Meditation
 
PerlMonks  

Empirically solving complex problems

by oakbox (Chaplain)
on Mar 05, 2005 at 06:01 UTC ( #436861=perlmeditation: print w/ replies, xml ) Need Help??

I took calculus as part of my electrical engineering degree. The irony of majoring in EE is that I'm not that hot at math, at least abstract math. I like hard numbers that represent real things. The professor explained it like this. He graphed a function on the board and started working out how to find the area below the curve. As an aside, he started to explain that a chemist, confronted with the same problem, would simply draw the graph out 10 times and weigh the resulting graphs to arrive at the answer. Depending on the needed accuracy, that chemist might perform the graph and weigh 100 or 1000 times. After he explained this, I thought to myself, 'crap, I guess that means I should have gone with chemistry'.

Recently, I came up against a real life situation involving complex curves and the area under them. I dutifully combed the internet and dug out my old probability and statistics text book to figure out the answer to the question, "How do I find the union of two normal distributions?" (Normal distribution - the famous "bell curve").

I found many many texts, some of them starting out with just the basics and working their way up. I found lot's of complex formulas that did not do quite what I was looking for, formulas that looked like they were written in sanskrit. After 5 or 6 hours of frustration, I decided to look at the problem from a chemist's point of view: empirically.

Empirically comparing normal distributions: A normal distribution is a graph of probabilities, really just an array. Perl is great for handling this stuff, so I wrote a method that converts a mean and standard deviation into an array plotting out what a population of 100 would look like. I did this again for the second mean and SD and then compared the results. If a point appeared in both arrays, it was counted. Then, a simple look at the relationship between the number of 'common' points to the population and presto! A simple percentage of similarity!

As an added bonus, I could increase the population to arbitrarily high numbers (1000 or 10000 points) to get better resolution on the graphs and hence, more accurate percentages. I would make my script plot out the 100's of graphs and weigh them for me.

A quick search found me a GPL'd method for finding the percentage of a population at a datapoint based on the mean and SD:

sub int_gen_curve { my ($self, $x, $mean, $sdev ) = @_; my $pi = 3.14159265358979323844; return 1 / ( sqrt( 2 * $pi * $sdev * $sdev )) * exp( -($x-$mean)*($x-$mean) / ( 2 * $sdev * $sdev )) +; }
Where '$x' is a point on the x axis. Next, I wrote a simple routine for turning my mean and standard deviation into an array:
sub compare_bell_curves my ($self,$m1,$sd1,$m2,$sd2) = @_; my $upperbound = sprintf("%.1f", ($m1 + (1.75 * $sd1))); my $lowerbound = sprintf("%.1f", ($m1 - (1.75 * $sd1))); my $area = 10000; my $data; for (my $x=$lowerbound; $x < $upperbound; $x = $x + 1) { $x = sprintf("%.0f", $x); my $posarg = $self->int_gen_curve($x,$m1,$sd1); $data->[$x] = int($area*$posarg); } # do the above again for the $m2 and $sd2 # then subtract $data1 from $data, any $data->[x] # that remains positive is a point of difference
In the end, I just had to count the positive points in $data after the subtraction and find out what the percentage relationship was to the population. If there would be NO positive points ($data1 removed all of the points from $data) I would have two equal distributions.

Perl allowed me to reduce a quite complex problem into it's basic elements and solve it using real numbers, empirically. As an added bonus, I had datasets that could easily be plugged into GD::Graph, to give that extra, visual, representation of what the data said.

Does anybody else think like this, or am I just kooky?

Comment on Empirically solving complex problems
Select or Download Code
Re: Empirically solving complex problems
by Zaxo (Archbishop) on Mar 05, 2005 at 06:53 UTC

    There's an easier way. Integrals are additive. The integral of the sum is equal to the sum of the integrals.

    Abstraction is not useless.

    After Compline,
    Zaxo

      Hmm... If he wants to find the area of a union between two graphs, what you say is not entirely applicable. The area under the union is NOT the areas of each of the graphs summed, because there are overlapping portions that will be added twice.

      What you said is true, if g is one function and f is another, then I(f+g) = I(f) + I(g) but he needs a union - not a sum. And for linear functions, F(union of A and B) = F(A) + F(B) - F(intersection of A and B)

Re: Empirically solving complex problems
by BrowserUk (Pope) on Mar 05, 2005 at 08:57 UTC
    Does anybody else think like this?

    Yes!

    And I have the same feelings about a lot of other stuff, including most of CS.

    A problem described can be related to Knapsack or The Travalling Salesman Problem--correctly described as NP-hard or NP-complete--but it can still be solved in real time using Perl when the CS majors give up. I've done this many times.

    Graph theory is, IMO, an abstraction too far. Graph terminology--nodes, edges, traversal etc.--in many cases, simply serves to obscure the details of the problems they describe, and in doing so, maks things which are straight forward to understand and implement, hard or impossible to do, because you can't see the trees cos the forest is in the way:)


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
      A problem described can be related to Knapsack or The Travalling Salesman Problem--correctly described as NP-hard or NP-complete--but it can still be solved in real time using Perl when the CS majors give up. I've done this many times.

      The thing about NP complete problems is that they don't scale. It could be you can still find solutions for, for example, 20 nodes, while for just a few more nodes, solving the same problem takes many times longer, so long that it's no longer practically usable. Your particular problem may take a few minutes to solve, while the slighhtly more complex problem takes hours, or days.

        I think that BrowserUK's point is that a problem can be related to a known hard problem, but be very doable. Or there is a variation on it which is what you really need that is doable. The CS major looks at the problem and knows the theory about why it is hard. But if you just try it, often you'll get an answer.

        Sometimes you won't. Often you'll get a fast answer to the wrong problem. But success comes often enough that it is often worth not giving up too easily.

Re: Empirically solving complex problems
by zentara (Archbishop) on Mar 05, 2005 at 13:20 UTC
    He graphed a function on the board and started working out how to find the area below the curve. As an aside, he started to explain that a chemist, confronted with the same problem, would simply draw the graph out 10 times and weigh the resulting graphs to arrive at the answer.

    Hmm that's interesting, because it ininsinuates that the technique can't be done in mathematics...but it can, and it is the basis of calculus...limits.

    So how did the "first mathemeticians", who didn't have the calculus formulas, figure out the area under the curve? Well, they did it just like the chemists. He divided the area under the curve into tall thin rectangles, of width x, and took the average value of the 2 y values at the top, got the area of the rectangles, and summed them up.

    So then someone asked, "how can we get more accuracy?". They figured out as you narrowed the x width, the average y value at the top of the rectangle, "converged" to the same value. THE LIMIT.

    It didn't take long for them to come up with the dx and dy notations, and to put it into formulas.

    So even today, there is a whole branch of engineering called "applied numerical analysis", where you write programs exactly as described above. If your equation, or conditions, are so complex, that they can't be put into a simple formula, you break it up into intervals, either spatial or time, and compute the value of each interval separately, then sum up the results. Computers are so fast, it is usually the fastest way to model something.

    I was in EE, loved the math, but dropped out to ponder the "meaning of it all" :-) Check out this story about how some people actually can see numbers as shapes....it makes you wonder about how puny our minds really are. Math Genius


    I'm not really a human, but I play one on earth. flash japh
      I was in EE, loved the math, but dropped out to ponder the "meaning of it all"

      funny, I dropped out of EE to sell porn on the internet too.
      So how did the "first mathemeticians", who didn't have the calculus formulas, figure out the area under the curve? Well, they did it just like the chemists. He divided the area under the curve into tall thin rectangles, of width x, and took the average value of the 2 y values at the top, got the area of the rectangles, and summed them up.
      While conceptually I have no problem with this explanation, please do not confuse the concept with the historical approaches to finding the area inside a curve.
      -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
        Yeah, that just goes to show, how "our world vision" is shaped by our educational processes. In my school, the origin began with DeCartes(with only footnotes to the Greeks).

        I'm not really a human, but I play one on earth. flash japh
Re: Empirically solving complex problems
by trammell (Priest) on Mar 05, 2005 at 18:37 UTC
    A similar approach is "Monte Carlo" integration--basically you generate random x,y coordinates and calculate whether they are in the limit of the area you're measuring. The ratio of the number of points inside your area to the total number of points is a measure of the ratio of areas. Very useful for integrating complicated functions.
Re: Empirically solving complex problems
by chas (Priest) on Mar 05, 2005 at 22:20 UTC
    I'm a mathematician (teach statistics among other things), but I really don't understand what you mean by: "union of two normal distributions." (what does "union" mean in this context?) Distributions are probability measures and the sum of probability measures is not a probability. If you mean that you want to find the distribution of the sum of 2 normal random variables, then that can't be done unless you know the joint distribution; if, for example they are independent (so the joint distribution is a product), then the sum is again normal with mean equal to the sum of the means of the rvs and variance equal to the sum of the variances. From the code you exhibited, it looks like that's what you meant, but then that's the answer, and I don't see what else you want to do.
    (Update: Actually, after looking at what you did some more, I think maybe you didn't want the distribution of the sum...I guess I don't understand...sorry.)
    If you can clarify, I'll try to give a better answer.
    chas
      Let me be specific :)
      I have built a psychometric testing system. Let's say I have a test that measures only one trait, Extroversion, for example. (tests usually measure several traits at the same time, but I'll try to keep this to the essentials). Now, let's say that I measure a big group of people; men, women, lawyers, sales people, technologists, various education levels, etc. And I am, over time, able to come up with some 'norms' for the various groups. The mean and standard deviation of 'Extroversion' for the group 'Lawyers' is different than the mean and standard deviation of 'Extroversion' for the group 'technologists'. (Again, this is massive oversimplification, most of psychometric test validation is in the searching for which groups differ significantly from each other)

      NOW, a group of people inside ABC Corp take this test and I'm able to derive a mean and standard deviation of 'Extroversion' scores for this new group. The goal of the excercise here is to find out which of MY norms most closely matches the ABC Corp group.

      This all leads to the purpose of my matching script. I have two groups, represented by a mean and standard deviation, and I have to find out HOW alike they are. Preferably, I want to end up with an easy to understand representation of that match, a percentage, for example.

      In the case of find out how a particular individual's scores match up, that is trivial, I actually use stanine representations of the scores. (the 1.75 sigma boundaries aren't arbitrary, they are the upper limit of a score of 8 and the lower limit of a score of 2 in the stanine scale)

        I'm certainly no expert on that kind of problem. You could search on the web for the subject "Gaussian Mixtures" and might find some relevant information. (In any case I understand what you are doing now; sorry about the confusion.)
        chas
        It sounds like what you are after is a standard statistical method called "t-test". You feed it two distributions, and it tells you how alike the two distributions are. In fact, it's built into excel (called simply "TTEST").
        ------------ :Wq Not an editor command: Wq
      It looks as though what oakbox is after is the integral of the function:
      min(p1(x),p2(x)) dx
      where p1 and p2 are the two probability distributions. That is, he's trying to find the area under both curves. Now, this isn't actually all _that_ hard, though the answer will include some calls to erf.

      Let's see.... (ten minutes of scribbling on paper later, accompanied by some looking up of things Mathworld)

      Ok, well, it's ugly, but this _should_ get the same results as the given procedure:

      use Math::Libm qw(erf erfc M_SQRT2); sub compare_bell_curves { my ($self,$m1,$sd1,$m2,$sd2) = @_; if ($sd1 > $sd2) { ($m1,$sd1,$m2,$sd2)=($m2,$sd2,$m1,$sd1); } elsif ($sd1 == $sd2) { # stupid corner case my $dist = abs($m1-$m2)/$sd1; return erfc($dist/2/M_SQRT2); } $m2 -= $m1; $m1 = 0; # Some terms omitted since $m1 = 0 my $sd2s= $sd2*$sd2; my $sd1s= $sd1*$sd1; my $A = ($sd2s - $sd1s); my $B = 2*($m2*$sd1s); my $C = 2*(log($sd1)-log($sd2))*$sd1s*$sd2s - $m2*$m2*$sd1s; my $disc = $B*$B - 4*$A*$C; my $rdisc = sqrt($disc); my $lower = (-$B - $rdisc)/(2*$A); my $upper = (-$B + $rdisc)/(2*$A); my $p1 = 0.5 + erf(($lower-$m2)/$sd2/M_SQRT2)/2; my $p2 = (erf($upper/$sd1/M_SQRT2)-erf($lower/$sd1/M_SQRT2))/2; my $p3 = erfc(($upper-$m2)/$sd2/M_SQRT2)/2; $p1+$p2+$p3; }
      Note that it took much, much longer to write this note and get the code working than to do the math. (Mostly, that was tracing down transcription errors in going from paper to code) The math itself was a matter of finding the intersections (which boils down to just solving a quadratic equation in x, albeit with messy coefficients), and then using the fact that the cumulative distribution function for a normal distribution is as given in equation 9 of http://mathworld.wolfram.com/NormalDistribution.html.

      True, there are many problems which cannot be solved or even vaguely approached analytically, but this isn't one of them.

      -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
        True, there are many problems which cannot be solved or even vaguely approached analytically, but this isn't one of them.

        There are more problems that cannot be solved by or even vaguely approached analytically by a given person than ones that cannot be solved by or even vaguely approached analytically. I think that the original poster laid out a pretty good case that he couldn't tackle this analytically, while leaving it open as to whether someone else might succeed in doing so.

        Therefore your success with an analytical approach takes nothing away from the value of the non-analytical approach in this case.

Re: Empirically solving complex problems
by jhourcle (Prior) on Mar 06, 2005 at 14:28 UTC

    As a Civil Engineering major, this was the way in which we solved for the area under any curve, if we didn't have a way to take the integral. I think the class was called 'Numerical Analysis' or 'Numerical Methods' or something like that. And of course, we used Fortran for all of it.

    Anyway, from $min to $max, given function F() and $number_of_steps, assuming it never goes negative, the integral (area, assuming it stays positive) will always be:

    $step_size = ($max - $min) / $number_of_steps; my $total = ( F($min) + F($max) ) / 2; foreach ( 1 .. $number_of_steps-1 ) { $total += F( $min + ($step_size * $_) ); } my $area = $total * $step_size;

    You would then repeat the process, each time increasing $number_of_steps until $area varies by less than whatever your required precision is.

    The only problem with fitting this to your problem is to come up with the correct F() that describes the curve that you're trying to solve for. (Of course, solving for the overlapping area of two normal curves was handled in LRFD (Least Resistance Failure Design, where you assume that the final strength of a structure is represented by a normal distribution, and the final loading is also a normal distribution, and you attempt to get the chance of failure to a particular chance... we used a series of tables for all of that, though))

Re: Empirically solving complex problems
by spongeboy (Novice) on Mar 06, 2005 at 23:45 UTC
    Slightly OT: I remember reading about a Real World empirical way of solving quadratic equations. Basically it involves a fishtank, a balance beam and some uniform shapes (a sphere, a cylinder, a cone and a parabolic cone).

    For 3x^2 + -1x + 4 = 0, the user set up, hanging each by a thread -
    -the sphere at position 4 on the beam
    -the cylinder at position -1 (with 0 being the fulcrum)
    -the cone at position 3

    Then add water and adjust strings such that sphere is submerged and the cylinder and cone are only just touching the water, and beam is balenced. Mark level of water.

    Then add water. The beam will become unbalenced. Keep adding water until it balences again. The difference in water level gives a solution for x.

    I can't remember the name of the book, but it had several other examples of analog devices.

Re: Empirically solving complex problems
by fizbin (Chaplain) on Mar 07, 2005 at 14:51 UTC
    As I've said elsewhere, I don't think this does quite what you think it does. What measure is it you want again? I assumed from your description that you want the area under both the curves, but that's not really what you're computing - it looks like you're computing the area underneath distribution 1 but above distribution 2.

    At the moment, among other things, your solution gives different answers if I switch distribution 1 and distribution 2. That doesn't seem right, given what you've said about the particular problem.

    Also, I should note that multiplying by 10000 doesn't help you any - you'd get the same results if $area were "1", since the computer is using floating point variables in these computations.

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: Empirically solving complex problems
by Anonymous Monk on Mar 07, 2005 at 23:24 UTC

    You're not kooky if this works for you.

    Based on your descriptions of the problem, and your statement that this is indeed multivariate rather than univariate, it appears that you want to develop a classifier (i.e. pattern-recognition). You would then test that classifier with input data. One easy way to do so is a statistical construct called a support-vector machine (SVM). They can be implemented almost trivially in the e1071 package for R. An SVM implementation would have the advantage of allowing you to classify individuals rather than groups.

    HTH

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://436861]
Approved by kvale
Front-paged by Courage
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (11)
As of 2014-04-16 21:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (434 votes), past polls