|No such thing as a small change|
Evolving formulaeby tsee (Curate)
|on Feb 25, 2004 at 14:31 UTC||Need Help??|
If you look at formulae, algebraic expressions, the right way, you'll realize they're no more than pieces of data. They are 'functional' data that describes how one may go about doing something much like a cooking recipe. Just like an infinite number of monkeys with typewriters will eventually come up with a Shakespearean tragedy, we could try to have todays fast computers mutate a formula until it gets close to a predefined goal. A computer is probably a good approximation of a large number of monkeys anyway.
Of course, even a computer has limits and needs some pushing in the right direction. Enter evolutionary algorithms. The basic idea is to reward steps in the right direction without making small steps backwards completely contraproductive. More generally speaking, you mutate a whole matrix of formulas, judge the quality (or fitness) of a particular formula and reward it by copying it to adjacent matrix elements if its fitness is higher than that of its neighbours. Every such mutation of all matrix elements is called a generation. The program at hand pretty much does that: Mutate all formulas, find the best and reward them. With some twists and turns.
In order to work with formulas as data types, one needs some way of representing the formula as a data type a computer can work with. Naturally, one chooses to use trees after a moment of thinking because that's exactly how we evaluate a formula, too: 2 + 4 * 5 becomes
+ / \ 2 * / \ 4 5
And we calculate the result from bottom up: 2 + 4 * 5 = 2 + 20 = 22. So well, we have a matrix of trees of operators and numbers. The operators that may occurr in the formulas are all binary operators for now: Addition, subtraction, multiplication, division, and exponentiation.
So well, we have a data representation and now we need to find sensible operations on it that change it. Some operations should modify the formula significantly, but most should just change it a little bit. Here is a summary of which mutations I chose for the program at hand:
Fortunately, an implementation of the data type outlined above already exists on CPAN. Math::Symbolic does exactly that and then some. Furthermore, the module makes it easy for us to render the tree human-readable and to evaluate its value. It also has facilities of replacing a tree node with any other tree node which is required for the expanding and folding mutations. Further UI elements could make use of the fact that Math::Symbolic can convert formulas to LaTeX.
Using this module, the first implementation was quickly carried out in an afternoon. Unfortunately, the initial version did not offer much feedback about the progress and results.
In the current version, the program offers facilities to print images of the current matrix every generation. In these images, an area of n x n pixels corresponds to one cell or formula. The color of the area shows the cell's fitness with high fitness being lighter and undefined or extremely low fitness being colored with a signalling red. For example, divisions by zero would yield such a red cell. The algorithm that maps certain fitness ranges to a color still requires some tweaking and directly depends on the function that calculates the fitness of a formula.
After finishing the planned number of generations, the program prints a list of the formulas with the highest fitness.
After lots of general discussion, I'll present a simple example:
We search for a formula to calculate pi as accurately as possible. Thus, fitness is determined by a formula's distance from pi.
A sample run of 500 generations yielded the following best approximation: 3 + 1 / 10^(1426/1726) which beats the commonly memorized idiom of 22/7 in accuracy by about an order of magnitude. (22/7 approximates pi correctly to the third decimal.)
The code can be found at http://steffen-mueller.net/symevol/symevol.tar.gz and a GIF animation created from the output images can be seen here: http://steffen-mueller.net/symevol/animation.html (ca. 700kb for 25 seconds). Since the animation is rather fast, there's also a version running four times slower at http://steffen-mueller.net/symevol/animation_slow.html (also 700kb, but 100 seconds).
While the current version can only deal with constants and operators, it would be a sensible - albeit challenging - extension to include a limited set of variables in the mutations and judge the fitness of such formulas by their approximation of a function.
Furthermore, one could easily add a notion of persistence to continue simulations at any time.
The program could output the results as a nicely formatted LaTeX source file instead of just printing plain text.
(A copy of this text can be found at http://steffen-mueller.net/symevol/ )