A Beginning Guide To Evolutionary Algorithms
by blokhead (Monsignor) on Oct 13, 2003 at 16:59 UTC
What Are They?
In an evolutionary algorithm (EA), we search the solution space (the set of all possible inputs) of a difficult problem for the best solutions, but not naïvely, as in a random or brute-force search. We use some biological principles to help guide the search:
Armed with just these principles, you could implement your own rudimentary (and working) EAs. You may already have implemented something like this before. However, as they say, the devil's in the details. It's important to understand how the implementation details affect your EA.
Why/When Use EAs?
(Adapted from evoweb's take on the subject)
The Basics:
In EAs, we work with a population of individuals, data structures that represent elements of the problem's solution space. When we talk about representation, we mean the type of internal data structure the EA uses to store the individuals (array, string, tree, neural net, etc). Representation does matter, but for the scope of this document's examples, strings and arrays will suffice. As they are easily available as native data types in any sane language, they are much easier to implement and conceptualize. The actual encoding of a solution into a data structure is called the individual's gene. We'll talk a little more about representation later.
Fitness:
If our goal is to find individuals which represent "good" solutions, we should probably be a little more specific about what we mean by "good." We must have a way of scoring an individual's effectiveness for the problem. We call this measurement the individual's fitness (as in survival of the fittest). The fitness measure should reflect the characteristics you desire in a "good" solution, so the higher an individual's fitness, the better it demonstrates the traits you want. The fitness measure is always dependent on the representation you use. It sets your EA's goal.
Commonly, fitness is just a function of the individual's gene data structure. However, the fitness measure need not be a true function in the mathematical sense. It might be probabilistic, or it might also depend on other members of the population. It also often involves a model or simulation of the problem, executed with the individuals of the population.
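To make the three flavors concrete, here's a minimal Python sketch for bit-string genes. The function names and the Gaussian noise model are illustrative assumptions, not something from the guide: `count_ones` is a true function of the gene, `noisy_fitness` is probabilistic, and `relative_fitness` depends on the rest of the population.

```python
import random

def count_ones(gene):
    """Deterministic fitness: the number of '1' characters in the gene."""
    return gene.count("1")

def noisy_fitness(gene, rng, noise=0.5):
    """Probabilistic fitness: the deterministic score plus Gaussian noise.

    Two calls on the same gene can return different values, so this is
    not a function in the mathematical sense.
    """
    return count_ones(gene) + rng.gauss(0.0, noise)

def relative_fitness(gene, population):
    """Population-dependent fitness: the fraction of the population that
    this gene strictly outscores under count_ones."""
    beaten = sum(1 for other in population
                 if count_ones(gene) > count_ones(other))
    return beaten / len(population)
```

Note that with `relative_fitness`, the same gene can score differently from one generation to the next simply because its neighbors changed.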
The Process:
The most basic evolutionary algorithm pseudocode is rather simple:
The exit criteria set the target for the fitness measure, but usually also include an upper limit on the number of iterations, in case the evolution gets "stuck." Typical exit criteria might be: "stop when some individual achieves a fitness of 100, or when we have iterated 10,000 times." We'll talk more about evolution getting "stuck" later. Sticking with the biology jargon, each iteration of the loop is called a generation.
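One way to sketch such a loop with both exit criteria in Python is shown below. This is only a skeleton under stated assumptions: `evolve` and the callables it takes (`fitness`, `select`, `breed`, `replace`) are hypothetical names that the caller would have to supply.

```python
def evolve(population, fitness, select, breed, replace,
           target_fitness, max_generations):
    """Run a generic EA loop until one of the exit criteria is met.

    All callables are hypothetical placeholders supplied by the caller:
      fitness(ind) -> number
      select(population, fitness) -> list of parents
      breed(parents) -> list of offspring (recombination + mutation)
      replace(population, offspring, fitness) -> new population
    Returns (best individual, number of generations used).
    """
    for generation in range(max_generations):
        best = max(population, key=fitness)
        if fitness(best) >= target_fitness:
            return best, generation          # exit: target reached
        parents = select(population, fitness)
        offspring = breed(parents)
        population = replace(population, offspring, fitness)
    # exit: iteration cap hit, so return the best found so far
    return max(population, key=fitness), max_generations
```

Keeping the loop separate from the operators means you can swap selection, breeding, or replacement schemes without touching the exit logic.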
Selection and replacement grant breeding rights and cause extinction within the population, respectively. They are independent of the representation scheme, and should only rely on your choice of fitness measure. Usually a small fraction of the population is chosen for breeding or replacement each generation. For simplicity, often the same number of individuals is chosen for breeding and for replacement, although this is not required (if the numbers differ, the population changes in size). Here are a few of the most common selection and replacement methods:
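As an illustration, here's a Python sketch of two widely used selection schemes (tournament and fitness-proportionate, or "roulette wheel") and one simple replacement scheme. These are picked as familiar examples and are not necessarily the ones the guide lists; all function names and the default tournament size are assumptions.

```python
import random

def tournament_select(population, fitness, k, rng, size=3):
    """Pick k parents; each is the fittest of `size` random contestants."""
    return [max(rng.sample(population, size), key=fitness)
            for _ in range(k)]

def roulette_select(population, fitness, k, rng):
    """Pick k parents with probability proportional to fitness.

    Assumes fitness values are non-negative and not all zero.
    """
    weights = [fitness(ind) for ind in population]
    return rng.choices(population, weights=weights, k=k)

def replace_worst(population, offspring, fitness):
    """Drop the least fit individuals to make room for the offspring,
    keeping the population size constant.

    Assumes there are fewer offspring than individuals in the population.
    """
    survivors = sorted(population, key=fitness, reverse=True)
    survivors = survivors[:len(population) - len(offspring)]
    return survivors + offspring
```

Notice that none of these functions looks inside a gene: they only call `fitness`, which is what keeps selection and replacement independent of the representation.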
Recombination (or breeding) is the process of using existing pairs of "parent" genes to produce new "offspring" genes. The details of this operation depend on your representation scheme, but by far the most common recombination operation is called crossover. Crossover can be used with string and array representations. It involves making copies of the parents and then swapping a chunk between the copies. Here's a visual example on two string genes:
The concept of crossover can be extended and used in other representations as well. For instance, a crossover operation on two tree structures might involve the exchange of two subtrees. Common variations on crossover include swapping chunks from different parts of the two genes or exchanging more than one chunk.
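Here's a minimal Python sketch of crossover on equal-length genes, covering the single-chunk case and the "chunk from the middle" variation. The function names are illustrative assumptions; because the code only slices and concatenates, it works on both strings and lists.

```python
import random

def one_point_crossover(mom, dad, rng):
    """Copy both parents, then swap their tails at a random cut point."""
    if len(mom) != len(dad) or len(mom) < 2:
        raise ValueError("genes must have equal length of at least 2")
    cut = rng.randrange(1, len(mom))   # 1 .. len-1, so both parts are non-empty
    return mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]

def two_point_crossover(mom, dad, rng):
    """Swap the chunk lying between two distinct random cut points."""
    if len(mom) != len(dad) or len(mom) < 1:
        raise ValueError("genes must be non-empty and of equal length")
    lo, hi = sorted(rng.sample(range(len(mom) + 1), 2))
    return (mom[:lo] + dad[lo:hi] + mom[hi:],
            dad[:lo] + mom[lo:hi] + dad[hi:])
```

For example, crossing `"AAAAAA"` with `"BBBBBB"` at cut point 2 in the one-point version would give `"AABBBB"` and `"BBAAAA"`.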
Mutation is a random process which slightly modifies the gene of an individual. With string genes, a mutation usually consists of changing a fixed number of characters, or changing each character with a very low probability (e.g., a 5% chance of changing each character). Other interesting mutations include lengthening, shortening, or modifying the gene, each with a respective probability.
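The two string mutations described above might look like this in Python. This is a sketch: the function names and the default binary alphabet are assumptions.

```python
import random

def mutate_per_char(gene, rate, rng, alphabet="01"):
    """Replace each character independently with probability `rate`.

    The replacement is drawn from the whole alphabet, so a "mutated"
    character can come out unchanged.
    """
    return "".join(rng.choice(alphabet) if rng.random() < rate else ch
                   for ch in gene)

def mutate_n_points(gene, n, rng):
    """Flip exactly n distinct positions of a binary string."""
    chars = list(gene)
    for i in rng.sample(range(len(chars)), n):
        chars[i] = "1" if chars[i] == "0" else "0"
    return "".join(chars)
```

With a rate of 0.05 and a 20-character gene, `mutate_per_char` touches one character per call on average, but any single call may touch none or several.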
A "Hello World" Example:
It's time for an example! The "Hello World" of EAs is called ONE-MAX. The problem is to produce a string of all 1s, starting from a population of random binary strings. This may seem silly since we already know the final answer, but the same could be said for "Hello World" programs in programming languages. In this case, it's the EA process and the concepts, not the final answer, that are most important. Using a string representation for the genes, the code is as follows:
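As a stand-in for the listing, here's a minimal ONE-MAX sketch in Python. It is not the author's original code: the function name `one_max`, the population size of 30, the 5% mutation rate, and the particular selection and replacement choices are all assumptions made for illustration.

```python
import random

def one_max(length=20, pop_size=30, mutation_rate=0.05,
            max_generations=10_000, seed=None):
    """Evolve binary strings toward all 1s. Returns (best gene, generations)."""
    rng = random.Random(seed)
    fitness = lambda gene: gene.count("1")
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]

    for generation in range(max_generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == length:      # exit: target reached
            return population[0], generation

        # Selection: the two fittest individuals get breeding rights.
        mom, dad = population[0], population[1]

        # Recombination: one-point crossover yields two children.
        cut = rng.randrange(1, length)
        children = [mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]]

        # Mutation: flip each bit with probability mutation_rate.
        children = ["".join(("1" if c == "0" else "0")
                            if rng.random() < mutation_rate else c
                            for c in child)
                    for child in children]

        # Replacement: the two least fit individuals go extinct.
        population[-2:] = children

    population.sort(key=fitness, reverse=True)
    return population[0], max_generations         # exit: iteration cap

if __name__ == "__main__":
    best, generations = one_max()
    print(f"best gene {best} after {generations} generations")
```

Because the two fittest individuals are never replaced, the best fitness in the population can never decrease from one generation to the next.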
Can you pick out which parts of this code correspond to the parts of the pseudocode? What type of mutation was used (N-point or probabilistic)? What types of selection and replacement schemes were used? What percentage of the population gets breeding rights at each generation? What are the exit criteria? How could this code be made more efficient (there are many ways)? How could the EA process be modularized? How much harder would this have been to write in C or Java? ;)
Now What? How Do I Choose?
You now probably have a feeling for the wide range of EA building blocks. But there are so many, how will you choose what's best for a particular problem? What makes them different? It's time for a little theory...
Fitness Landscapes & Diversity:
One way to think of how EAs solve problems is through hill-climbing. Think of breeding as a process of exploring the solution space: starting with high-fitness individuals, recombination and mutation bring new individuals into the population, whose genes are "nearby" the genes of the parents. Selection and replacement fuel the up-hill part: the new individuals who have a higher fitness will in turn be explored while the lower ones will eventually be discarded and so on, until you discover individuals that have the highest fitness of all nearby individuals -- they are at the top of that particular "hill" of fitness. Notice that "nearness" of other individuals is measured in the number of mutations and/or recombinations needed to get from here to there. So your choice of mutation and recombination operators determines the fitness landscape.
On one hand, hill-climbing causes EA populations to slowly cluster near the tops of these hills as they try to achieve maximum fitness. When most of the population's members are very close to one another (very few mutations or crossovers apart), their genes are very similar, they have much genetic material in common, and we say the population is not diverse. Hill-climbing is desired (we do want to maximize fitness after all), but only in moderation. If it happens too fast, it's easy for the whole population to become "stuck" on a small number of fitness hills that are not the highest in the solution space. Mathematically speaking, these are local optima.
On the other hand, when the population is diverse and spread out in the landscape, you may combine two "distant" parents to get a child somewhere in the middle, maybe on a new fitness hill. This allows for more fitness hills to be discovered, reducing the chance of getting stuck on a local optimum.
(You may have noticed that in the ONE-MAX example, there are none of these. There's only one fitness hill, with the string of all 1s at the top. Its fitness landscape is a 20-dimensional hypercube. Mutation moves along one or more edges of the cube, and crossover moves to any vertex along the subcube induced by the parents. Non-trivial problems generally have fitness landscapes that are too complex to characterize.)
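On that hypercube, the number of edges between two genes is their Hamming distance, which suggests one rough way to put a number on diversity. The Python sketch below assumes fixed-length genes and single-character mutations; the function names and the choice of mean pairwise distance as the measure are illustrative assumptions, not something from the guide.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length genes differ."""
    return sum(x != y for x, y in zip(a, b))

def diversity(population):
    """Mean pairwise Hamming distance; 0.0 means all genes are identical."""
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)
```

Printing this value once per generation is a cheap way to watch a population cluster: a value that drops toward zero early on hints that the population may be settling on a single hill.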
Here is how diversity is affected in general by the different operations:
Representation Matters, Too!
Mutation and recombination (and therefore the fitness landscape) rely on your choice of representation scheme. The representation should therefore make mutation and recombination behave like the biological concepts they represent. For instance, a small change in an individual's gene should make only a small to moderate change in its fitness characteristics. Likewise, combining parts of the genes of two individuals should produce an individual that shares some of its parents' characteristics. However, the result need not be merely an average of the parents; there may be synergy between different parts of the genes.
In solving difficult problems with EAs, finding a good representation scheme with good recombination and mutation operations can often be the hardest piece of the puzzle. There is no magic advice for choosing the "right" representation, and in addition to adhering to these guidelines, the choice must be feasible to implement.
Some Final Notes
EAs are a lot of fun, especially when you are modelling some real-world situation. Because at times EAs seem to work like magic, it's easy to forget that they have the same limitations (in terms of complexity) as any computer model. Do expect interesting results, but don't expect sentient digital lifeforms to crawl out of the primordial ooze of your 1s and 0s.
I think you'll enjoy working with evolutionary algorithms, as they're a bit of a departure from classical computation/analysis methods. Hopefully this guide will give you the background needed to have a lot of fun tinkering with EAs. Be creative and inventive, and happy evolving!
Update: fixed moved evoweb link (thanks atcroft)