Genetic Programming: A Novel Failure

Published on January 19, 2009. computer-science (5), genetic-algorithms (2)

Previously I wrote about the underpinnings of genetic algorithms, and briefly mentioned the distinction between genetic algorithms and genetic programming. Here we'll take a deeper dive into genetic programming. Before we begin our jolly stroll through a genetic programming example, I'll acknowledge bias: I believe genetic programming is a fantastic but fictional solution to a complex problem¹.

The Dream of Genetic Programming

After a long night of cleansing the internet of false opinions, you realize that you cannot singlehandedly maintain its sacred purity. You build a neural network that accepts a blog entry as input and outputs the correct action to fix the problem, but it keeps returning Weep in despair, and no amount of training has gotten the neural net to correctly emit Rant or Mock on Reddit when fed blog entries containing scintillating topics like Bitf*ck vs. Scheme, Vista Sucks or Git Owns Perforce.

Head drooped with resignation, you think to yourself, if only I could write a program that would write this program for me! Wait--for this is truly brilliant--if I just feed a list of operations into a genetic algorithm and then...

Two weeks later the newspaper--the one which hasn't yet succumbed to bankruptcy--is heralding your achievement: you, the program you programmed, and the program the program you programmed programmed have cleaned the internet of wrong, and the UN has ordered those nasty and irrelevant bloggers stripped of citizenship and marched into concentration camps.

Ah yes, all is right with the world.

Starting With a Slightly Simpler Problem

Regretfully, we'll need to start out with a simpler problem than salvaging the internet. Still, the problem we're going to attempt solving is a bit more complex than the one we were working on in the previous article (creating an array of N numbers that equal X when summed).

In our new problem an individual will consist of N elements randomly selected from this list (one element may be selected multiple times):

options = ('+','-','*','/','0','1','2','3','4','5','6','7','8','9')

Thus, we can create an individual with the following code.

from random import randint
def part():
options = ('+','-','*','/','0','1','2','3','4','5','6','7','8','9')
return options[randint(0, len(options)-1)]
def individual(length=5):
return [ part() for i in range(length) ]

And a population is generated like this:

def population(size=1000, length=5):
    return [ individual(length=length) for x in range(size) ]

Now let's describe the problem: we want an individual that--when its elements are evaluated by Python's eval function--equals some number X. Not all combinations of the elements form valid Python syntax (for example with N=5, then 5 5 * + 1 is such an individual), and those individuals (i.e. those whose elements trigger an exception when passed to eval) should receive a very poor fitness score. For individuals that do eval successfully, then their fitness score is their distance from the target value X.

We can formalize that definition into this fitness function:

def fitness(x, target):
    try:
        val = eval(" ".join(x))
        return target - abs(target-val)
    except:
        return -100000

Beyond these changes, the code is largely identical to our previous code implementing a genetic algorithm. The minor changes relate to our new fitness function where a high value represents strong fitness, while in the previous fitness function a high value presented poor fitness. (You can see the initial code for this problem here.)

Only requiring minimal changes is a testament to the universality of genetic algorithms; we've gone from solving an extremely simple arithmetic problem with a genetic algorithm to solving a substantially more complex genetic programming problem without making any substantial changes to the algorithm! Maybe we should call them universal problem solvers instead of just genetic algorithms.

Let's just verify our optimism real quickly by running a thousand generations.

>>> from gp2 import *
>>> populations, fitness_scores = evolve()
>>> fitness_scores[10:]
[-96200.27, -95499.98, -96000.02, -96200.006, -96300.002, -96599.83, -96599.904, -96000.06, -96500.001, -96800.001]
>>> fitness_scores[-10:]
[-96300.45, -96500.04, -96200.51, -96000.008, -96300.04, -96000.01, -97099.97, -96200.091, -96400.01, -95099.96]

Fuck. Fuck. Fuck. What the fuck happened there? The average fitness of the first ten generations is -96270.54, and the average fitness of generations 990 through 1000 is -96210.11. That isn't even statistically signifigant. I called you a universal problem solver, and then you did this shit to me? Fuck. Let me take a few minutes to recover.

After a Brief Interim With a Paper Bag

Okay, so I may have over-reacted a bit there, but you've probably already deduced the first take home message: untuned genetic algorithms can perform very poorly (i.e. straight-out fail) on some problems. In particular genetic algorithms perform exceptionally poorly when relying upon an uninformative fitness function, and our current fitness function is fantastically uninformative.

To begin with, very few of the individuals are even in valid Python syntax, and all invalid individuals receive the same fitness value of -10000. This means that our current fitness function gives poor feedback on fitness for the majority of individuals, and thus fails right out of the gate in a gruesome spectacle. That's okay though, because we can just improve on our current fitness function, right?

Let's take another look at our fitness function.

def fitness(x, target):
    try:
        val = eval(" ".join(x))
        return target - abs(target-val)
    except:
        return -100000

If the syntax is bad, then we get no feedback on how close the syntax is to being valid. If it is valid, then we do start to get useful feedback, but we're a bit stymied as the vast majority of individuals are invalid (we'll look into the ratio between valid and invalid below).

Without being able to evaluate the code we have two ways we can generate some more specific feedback:

Calculate a fitness function based on how close the code is to syntactically correct,
Calculate a fitness function based on how close the parameters are to a desired correct solution.

The first option is fairly simple in this extremely trivial example (the format must alternate between number and operation, and must begin and end with a number), but will explode into a rather formidable problem in more complex scenarios.

The second option is entirely bogus, because it means you are 'solving' a problem by teaching it an answer you already know; in that case you're neither optimizing nor solving anything, you're perpetuating a farce.

Like Dusting with a Jackhammer

Genetic algorithms are a category of optimization techniques, but looking at fitness we can see our problem: first it is attempting to solve a search problem (finding for working solutions, of whatever quality), and then trying to optimize (improving the quality of discovered solutions). Very few--if any--algorithms are going to perform well for both the search and optimization aspects of this problem.

Infact, in this example we're never even escaping the search aspect of the problem. We can make that more clear by modifying our grade_population function to count the number of syntactically valid individuals (instead of measuring the average fitness of the population).

We'll change it from this:

def grade_population(pop, target):
    total = sum([ fitness(x, target) for x in pop ])
    avg = total / (len(pop) * 1.0)
    return avg

to this (where -100000 is the fitness for an individual that has invalid syntax):

def grade_population(pop, target):
    fitnesses = [ fitness(x, target) for x in pop ]
    valids = [ x for x in fitnesses if x > -100000 ]
    return len(valids)

Now we can rerun the above experiment and with a more concrete view of how we're stuck in the search portion of this problem.

>>> from gp2 import *
>>> history, fitness = evolve(generations=1000)
>>> fitness[:10]
[37, 37, 34, 37, 33, 27, 21, 21, 33, 36]
>>> sum(b[:10]) / 10.0 # avg of first ten
31.6
>>> b[-10:]
[38, 46, 44, 49, 37, 43, 35, 39, 44, 35]
>>> sum(b[-10:]) / 10.0 # avg of last ten
41.0

Over a thousand generations the number of valid individuals did increase, but rather slowly. Out of curiosity, lets see what happens if we try increasing from one thousand to five thousand generations.

>>> history, fitness = evolve(generations=5000)
>>> fitness[:10]
[37, 42, 37, 46, 41, 40, 34, 33, 35, 34]
>>> sum(fitness[:10]) / 10.0 # avg of first ten
37.9
>>> fitness[-10:]
[30, 43, 42, 53, 44, 45, 42, 35, 37, 36]
>>> sum(fitness[-10:]) / 10.0 # avg of last ten
40.7

Yep, increasing from 1000 to 5000 generations didn't have any noticeable affect: we're definitely stuck in the search portion of this problem. Instead of simply increasing the generations, lets try tweaking the parameters used in the evolve function. Right now we are only retaining the top thirty percent of entries (and are thus creating a substantial quantity of children from crossing over successful parents), let's try upping that to retaining the top ninety percent of individuals.

>>> history, fitness = evolve(generations=1000, retain=0.9)
>>> fitness[:10]
[35, 38, 34, 34, 33, 35, 32, 37, 32, 38]
>>> sum(fitness[:10]) / 10.0
34.8
>>> fitness[-10:]
[42, 34, 36, 26, 37, 34, 31, 44, 36, 41]
>>> sum(fitness[-10:]) / 10.0
36.1

We certainly don't see any definition improvement there. What about reducing the percent of retained individuals to the top ten percent and increasing the number of generations to three thousand?

>>> history, fitness = evolve(retain=0.1, generations=3000)
>>> fitness[:10]
[35, 28, 30, 37, 38, 44, 41, 49, 37, 39]
>>> sum(fitness[:10]) / 10.0
37.8
>>> fitness[-10:]
[36, 38, 47, 43, 46, 50, 45, 47, 51, 43]
>>> sum(fitness[-10:]) / 10.0
44.6

Well, that didn't work well either. Perhaps we are going about this the wrong way. Right now we are creating individuals where N (the number of elements) is 5. Let's try reducing N to 3. (We'll keep the target number at 15.)

>>> history, fitness = evolve(individ_size=3)
>>> fitness[:10]
[162, 162, 141, 146, 140, 162, 148, 145, 152, 170]
>>> (sum(fitness[:10])*1.0) / 10
152.8
>>> fitness[-10:]
[149, 156, 157, 158, 165, 169, 174, 157, 159, 164]
>>> (sum(fitness[-10:])*1.0) / 10
160.8

With fewer combinations (3^3 instead of 5^5) we're seeing more syntactically sound individuals, but the percentage increase over generations is actually worse. Let's make one last ditch effort. Let's change the target X to 6 instead of 15, retain only the top ten percent of individuals, and also increase the number of generations to three thousand.

>>> history, fitness = evolve(retain=0.1, individ_size=3, target=6, generations=3000)
>>> fitness[:10]
[156, 147, 137, 135, 129, 153, 152, 141, 146, 149]
>>> sum(fitness[:10]) / 10.0
144.5
>>> fitness[-10:]
[160, 166, 157, 152, 163, 170, 177, 167, 153, 161]
>>> sum(fitness[-10:]) / 10.0
162.6

For an algorithm whose primary weakness is quick and myopic optimization, this lack of progress is a strong indication of what we've already grown to suspect: the genetic algorithm simply cannot overcome the search aspect of this problem.

This one example is far from sufficient justification, but I'll go a bit further and suggest that genetic programming will only solve problems when you supply a detailed outline of the solution. Genetic algorithms can optimize a solution from supplied components, but genetic programming is inadequate to both discover and optimize a solution. This is a shame, as there is a level of elegance to the concept that is hard to dismiss.

It isn't hard to see how this idea could repeatedly lure scientists to bang their heads on the problem until they contorted their cortexes and algorithms enough to believe it worked. If only the results of genetic programming justified its conceptual beauty.

(You can see revised code here.)

A Brief Response on 2009/1/20

Many commenters felt strongly that this introduction unfairly maligns genetic programming. Based on the deep anger they have expressed, it seems likely that these are individuals who are doing research in the field of genetic programming. It is always those too close to a topic, and often academics, who get irrationally upset about criticism. Certainly I'd be equally upset if someone was suggesting I had spent half a decade chasing a pipe dream.

Beyond the anger, the fundamental complaint is that I am generating the individuals using a trivial approach of randomly combining elements, rather than creating the rather complex infrastructure needed to consistently generate syntactically correct abstract syntax trees.

In the above article, I mentioned two responses to an uninformative fitness function:

Teach the fitness function how to code.
Teach the fitness function the code to solve the problem.

Although listed separately, and above the screaming voices, I don't believe these two solutions are genuinely distinct. Even if we use couch our resorting to #2 behind the premise of performing #1, we are still teaching the fitness function to create the solution we desire. If we accept that genetic algorithms are incapable of solving problems in the domain of search, how could it be otherwise? Even restricted to valid code, the problem space is extraordinarily large; we either require a true search algorithm for discovering solutions, or we provide hints on the direction in which to search.

Two final points:

The aim of these articles is introducing novice programmers to the ideas of computer science, while sticking with the Python language, and most importantly supplying full working examples that don't rely on external (i.e. non-standard) libraries. Using abstract syntax trees in the solution wasn't in the cards, and neither was dismissing genetic programming as unattainably complex or restricted to the realm of celestial researchers.
There is mention of some unspecified problem area where genetic programming is superior to genetic algorithms specifically, and perhaps other algorithms in general. Although the definition of superior is unstated, as is the problem domain, I'd be glad to have a contest along these lines:
1. A problem is chosen that is believed to be firmly within the problem domain where genetic programming is most competitive.
2. A single common programming language is decided upon (perhaps Scheme or Common Lisp?).
3. Code is written without resort to algorithm specific support libraries (i.e. a library for generating random numbers would be acceptable, but a library which supports genetic programming/genetic algorithms is not). This seems important as it reveals the true complexity of the algorithm, as opposed to concealing the complexity behind a library.
4. Any number of contestants submit solutions, along with the category of algorithm it uses.
5. The solutions and tests cases are distributed in a zipped file so we may all verify the results.
6. Anyone is free to create their own grading scale, and to write up the group results with their own bias.

Initial Full Code

from random import randint, random
def part():
options = ('+','-','*','/')
options += tuple( "%d" % x for x in range(0,10))
return options[randint(0, len(options)-1)]
def individual(length=5):
return [ part() for i in range(length) ]
def population(size=1000, length=5):
return [ individual(length=length) for x in range(size) ]
def fitness(x, target):
try:
val = eval(" ".join(x))
return target - abs(target-val)
except:
return -100000
def grade_population(pop, target):
total = sum([ fitness(x, target) for x in pop ])
avg = total / (len(pop) * 1.0)
return avg
def generation(pop, target, retain=0.3, random_select=0.05, mutate=0.01):
graded = [ (fitness(x, target), x) for x in pop ]
graded = [ x[1] for x in sorted(graded, reverse=True) ]
retain_length = int(len(graded)*retain)
parents = graded[:retain_length]
# randomly add other individuals to
# promote genetic diversity
for individual in graded[retain_length:]:
if random() > random_select:
parents.append(individual)
# mutate some individuals
for individual in parents:
if random() > mutate:
pos_to_mutate = randint(0, len(individual)-1)
# this mutation is not ideal, because it
# restricts the range of possible values,
# but the function is unaware of the min/max
# values used to create the individuals,
individual[pos_to_mutate] = part()
# crossover parents to create children
parents_length = len(parents)
desired_length = len(pop) - parents_length
children = []
while len(children) < desired_length:
male = randint(0, parents_length-1)
female = randint(0, parents_length-1)
if male != female:
male = parents[male]
female = parents[female]
half = len(male) / 2
child = male[:half] + female[half:]
children.append(child)
parents.extend(children)
return parents
def evolve(pop_size=1000, target=15, individ_size=5, retain=0.3,
generations=1000, random_select=0.05, mutate=0.01):
p = population(size=pop_size, length=individ_size)
history = [ p, ]
fit_history = [ grade_population(p, target=target) ]
for i in range(generations):
p = generation(p, target, retain, random_select, mutate)
history.append(p)
fit_history.append(grade_population(p, target=target))
return history, fit_history

Refined Full Code

from random import randint, random
def part():
options = ('+','-','*','/')
options += tuple( "%d" % x for x in range(0,10))
return options[randint(0, len(options)-1)]
def individual(length=5):
return [ part() for i in range(length) ]
def population(size=1000, length=5):
return [ individual(length=length) for x in range(size) ]
def fitness(x, target):
try:
val = eval(" ".join(x))
return target - abs(target-val)
except:
return -100000
def grade_population(pop, target):
pop_fitness = [ fitness(x, target) for x in pop ]
valid_syntax = [ x for x in pop_fitness if x > -100000 ]
valid_count = len(valid_syntax)
return valid_count
def generation(pop, target, retain=0.3, random_select=0.05, mutate=0.01):
graded = [ (fitness(x, target), x) for x in pop ]
graded = [ x[1] for x in sorted(graded, reverse=True) ]
retain_length = int(len(graded)*retain)
parents = graded[:retain_length]
# randomly add other individuals to
# promote genetic diversity
for individual in graded[retain_length:]:
if random() > random_select:
parents.append(individual)
# mutate some individuals
for individual in parents:
if random() > mutate:
pos_to_mutate = randint(0, len(individual)-1)
# this mutation is not ideal, because it
# restricts the range of possible values,
# but the function is unaware of the min/max
# values used to create the individuals,
individual[pos_to_mutate] = part()
# crossover parents to create children
parents_length = len(parents)
desired_length = len(pop) - parents_length
children = []
while len(children) < desired_length:
male = randint(0, parents_length-1)
female = randint(0, parents_length-1)
if male != female:
male = parents[male]
female = parents[female]
half = len(male) / 2
child = male[:half] + female[half:]
children.append(child)
parents.extend(children)
return parents
def evolve(pop_size=1000, target=15, individ_size=5, retain=0.3,
generations=1000, random_select=0.05, mutate=0.01):
p = population(size=pop_size, length=individ_size)
history = [ p, ]
fit_history = [ grade_population(p, target=target) ]
for i in range(generations):
p = generation(p, target, retain, random_select, mutate)
history.append(p)
fit_history.append(grade_population(p, target=target))
return history, fit_history

If these computer science entries are ever collected into a book, the title will contain the word biased in bold Marker Felt.↩