Digression: Symbolic Regression • Suppose you are a criminologist, and you have some data about recidivism.

    Years in Prison   Holds Ph.D.   IQ    Injects Heroin in Eyeballs   Recidivist
          10               0         87               1                    1
           4               1         86               0                    0
          22               1        186               1                    1
           6               0        108               0                    1
           8               0        143               0                    0
           :               :          :               :                    :
Criminology 101 • You want a formula that predicts if someone will go back to jail after being released. • The formula will be based on the data collected, so the “independent variables” are • x1 = number of years in jail • x2 = holds Ph.D. • x3 = IQ • etc. • This is usually done with “regression”. Here is a simpler example, with one independent variable.
Symbolic Regression • A simple data set with one independent variable, called x. What’s the relationship between x and y?

    x     y
    1     2.1
    2     3.3
    4     3.1
    5     1.8
    7     3.2
    :      :

[scatter plot of y against x]
Symbolic Regression • You might try “linear regression”:

    y = mx + b

[plot of a straight line fitted to the data]
Symbolic Regression • You might try “quadratic regression”:

    y = ax^2 + bx + c

[plot of a parabola fitted to the data]
Symbolic Regression • You might try “exponential regression”:

    y = ax^b + c

[plot of the fitted curve]
Symbolic Regression • How would you choose? • Maybe there is some underlying “mechanism” that produced the data. • But you may not know… • “Symbolic regression” finds the form of the equation, and the coefficients, simultaneously.
How To Do Symbolic Regression? • One way: genetic programming. • “The evolution of computer programs through natural selection.” • The brainchild of John Koza, extending work by John Holland. • A very bizarre idea that actually works! • We will do this.
Regression via Genetic Programming • We know how to produce “algebraic expression trees.” • We can even form them randomly. • Koza says “Make a generation of random trees, evaluate their fitnesses, then let the more fit have sex to produce children.” • Maybe the children will be more fit?
Expression Trees Again • A one-variable tree is a regression equation. For example:

               +
             /   \
            -     *
           / \   / \
          +   x 2   x
         / \
        x   0.5

    y = (((x + 0.5) - x) + (2 * x))
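To make this concrete, here is one way such a tree might be represented in Java. It is only a minimal sketch: the class name Node, its fields, and its methods are illustrative choices, not the course's actual code.

    // Illustrative expression-tree node; names and structure are hypothetical.
    public class Node {
        String op;          // "+", "-", "*", a numeric constant, or the variable "x"
        Node left, right;   // null for leaf nodes

        Node(String op, Node left, Node right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }

        Node(String leafValue) {            // leaf: a constant such as "0.5", or "x"
            this(leafValue, null, null);
        }

        // Recursively evaluate this tree for a given value of x.
        double evaluate(double x) {
            if (left == null) {
                return op.equals("x") ? x : Double.parseDouble(op);
            }
            double a = left.evaluate(x);
            double b = right.evaluate(x);
            switch (op) {
                case "+": return a + b;
                case "-": return a - b;
                case "*": return a * b;
                default:  throw new IllegalStateException("unknown operator " + op);
            }
        }

        // Deep copy, useful later when building child trees.
        Node copy() {
            return left == null ? new Node(op) : new Node(op, left.copy(), right.copy());
        }

        @Override public String toString() {
            return left == null ? op : "(" + left + " " + op + " " + right + ")";
        }
    }

With this sketch, the tree above would be built as new Node("+", new Node("-", new Node("+", new Node("x"), new Node("0.5")), new Node("x")), new Node("*", new Node("2"), new Node("x"))), and evaluate(1.0) would return 2.5.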
Evaluating Expression Trees yp = (((x + 0.5) - x) + (2 * x)) x yo yp |yo - yp|2 Superscripts: “o” for “observed” “p” for “predicted” 1 2 4 5 7 2.1 2.5 0.16 3.3 4.5 1.44 3.1 8.5 29.16 1.8 10.5 75.69 3.2 14.5 127.69 234.14 = “fitness”
A Generation of Random Trees Tree 1 Tree 2 Tree 3 Tree 4 … Tree Fitness 1 335 2 1530 3 950 4 1462 : : (most of these are really rotten!)
Choosing Parents • From Generation 1 (Tree 1, Tree 2, Tree 3, Tree 4, …), choose two trees at random, “proportional to their fitness”, to be the parents.

    Tree   Fitness
      1       335
      2      1530
      3       950
      4      1462
      :         :
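“Proportional to fitness” is usually implemented with roulette-wheel selection. One wrinkle: the fitness defined above is an error, so lower is better; the sketch below therefore weights each tree by 1 / (1 + error) so that better trees are picked more often. As before, the names are illustrative.

    import java.util.List;
    import java.util.Random;

    // Hypothetical roulette-wheel parent selection.
    public class Selector {
        private final Random rng = new Random();

        Node chooseParent(List<Node> trees, double[] errors) {
            double[] weights = new double[errors.length];
            double total = 0.0;
            for (int i = 0; i < errors.length; i++) {
                weights[i] = 1.0 / (1.0 + errors[i]);    // smaller error -> bigger weight
                total += weights[i];
            }
            double spin = rng.nextDouble() * total;      // spin the wheel
            for (int i = 0; i < weights.length; i++) {
                spin -= weights[i];
                if (spin <= 0) {
                    return trees.get(i);
                }
            }
            return trees.get(trees.size() - 1);          // guard against rounding error
        }
    }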
“Sexual Reproduction” • Choose a “crossover point” in each parent tree from Generation 1, at random. Then swap the subtrees below those points to make two new child trees for Generation 2.
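Subtree crossover can be sketched as: copy both parents, pick one node at random in each copy, and swap the two subtrees rooted at those nodes. The code below does this by exchanging the chosen nodes' contents, relying on the mutable fields of the hypothetical Node class sketched earlier.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Hypothetical subtree crossover producing two children from two parents.
    public class Crossover {
        private final Random rng = new Random();

        Node[] crossover(Node parent1, Node parent2) {
            Node child1 = parent1.copy();
            Node child2 = parent2.copy();
            Node point1 = randomNode(child1);   // crossover point in child 1
            Node point2 = randomNode(child2);   // crossover point in child 2
            swapContents(point1, point2);       // swap the subtrees rooted there
            return new Node[] { child1, child2 };
        }

        private Node randomNode(Node root) {
            List<Node> all = new ArrayList<>();
            collect(root, all);
            return all.get(rng.nextInt(all.size()));
        }

        private void collect(Node n, List<Node> out) {
            if (n == null) return;
            out.add(n);
            collect(n.left, out);
            collect(n.right, out);
        }

        private void swapContents(Node a, Node b) {
            String op = a.op;  Node l = a.left;  Node r = a.right;
            a.op = b.op;       a.left = b.left;  a.right = b.right;
            b.op = op;         b.left = l;       b.right = r;
        }
    }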
The Steps • Create Generation 1 by randomly generating 500 trees. • Find the fitness of each tree. • Choose pairs of parent trees, proportional to their fitness. • Crossover to make two child trees, adding them to Generation 2. • Continue until there are 500 child trees in Generation 2. • Repeat for 50 generations, keeping the best (most fit) tree over all generations.
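Putting it together, the whole loop might look like the sketch below, which wires up the hypothetical pieces from the earlier sketches (Node, Fitness, TreeFactory, Selector, Crossover). The population size, tree depth, and generation count follow the steps above; everything else is an arbitrary illustrative choice.

    import java.util.ArrayList;
    import java.util.List;

    public class SymbolicRegression {
        public static void main(String[] args) {
            double[] x = {1, 2, 4, 5, 7};
            double[] y = {2.1, 3.3, 3.1, 1.8, 3.2};
            TreeFactory factory = new TreeFactory();
            Selector selector = new Selector();
            Crossover crossover = new Crossover();

            List<Node> population = new ArrayList<>();
            for (int i = 0; i < 500; i++) {              // Generation 1: random trees
                population.add(factory.randomTree(4));
            }

            Node bestTree = null;
            double bestError = Double.POSITIVE_INFINITY;

            for (int gen = 0; gen < 50; gen++) {         // 50 generations
                double[] errors = new double[population.size()];
                for (int i = 0; i < population.size(); i++) {
                    errors[i] = Fitness.of(population.get(i), x, y);
                    if (errors[i] < bestError) {         // keep the best tree seen so far
                        bestError = errors[i];
                        bestTree = population.get(i);
                    }
                }
                List<Node> next = new ArrayList<>();
                while (next.size() < 500) {              // breed the next generation
                    Node p1 = selector.chooseParent(population, errors);
                    Node p2 = selector.chooseParent(population, errors);
                    for (Node child : crossover.crossover(p1, p2)) {
                        next.add(child);
                    }
                }
                population = next;
            }
            System.out.println("best tree: " + bestTree + "   error: " + bestError);
        }
    }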
How Could This Possibly Work? • No one seems to be able to say… • John Holland proved something called the “schema theorem,” but it really doesn’t explain much. • It’s a highly “parallel” process that recombines “good” building blocks. • It really does work very well for a huge variety of hard problems!
Why This, in a Java Course? • Because we’re going to implement it! • Because writing code to implement this isn’t too hard. • Because it illustrates a large number of O-O and Java ideas. • Because it’s fun! • Here is what my implementation looks like: [screenshot of the running program]