Digression: Symbolic Regression • Suppose you are a criminologist, and you have some data about recidivism.

    Years in Prison   Holds Ph.D.   IQ    Injects Heroin in Eyeballs   Recidivist
          10               0         87               1                    1
           4               1         86               0                    0
          22               1        186               1                    1
           6               0        108               0                    1
           8               0        143               0                    0
           :               :          :               :                    :
Criminology 101 • You want a formula that predicts if someone will go back to jail after being released. • The formula will be based on the data collected, so the “independent variables” are • x1 = number of years in jail • x2 = holds Ph.D. • x3 = IQ • etc. • This is usually done with “regression”. Here is a simpler example, with one independent variable.
Symbolic Regression • A simple data set with one independent variable, called x. What’s the relationship between x and y?

    x     y
    1     2.1
    2     3.3
    4     3.1
    5     1.8
    7     3.2
    :      :

[scatter plot of y against x]
Symbolic Regression • You might try “linear regression”:

    y = mx + b

[plot of a straight line fitted to the data]
Symbolic Regression • You might try “quadratic regression”:

    y = ax^2 + bx + c

[plot of a parabola fitted to the data]
Symbolic Regression • You might try “exponential regression”:

    y = ax^b + c

[plot of the fitted curve]
Symbolic Regression • How would you choose? • Maybe there is some underlying “mechanism” that produced the data. • But you may not know… • “Symbolic regression” finds the form of the equation, and the coefficients, simultaneously.
How To Do Symbolic Regression? • One way: genetic programming. • “The evolution of computer programs through natural selection.” • The brainchild of John Koza, extending work by John Holland. • A very bizarre idea that actually works! • We will do this.
Regression via Genetic Programming • We know how to produce “algebraic expression trees.” • We can even form them randomly. • Koza says “Make a generation of random trees, evaluate their fitnesses, then let the more fit have sex to produce children.” • Maybe the children will be more fit?
Expression Trees Again • A one-variable tree is a regression equation. For example:

               +
             /   \
            -     *
           / \   / \
          +   x 2   x
         / \
        x   0.5

    y = (((x + 0.5) - x) + (2 * x))
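To make this concrete, here is one way such a tree might be represented in Java. It is only a minimal sketch: the class name Node, its fields, and its methods are illustrative choices, not the course's actual code.

    // Illustrative expression-tree node; names and structure are hypothetical.
    public class Node {
        String op;          // "+", "-", "*", a numeric constant, or the variable "x"
        Node left, right;   // null for leaf nodes

        Node(String op, Node left, Node right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }

        Node(String leafValue) {            // leaf: a constant such as "0.5", or "x"
            this(leafValue, null, null);
        }

        // Recursively evaluate this tree for a given value of x.
        double evaluate(double x) {
            if (left == null) {
                return op.equals("x") ? x : Double.parseDouble(op);
            }
            double a = left.evaluate(x);
            double b = right.evaluate(x);
            switch (op) {
                case "+": return a + b;
                case "-": return a - b;
                case "*": return a * b;
                default:  throw new IllegalStateException("unknown operator " + op);
            }
        }

        // Deep copy, useful later when building child trees.
        Node copy() {
            return left == null ? new Node(op) : new Node(op, left.copy(), right.copy());
        }

        @Override public String toString() {
            return left == null ? op : "(" + left + " " + op + " " + right + ")";
        }
    }

With this sketch, the tree above would be built as new Node("+", new Node("-", new Node("+", new Node("x"), new Node("0.5")), new Node("x")), new Node("*", new Node("2"), new Node("x"))), and evaluate(1.0) would return 2.5.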
Evaluating Expression Trees yp = (((x + 0.5) - x) + (2 * x)) x yo yp |yo - yp|2 Superscripts: “o” for “observed” “p” for “predicted” 1 2 4 5 7 2.1 2.5 0.16 3.3 4.5 1.44 3.1 8.5 29.16 1.8 10.5 75.69 3.2 14.5 127.69 234.14 = “fitness”
A Generation of Random Trees Tree 1 Tree 2 Tree 3 Tree 4 … Tree Fitness 1 335 2 1530 3 950 4 1462 : : (most of these are really rotten!)
Choosing Parents • From Generation 1 (Tree 1, Tree 2, Tree 3, Tree 4, …), choose two trees at random, “proportional to their fitness”, to be the parents.

    Tree   Fitness
      1       335
      2      1530
      3       950
      4      1462
      :         :
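“Proportional to fitness” is usually implemented with roulette-wheel selection. One wrinkle: the fitness defined above is an error, so lower is better; the sketch below therefore weights each tree by 1 / (1 + error) so that better trees are picked more often. As before, the names are illustrative.

    import java.util.List;
    import java.util.Random;

    // Hypothetical roulette-wheel parent selection.
    public class Selector {
        private final Random rng = new Random();

        Node chooseParent(List<Node> trees, double[] errors) {
            double[] weights = new double[errors.length];
            double total = 0.0;
            for (int i = 0; i < errors.length; i++) {
                weights[i] = 1.0 / (1.0 + errors[i]);    // smaller error -> bigger weight
                total += weights[i];
            }
            double spin = rng.nextDouble() * total;      // spin the wheel
            for (int i = 0; i < weights.length; i++) {
                spin -= weights[i];
                if (spin <= 0) {
                    return trees.get(i);
                }
            }
            return trees.get(trees.size() - 1);          // guard against rounding error
        }
    }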
“Sexual Reproduction” • Choose a “crossover point” in each parent tree from Generation 1, at random. Then swap the subtrees below those points to make two new child trees for Generation 2.
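Subtree crossover can be sketched as: copy both parents, pick one node at random in each copy, and swap the two subtrees rooted at those nodes. The code below does this by exchanging the chosen nodes' contents, relying on the mutable fields of the hypothetical Node class sketched earlier.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Hypothetical subtree crossover producing two children from two parents.
    public class Crossover {
        private final Random rng = new Random();

        Node[] crossover(Node parent1, Node parent2) {
            Node child1 = parent1.copy();
            Node child2 = parent2.copy();
            Node point1 = randomNode(child1);   // crossover point in child 1
            Node point2 = randomNode(child2);   // crossover point in child 2
            swapContents(point1, point2);       // swap the subtrees rooted there
            return new Node[] { child1, child2 };
        }

        private Node randomNode(Node root) {
            List<Node> all = new ArrayList<>();
            collect(root, all);
            return all.get(rng.nextInt(all.size()));
        }

        private void collect(Node n, List<Node> out) {
            if (n == null) return;
            out.add(n);
            collect(n.left, out);
            collect(n.right, out);
        }

        private void swapContents(Node a, Node b) {
            String op = a.op;  Node l = a.left;  Node r = a.right;
            a.op = b.op;       a.left = b.left;  a.right = b.right;
            b.op = op;         b.left = l;       b.right = r;
        }
    }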
The Steps • Create Generation 1 by randomly generating 500 trees. • Find the fitness of each tree. • Choose pairs of parent trees, proportional to their fitness. • Crossover to make two child trees, adding them to Generation 2. • Continue until there are 500 child trees in Generation 2. • Repeat for 50 generations, keeping the best (most fit) tree over all generations.
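Putting it together, the whole loop might look like the sketch below, which wires up the hypothetical pieces from the earlier sketches (Node, Fitness, TreeFactory, Selector, Crossover). The population size, tree depth, and generation count follow the steps above; everything else is an arbitrary illustrative choice.

    import java.util.ArrayList;
    import java.util.List;

    public class SymbolicRegression {
        public static void main(String[] args) {
            double[] x = {1, 2, 4, 5, 7};
            double[] y = {2.1, 3.3, 3.1, 1.8, 3.2};
            TreeFactory factory = new TreeFactory();
            Selector selector = new Selector();
            Crossover crossover = new Crossover();

            List<Node> population = new ArrayList<>();
            for (int i = 0; i < 500; i++) {              // Generation 1: random trees
                population.add(factory.randomTree(4));
            }

            Node bestTree = null;
            double bestError = Double.POSITIVE_INFINITY;

            for (int gen = 0; gen < 50; gen++) {         // 50 generations
                double[] errors = new double[population.size()];
                for (int i = 0; i < population.size(); i++) {
                    errors[i] = Fitness.of(population.get(i), x, y);
                    if (errors[i] < bestError) {         // keep the best tree seen so far
                        bestError = errors[i];
                        bestTree = population.get(i);
                    }
                }
                List<Node> next = new ArrayList<>();
                while (next.size() < 500) {              // breed the next generation
                    Node p1 = selector.chooseParent(population, errors);
                    Node p2 = selector.chooseParent(population, errors);
                    for (Node child : crossover.crossover(p1, p2)) {
                        next.add(child);
                    }
                }
                population = next;
            }
            System.out.println("best tree: " + bestTree + "   error: " + bestError);
        }
    }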
How Could This Possibly Work? • No one seems to be able to say… • John Holland proved something called the “schema theorem,” but it really doesn’t explain much. • It’s a highly “parallel” process that recombines “good” building blocks. • It really does work very well for a huge variety of hard problems!
Why This, in a Java Course? • Because we’re going to implement it! • Because writing code to implement this isn’t too hard. • Because it illustrates a large number of O-O and Java ideas. • Because it’s fun! • Here is what my implementation looks like: [screenshot of the running program]