Bloat and Universal Consistency in GP
{ merve.amil, nicolas.bredeche, christian.gagne, sylvain.gelly, marc.schoenauer, olivier.teytaud } @lri.fr
(TAO, Inria, LRI, University Paris-Sud)
With thanks to William Langdon.
Outline
1. The framework (symbolic regression in GP)
2. The goals (consistency and no-bloat)
3. A standard result for consistency
4. A penalized fitness against bloat
5. Conclusion
Bloat and Universal Consistency in GP
Framework: symbolic regression (we use GP to mine a space of Turing-computable functions that fit examples).
Symbolic Regression
Examples:
X1 = [0.14; 2.07; -1], y1 = 0
X2 = [1; 2; 31.5], y2 = 1
...
Xn = [1.5; -1; 10.0], yn = 1
Hypothesis: the (xi, yi) are independent and identically distributed according to a law law(x, y), which is unknown.
Goal: find f Turing-computable (on real numbers) maximizing P(f(x) = y).
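For concreteness, a minimal Python sketch of this setup; `law` is a hypothetical stand-in for the unknown distribution, and `estimated_accuracy` only illustrates the objective P(f(x) = y):

```python
import random

def law():
    # Hypothetical stand-in for the unknown distribution law(x, y):
    # x has three real-valued features, y is a binary label.
    x = [random.uniform(-2, 2) for _ in range(3)]
    y = 1 if x[0] > 0 else 0  # an arbitrary illustrative ground truth
    return x, y

# n i.i.d. examples (X1, y1), ..., (Xn, yn), as on the slide
n = 100
examples = [law() for _ in range(n)]

# The goal is a computable f maximizing P(f(x) = y); since the law is
# unknown, that probability can only be estimated, e.g. on fresh samples:
def estimated_accuracy(f, m=10000):
    return sum(f(x) == y for x, y in (law() for _ in range(m))) / m
```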
Bloat and Universal Consistency in GP
Framework: symbolic regression.
Consistency: is the function we output a good function? (It could work only on the examples!)
No-bloat: does the function we output grow without bound as the number of examples increases?
What happens usually?
• We know P', the empirical distribution (average of the Dirac masses at the (xi, yi)).
• We do not know P (ill-posed problem).
• Two troubles:
• sometimes P(f(x) = y) is disappointing;
• sometimes f is huge.
• We study the behavior as n --> infinity.
What happens usually?
• We know P', the empirical distribution (average of the Dirac masses at the (xi, yi)).
• We do not know P.
• We would like to maximize P(f(x) = y).
• We can only maximize fitness = P'(f(x) = y) (possibly with complexity-penalization terms).
Q: Assume that we perfectly optimize the fitness. Does this lead to a good function? Does this lead to no-bloat?
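A small sketch of the two quantities (the `penalty` argument is a hypothetical placeholder for a complexity-penalization term):

```python
def empirical_fitness(f, examples):
    # P'(f(x) = y): the fraction of training examples that f labels
    # correctly (the average of the Dirac masses at the (xi, yi)).
    return sum(f(x) == y for x, y in examples) / len(examples)

def penalized_fitness(f, examples, penalty):
    # The quantity actually optimized when a complexity-penalization
    # term is added; penalty(f, n) is a hypothetical placeholder.
    return empirical_fitness(f, examples) - penalty(f, len(examples))
```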
Outline
1. The framework (symbolic regression in GP)
2. The goals (consistency and no-bloat)
3. A standard result for consistency
4. A penalized fitness against bloat
5. Conclusion
What questions do we want to answer?
• Universal consistency: can we ensure that P(f(x) = y) ==> optimality as n --> infinity? (At least if an optimal f exists!)
• Bloat: can we ensure that bloat does not occur, i.e. if a correct program f with bounded length exists, can we ensure that length(f) does not run to infinity?
Outline
1. The framework (symbolic regression in GP)
2. The goals (consistency and no-bloat)
3. A standard result for consistency (that does not work for bloat)
4. A penalized fitness against bloat
5. Conclusion
The usual tools for Universal Consistency
• VC theory, and more generally statistical learning theory, can help us.
The usual tools for Universal Consistency
• VC bound (roughly): for f in a family F of functions, | P(f(x) = y) - P'(f(x) = y) | < (VCdim(F) / n)^(1/2) = o(1) ==> consistency of learning in F.
• Unfortunately, GP works on F = { Turing-computable functions } ==> VCdim(F) = infinity.
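For reference, a standard form of the VC bound (holding with probability at least 1 - δ over the n examples; constants and log factors vary across textbooks, so read it as a sketch rather than the paper's exact statement):

```latex
\sup_{f \in F} \bigl|\, P(f(x)=y) - P'(f(x)=y) \,\bigr|
  \;\le\; c \sqrt{\frac{\mathrm{VCdim}(F)\,\log n + \log(1/\delta)}{n}}
```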
Good news: classical application of VC theory
• However, VCdim( { functions with bounded length } ) is finite (under some constraints: bounded execution time) (see e.g. the book by Anthony & Bartlett).
Good news: classical application of VC theory
• Th 1: slowly increase the bound on length so that (VC(F)/n)^(1/2) = o(1) ==> | P(f(x) = y) - P'(f(x) = y) | = o(1); once the length bound is large enough to allow a good f, and once the o(1) is small, P(f(x) = y) is good!
• This is classical in statistical learning.
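A hedged sketch of the recipe, under the illustrative assumption that VCdim of the bounded-length class grows roughly linearly with the length bound (the names below are ours, not the paper's):

```python
import math

def length_bound(n):
    # Illustrative slowly increasing bound L(n): if VCdim of the
    # length-<=L class grows about linearly in L, then L(n) ~ n**0.25
    # keeps (VC(F)/n)**0.5 ~ n**(-0.375) = o(1).
    return max(1, int(n ** 0.25))

def vc_deviation(vc_dim, n):
    # The (VC(F)/n)^(1/2) term of the theorem.
    return math.sqrt(vc_dim / n)

# e.g. with n = 10000 examples: L(n) = 10, deviation ~ 0.03
print(length_bound(10000), vc_deviation(length_bound(10000), 10000))
```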
Bad news: bad for bloat!
• Consider f' = argmax P'(f(x) = y) among the f in F with length < L(n), where L(n) increases to infinity as a function of n.
• OK: if L increases slowly enough, P(f'(x) = y) ==> optimality.
• But we show that for some simple P, f' has length running to infinity, even though a bounded-length function is optimal.
The counter-example
Three areas, with X uniform in [-2, 2]:
• P(y = 1 | x < -1) = 9/10 (y probably 1)
• P(y = 1 | x > 1) = 1/10 (y probably 0)
• P(y = 1 | -1 < x < 1) = 1/2
[Figure: the three regions of [-2, 2] with these conditional probabilities.]
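This distribution is easy to simulate; a short Python sketch (the 0.7 optimum below follows directly from the stated probabilities):

```python
import random

def sample():
    # x uniform on [-2, 2]; y = 1 with probability 9/10, 1/10 or 1/2
    # depending on the region, as on the slide.
    x = random.uniform(-2, 2)
    if x < -1:
        p = 9 / 10
    elif x > 1:
        p = 1 / 10
    else:
        p = 1 / 2
    return x, (1 if random.random() < p else 0)

# Best possible accuracy: 0.25*0.9 + 0.25*0.9 + 0.5*0.5 = 0.7,
# reached by the short program "1 if x < 0 else 0".
```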
The counter-example
Optimal function: if (x < 0) then 1 else 0.
Function found by GP without penalization: fits most of the examples! No consistency, and bloat!
The counter-example
Optimal function: if (x < 0) then 1 else 0.
Function found by GP with small penalization: fits most of the examples! Consistency, but bloat!
The counter-example
Optimal function: if (x < 0) then 1 else 0.
Function found by GP with bigger penalization: OK! Consistency, no bloat!
The counter-example
Optimal function: if (x < 0) then 1 else 0.
Function found by GP with too strong penalization: short, but too simple! No consistency, no bloat!
Outline
1. The framework (symbolic regression in GP)
2. The goals (consistency and no-bloat)
3. A standard result for consistency
4. A penalized fitness against bloat
5. Conclusion
A solution
• Consider f' = argmax over f of P'(f(x) = y) - penalization(f, n).
• Choose penalization(f, n) just strong enough.
• Then, (i) universal consistency holds, (ii) length(f') --> minimal length of an optimal code.
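An illustrative Python sketch of this penalized selection over a finite candidate set; the penalty shape below (a log factor above sqrt(VC/n)) is one admissible choice, not necessarily the paper's exact form:

```python
import math

def penalization(vc_dim, n):
    # A penalty "just strong enough": it dominates the (VC/n)^(1/2)
    # deviation term (extra log factor) yet still vanishes as n grows.
    return math.sqrt(vc_dim / n) * math.log(n + 1)

def select(candidates, examples):
    # candidates: iterable of (f, vc_dim) pairs; returns the pair
    # maximizing P'(f(x) = y) - penalization(vc_dim, n).
    n = len(examples)
    def score(pair):
        f, vc = pair
        acc = sum(f(x) == y for x, y in examples) / n
        return acc - penalization(vc, n)
    return max(candidates, key=score)
```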
The proof (sketch) (1/3)
• Consider f' = argmax over f of P'(f(x) = y) - penalization(VC(f), n).
• We look for conditions on the penalization ensuring that P(f' ok) is maximal and length(f') is minimal.
The proof (sketch) (2/3)
• f' = argmax P'(f(x) = y) - penalization(VC(f), n).
• Assume that there exists some good f* (optimal in size).
• Let's show that for n large, f' = f* (roughly).
• If we restrict our attention to functions with VC(f) < K(n), then |P' - P| = O((VC(f)/n)^(1/2)).
• So if we forbid functions with VC(f) > K(n): f' = argmax P(ok) - penalization + O((VC(f)/n)^(1/2)).
The proof (sketch) (3/3)
• If the penalization is big in front of (VC(f)/n)^(1/2): f' = argmax P(ok) - penalization.
• K(n) increases slowly ==> penalization --> 0 ==> P(f' ok) --> its optimum ==> consistency!
• P(f' ok) - pen(f') ≥ P(f* ok) - pen(f*) + small terms
• small terms ≥ P(f' ok) - P(f* ok) ≥ pen(f') - pen(f*)
• ==> VC(f') is VC(f*) + o(1) ==> optimal size.
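The same chain in one display, writing P(f) for P(f(x) = y) and "small" for the O((VC/n)^(1/2)) deviation terms (a loose sketch, not the paper's exact statement):

```latex
P'(f') - \mathrm{pen}(f') \;\ge\; P'(f^*) - \mathrm{pen}(f^*)
\;\Longrightarrow\;
\mathrm{pen}(f') - \mathrm{pen}(f^*)
  \;\le\; \underbrace{P(f') - P(f^*)}_{\le\, 0} + \text{small}
  \;\le\; \text{small}
```

Hence pen(f') ≤ pen(f*) + o(1), so VC(f'), and with it the size of f', converges to the optimum.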
Outline
1. The framework (symbolic regression in GP)
2. The goals (consistency and no-bloat)
3. A standard result for consistency (that does not kill bloat)
4. A penalized fitness against bloat
5. Conclusion
Conclusion
• We use VC theory to ensure consistency; this is a standard application of VC theory, but as far as we know it is new in GP.
• We use VC theory to ensure no-bloat.
• For this, we have introduced a precise size-penalization.
• Experiments confirm the results (thanks to Open BEAGLE).
Limits
• We deal with perfect fitness optimization, i.e. we consider that f' = argmax P'(f(x) = y) + …
• The optimization of P'(.) is not so easy in practice!
• However, optimizing P'(…) + … would be pointless if the ideal optimization of P'(…) did not work!
Other elements in the paper
• Can we use hold-out to choose a bound on the length of programs?
• Can we use cross-validation to choose a bound on the length of programs?
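For intuition, a hedged sketch of the hold-out variant (not necessarily the paper's procedure; `gp_fit` is a hypothetical stand-in for a GP run under a length bound):

```python
def choose_length_bound(gp_fit, examples, bounds, split=0.7):
    # gp_fit(L, train) -> best program of length <= L found on train
    # (hypothetical GP call). Picks the bound whose program does best
    # on the held-out part of the examples.
    k = int(split * len(examples))
    train, held_out = examples[:k], examples[k:]
    def holdout_accuracy(L):
        f = gp_fit(L, train)
        return sum(f(x) == y for x, y in held_out) / len(held_out)
    return max(bounds, key=holdout_accuracy)
```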
References
• Many works proposing complexity penalization
• Many works reporting bloat
• Many papers explaining bloat (in particular "fitness causes bloat")
• Many good books about VC theory
Refs on the Dagstuhl site or on www.lri.fr/~teytaud
A final remark
• Runtime analysis of EAs uses Chernoff & Hoeffding bounds & ...
• Statistical learning is the generalization of Chernoff & Hoeffding bounds to the evaluation of distributions.
• Statistical learning is very promising, e.g., for EDA analysis.
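For reference, the Hoeffding bound in question, for i.i.d. Z_i with values in [0, 1]:

```latex
P\!\left( \left| \frac{1}{n}\sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1] \right| \ge \varepsilon \right)
  \;\le\; 2\exp\!\left(-2 n \varepsilon^{2}\right)
```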
Thanks for your attention
We would be very grateful for any comment / suggestion / ...!