{ merve.amil, nicolas.bredeche, christian.gagne, sylvain.gelly, marc.schoenauer, olivier.teytaud } @lri.fr (TAO, Inria, LRI, University Paris-Sud) With thanks to William Langdon. Bloat and Universal Consistency in GP.
1. The framework (symbolic regression in GP) 2. The goals (consistency and no-bloat) 3. A standard result for consistency 4. A penalized fitness against bloat 5. Conclusion Outline
Framework : symbolic regression (we use GP to search a space of Turing-computable functions for one that fits the examples)
Symbolic Regression
Examples :
X1 = [ 0.14 ; 2.07 ; -1 ], y1 = 0
X2 = [ 1 ; 2 ; 31.5 ], y2 = 1
...
Xn = [ 1.5 ; -1 ; 10.0 ], yn = 1
Hypothesis : the (xi,yi) are independent and identically distributed ==> law = law(x,y), unknown
Goal : ==> find a Turing-computable f (on real numbers) maximizing P( f(x)=y )
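The quantity that can actually be measured on the examples is the empirical accuracy P'(f(x)=y). A minimal sketch of that computation follows, using the three examples listed on the slide. The two candidate programs `candidate` and `always_one` are hypothetical and serve only as illustration.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]


def empirical_accuracy(f: Callable[[Sequence[float]], int],
                       examples: List[Example]) -> float:
    """Fraction of examples (x, y) with f(x) == y, i.e. P'(f(x)=y)."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(1 for x, y in examples if f(x) == y)
    return hits / len(examples)


# The three examples shown on the slide.
examples: List[Example] = [
    ([0.14, 2.07, -1.0], 0),
    ([1.0, 2.0, 31.5], 1),
    ([1.5, -1.0, 10.0], 1),
]


def candidate(x: Sequence[float]) -> int:
    """Hypothetical candidate program: test the sign of the third input."""
    return 1 if x[2] > 0 else 0


def always_one(x: Sequence[float]) -> int:
    """Hypothetical constant program."""
    return 1


print(empirical_accuracy(candidate, examples))
print(empirical_accuracy(always_one, examples))
```

A perfect score on three examples says nothing yet about P(f(x)=y), which is exactly the consistency question raised next.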
Framework : symbolic regression
Consistency : is the function that we output a good function ? (it might work only on the examples !)
No-bloat : does the function that we output grow without bound as the number of examples increases ?
What happens usually ?
• We know P', the empirical distribution (average of the Dirac masses at the (xi,yi))
• We do not know P (ill-posed problem)
• Two troubles :
  • Sometimes P(f(x)=y) is disappointing
  • Sometimes f is huge
• We study the behavior as n → infinity
What happens usually ?
• We know P', the empirical distribution (average of the Dirac masses at the (xi,yi))
• We do not know P
• We would like to maximize P(f(x)=y)
• We can only maximize fitness = P'(f(x)=y) (possibly with complexity-penalization terms)
Q: Assume that we perfectly optimize “fitness”. Does this lead to a good function ? Does this lead to no-bloat ?
What questions do we want to answer ?
• Universal consistency : can we ensure that P(f(x)=y) ==> optimality as n → infinity ? (at least if an optimal f exists !)
• Bloat : can we ensure that bloat does not occur, i.e. if a correct program f of bounded length exists, can we ensure that length(f) does not go to infinity ?
1. The framework (symbolic regression in GP) 2. The goals (consistency and no-bloat) 3. A standard result for consistency (that does not work for bloat) 4. A penalized fitness against bloat 5. Conclusion Outline
The usual tools for Universal Consistency
• VC-theory, and more generally statistical learning theory, can help us
The usual tools for Universal Consistency
• VC-bound (roughly) : for f in a family F of functions, $|P(f(x)=y) - P'(f(x)=y)| < (\mathrm{VCdim}(F)/n)^{1/2} = o(1)$ ==> consistency of learning in F
• Unfortunately, GP works on F = { Turing-computable functions } ==> VCdim(F) = infinity
Good news : classical application of VC
• However, VCdim( { functions with bounded length } ) is finite (under some constraints : bounded execution time) (see e.g. the book by Anthony & Bartlett)
Good news : classical application of VC
• Th 1 : Slowly increase the bound on length so that $(\mathrm{VC}(F)/n)^{1/2} = o(1)$ ==> $|P(f(x)=y) - P'(f(x)=y)| = o(1)$. Once the length bound is large enough to allow a good f, and once the o(1) term is small, P(f(x) = y) is good !
• This is classical in statistical learning
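To get a feeling for what "slowly increase the bound" means, the sketch below evaluates the rough VC term (VC(F)/n)^(1/2) under a hypothetical schedule in which the VC-dimension allowed at sample size n grows like log2(n) + 1. The schedule `capacity_schedule` is an assumption made for illustration, and the constants of the actual bound are omitted.

```python
import math


def vc_gap(vc_dim: float, n: int) -> float:
    """Rough VC term sqrt(VC / n), constants omitted."""
    return math.sqrt(vc_dim / n)


def capacity_schedule(n: int) -> float:
    """Hypothetical slowly growing capacity: VC(F_n) = log2(n) + 1."""
    return math.log2(n) + 1.0


for n in (10, 1_000, 100_000, 10_000_000):
    vc = capacity_schedule(n)
    print(n, round(vc, 2), round(vc_gap(vc, n), 4))
```

The capacity grows without bound, so any fixed-length good f is eventually allowed, while the gap term still shrinks toward 0.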
Bad news : bad for bloat !
• Consider f' = argmax P'(f(x)=y) among functions f in F with length < L(n), where L(n) increases to infinity with n
• OK, if L increases slowly enough, P(f'(x)=y) ==> optimality
• But we show that for some simple P, the length of f' goes to infinity, even though a bounded-length function is optimal
The counter-example
x uniform in [-2,2], three areas :
• x < -1 : P(y=1|x) = 9/10 (y probably = 1)
• -1 < x < 1 : P(y=1|x) = 1/2
• x > 1 : P(y=1|x) = 1/10 (y probably = 0)
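The distribution above is easy to simulate. The sketch below draws examples from it and computes the best accuracy any function can reach, by predicting the more likely label in each area. The names `AREAS`, `sample` and `best_achievable_accuracy` are ours, chosen for illustration.

```python
import random
from typing import List, Tuple

# (lower bound, upper bound, P(y=1 | x in area)) for the three areas.
AREAS = [(-2.0, -1.0, 0.9), (-1.0, 1.0, 0.5), (1.0, 2.0, 0.1)]


def p_y1(x: float) -> float:
    """P(y=1 | x) as defined by the three areas."""
    for low, high, p in AREAS:
        if low <= x <= high:
            return p
    raise ValueError("x must lie in [-2, 2]")


def sample(n: int, seed: int = 0) -> List[Tuple[float, int]]:
    """Draw n i.i.d. examples (x, y) with x uniform in [-2, 2]."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        y = 1 if rng.random() < p_y1(x) else 0
        data.append((x, y))
    return data


def best_achievable_accuracy() -> float:
    """Accuracy of predicting the more likely label in every area."""
    total = sum(high - low for low, high, _ in AREAS)
    return sum((high - low) / total * max(p, 1.0 - p) for low, high, p in AREAS)


print(best_achievable_accuracy())
```

Because the labels are noisy, even an optimal function cannot fit all examples, which is what makes fitting "most examples" suspicious.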
The counter-example
Optimal function : if (x<0) then 1 else 0
Function found by GP without penalization : fits most examples ! No consistency, and bloat !
The counter-example
Optimal function : if (x<0) then 1 else 0
Function found by GP with small penalization : fits most examples ! Consistency, but bloat !
The counter-example
Optimal function : if (x<0) then 1 else 0
Function found by GP with bigger penalization : ok ! Consistency, no bloat !
The counter-example
Optimal function : if (x<0) then 1 else 0
Function found by GP with too strong penalization : ok ! No consistency, no bloat !
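The unpenalized case can be imitated without running GP at all. In the sketch below, a nearest-neighbour lookup table stands in for "a function that fits every example": this is our stand-in, not the program GP actually produces. Its size equals the number of examples, and its accuracy on fresh data stays below that of the simple rule that predicts the more likely label in each area.

```python
import bisect
import random
from typing import Callable, List, Tuple

Data = List[Tuple[float, int]]


def p_y1(x: float) -> float:
    """P(y=1 | x) for the three areas of the counter-example."""
    if x < -1:
        return 0.9
    if x > 1:
        return 0.1
    return 0.5


def sample(n: int, seed: int) -> Data:
    rng = random.Random(seed)
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    return [(x, 1 if rng.random() < p_y1(x) else 0) for x in xs]


def accuracy(f: Callable[[float], int], data: Data) -> float:
    return sum(1 for x, y in data if f(x) == y) / len(data)


def memorizer(train: Data) -> Tuple[Callable[[float], int], int]:
    """Stand-in for a bloated program: return the label of the nearest
    stored example. Its size is the number of stored examples."""
    points = sorted(train)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    def f(x: float) -> int:
        i = bisect.bisect_left(xs, x)
        if i == 0:
            return ys[0]
        if i == len(xs):
            return ys[-1]
        return ys[i] if xs[i] - x < x - xs[i - 1] else ys[i - 1]

    return f, len(xs)


def region_rule(x: float) -> int:
    """Predict the more likely label in each area (ties go to 0)."""
    return 1 if p_y1(x) > 0.5 else 0


train = sample(2000, seed=1)
test = sample(20000, seed=2)
f_mem, size = memorizer(train)
print("size of memorizer:", size)
print("memorizer train / test:", accuracy(f_mem, train), accuracy(f_mem, test))
print("simple rule test:", accuracy(region_rule, test))
```

Doubling the training set doubles the size of the memorizer while its test accuracy does not catch up, which is the "no consistency, and bloat" situation in miniature.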
A solution
• Consider f' = argmax over f of [ P'(f(x)=y) - penalization(f,n) ]
• Choose penalization(f,n) just strong enough
• Then, (i) universal consistency holds (ii) length(f') --> minimal length of an optimal code
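As a minimal sketch of the selection rule, the code below picks, from a small finite pool, the program maximizing empirical accuracy minus a penalty. Everything specific here is an assumption made for illustration: the penalty sqrt(size) * n^(-1/4), the use of program size as a proxy for complexity, the three candidates, and the noise-free toy data (which is not the counter-example distribution).

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, Callable[[float], int], int]  # (name, program, size)
Data = List[Tuple[float, int]]


def penalization(size: int, n: int) -> float:
    """Hypothetical penalty sqrt(size) * n**(-1/4): it tends to 0 for a
    fixed size, yet is large compared to sqrt(size / n)."""
    return (size ** 0.5) * (n ** -0.25)


def penalized_fitness(candidate: Candidate, data: Data) -> float:
    _, program, size = candidate
    n = len(data)
    acc = sum(1 for x, y in data if program(x) == y) / n
    return acc - penalization(size, n)


def select(candidates: List[Candidate], data: Data) -> Candidate:
    """argmax of the penalized fitness over a finite pool."""
    return max(candidates, key=lambda c: penalized_fitness(c, data))


# Toy noise-free data: y = 1 exactly when x > 0.5, x on a grid in [-1, 1).
data: Data = [(i / 500 - 1, 1 if i / 500 - 1 > 0.5 else 0) for i in range(1000)]

candidates: List[Candidate] = [
    ("constant", lambda x: 0, 1),
    ("threshold", lambda x: 1 if x > 0.5 else 0, 3),
    # Same behaviour as "threshold" but with a deliberately inflated size.
    ("bloated", lambda x: 1 if x > 0.5 else 0, 50),
]

print(select(candidates, data)[0])
```

Between two programs with the same accuracy the penalty breaks the tie in favour of the shorter one, while it is still weak enough here not to push the choice down to the constant program.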
The proof (sketch) (1/3)
• Consider f' = argmax P'(f(x)=y) - penalization(VC(f),n)
• We look for conditions on the penalization ensuring that P(f' ok), i.e. P(f'(x)=y), is maximal and length(f') is minimal
The proof (sketch) (2/3)
• f' = argmax P'(f(x)=y) - penalization(VC(f),n)
• Assume that there exists some good f* (optimal in size)
• Let's show that for n large, f' = f* (roughly)
• If we restrict our attention to functions with VC(f) < K(n), then $|P' - P| = O((\mathrm{VC}(f)/n)^{1/2})$
• So if we forbid functions with VC(f) > K(n), f' = argmax P(ok) - penalization + $O((\mathrm{VC}(f)/n)^{1/2})$
The proof (sketch) (3/3)
• If the penalization is large compared to $(\mathrm{VC}(f)/n)^{1/2}$, then f' = argmax P(ok) - penalization (roughly)
• K(n) increases slowly and penalization → 0 ==> P(f' ok) → optimum : consistency !
• P(f' ok) - pen(f') ≥ P(f* ok) - pen(f*) + small terms
• small terms ≥ P(f' ok) - P(f* ok) ≥ pen(f') - pen(f*)
• VC(f') is VC(f*) + o(1) ==> optimal size
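One way to spell out the chain of inequalities behind this sketch is given below. The notation A, A' and ε_n is introduced here and does not appear on the slides, and the uniform deviation bound is taken as an assumption holding with high probability.

```latex
% Notation introduced here (not on the slides):
%   A(f) = P(f(x)=y),  A'(f) = P'(f(x)=y),  \varepsilon_n(f) = C \sqrt{VC(f)/n}
% Assumption: with high probability, |A'(f) - A(f)| <= \varepsilon_n(f)
% for every f with VC(f) <= K(n), and VC(f^*) <= K(n).
\begin{align*}
A'(f') - \mathrm{pen}(f') &\ge A'(f^*) - \mathrm{pen}(f^*)
  && \text{($f'$ maximizes the penalized fitness)} \\
A(f') - A(f^*) &\ge \mathrm{pen}(f') - \mathrm{pen}(f^*)
  - \varepsilon_n(f') - \varepsilon_n(f^*)
  && \text{(replace $A'$ by $A$ on both sides)} \\
0 &\ge A(f') - A(f^*)
  && \text{($f^*$ is optimal)} \\
\mathrm{pen}(f') - \varepsilon_n(f') &\le \mathrm{pen}(f^*) + \varepsilon_n(f^*)
  && \text{(combine the two previous lines)}
\end{align*}
```

If the penalization dominates ε_n, the last line says that pen(f') cannot exceed pen(f*) by more than a vanishing factor, which is the size statement. The second line, together with pen(f') ≥ ε_n(f'), gives A(f') ≥ A(f*) - pen(f*) - ε_n(f*), which tends to A(f*) as the penalization of the fixed f* goes to 0.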
1. The framework (symbolic regression in GP) 2. The goals (consistency and no-bloat) 3. A standard result for consistency (that does not kill bloat) 4. A penalized fitness against bloat 5. Conclusion Outline
Conclusion
• we use VC-theory to ensure consistency ; this is a standard application of VC-theory, but as far as we know it is new in GP
• we use VC-theory to ensure no-bloat
• for this, we have introduced a precise size-penalization
• experiments confirm the results (thanks to openBeagle)
Limits
• We deal with perfect fitness optimization, i.e. we assume that f' = argmax P'(f(x)=y) + … is found exactly
• The optimization of P'(.) is not so easy in practice !
• However, optimizing P'(…)+… in practice would be pointless if even the ideal optimization of P'(…) did not work !
Other elements in the paper • Can we use hold-out to choose a bound on the length of programs ? • Can we use cross-validation to choose a bound on the length of programs ?
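The hold-out idea can be sketched as follows: for each candidate bound on the length, keep the program that fits the training examples best among those within the bound, then retain the bound whose winner scores best on examples set aside. This is only an illustration of the general hold-out principle, not the procedure analysed in the paper. The pool of programs, the toy data with a few flipped labels, and the name `choose_length_bound` are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Data = List[Tuple[float, int]]
Candidate = Tuple[Callable[[float], int], int]  # (program, length)


def accuracy(program: Callable[[float], int], data: Data) -> float:
    return sum(1 for x, y in data if program(x) == y) / len(data)


def choose_length_bound(candidates: List[Candidate], train: Data,
                        holdout: Data, bounds: List[int]) -> Optional[int]:
    """Return the bound whose best-on-train program is best on hold-out.
    Ties are resolved in favour of the smaller bound."""
    best_bound, best_score = None, -1.0
    for bound in sorted(bounds):
        pool = [c for c in candidates if c[1] <= bound]
        if not pool:
            continue
        program, _ = max(pool, key=lambda c: accuracy(c[0], train))
        score = accuracy(program, holdout)
        if score > best_score:
            best_bound, best_score = bound, score
    return best_bound


def rule(x: float) -> int:
    """Toy target used to label the data."""
    return 1 if x > 0.5 else 0


# Toy data on two interleaved grids, with a few labels flipped as noise.
train: Data = [(i / 20, rule(i / 20)) for i in range(20)]
train[3] = (train[3][0], 1)
train[15] = (train[15][0], 0)
holdout: Data = [((i + 0.5) / 20, rule((i + 0.5) / 20)) for i in range(20)]
holdout[7] = (holdout[7][0], 1)

table = dict(train)  # lookup table that memorizes the training labels
candidates: List[Candidate] = [
    (lambda x: 0, 1),                  # constant program
    (rule, 3),                         # short threshold program
    (lambda x: table.get(x, 0), 40),   # long memorizing program
]

print(choose_length_bound(candidates, train, holdout, [1, 5, 50]))
```

With the largest bound the memorizing program wins on the training examples but does poorly on the hold-out examples, so the intermediate bound is retained.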
References
• Many works proposing complexity penalization
• Many works reporting bloat
• Many papers explaining bloat (in particular « fitness causes bloat »)
• Many good books about VC-theory
Refs on the Dagstuhl site or on www.lri.fr/~teytaud
A final remark
• Runtime analysis of EA uses Chernoff & Hoeffding & ...
• Statistical learning is the generalization of Chernoff & Hoeffding bounds for the evaluation of distributions
• Statistical learning is very promising, e.g. for EDA-analysis
Thanks for your attention We will be very grateful for any comment / suggestion / ... !