Explore how information distance and sampling semantics redefine fitness in program synthesis systems. Understand program semantics, measure incorrect programs, and assess program fragments' quality using program theories.
Information Theory, Fitness and Sampling Semantics • Colin Johnson / University of Kent • John Woodward / University of Stirling
Schtick • The claim: we can use the set of ideas around entropy, information theory and algorithmic complexity as a way of assigning fitness in program synthesis systems. • This involves the idea of an information distance between sampling-semantics vectors and problem target definitions across a set of training cases. • In particular, we describe how to assign fitness to subprograms without putting them in the context of a whole program.
[Diagram: Program Text → Canonical form of I/O mapping; semantics in GP etc. = canonical-representation semantics]
Why do we Care about Semantics? • In the end, problems cash out as input-output behaviour. • By having an understanding of program semantics, we can: • avoid duplicating programs with different representations but the same I/O behaviour in the population • choose points for crossover/mutation in a more informed way • build new frameworks (such as geometric semantic GP) that manipulate program meanings.
[Diagram: Program Text → Canonical form of I/O mapping (canonical-representation semantics); Program Text → Vector of outputs on the training set (sampling semantics)]
Sampling Semantics • Sampling semantics (O’Neill, Nguyen, et al.) are a data-driven way of defining a semantic representation for any kind of function. • The sampling semantics of a function over a particular (ordered) training set T is simply a vector of the outputs of that function over T. • This emphasis on the set of outputs (rather than just, say, a sum of errors) allows us to define metrics on pairs of population members e.g. to define how close they are in meaning.
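As a concrete illustration, here is a minimal Python sketch (not from the original slides; the 4-input boolean training set and the two fragments are hypothetical): the sampling-semantics vector is just the tuple of outputs over the ordered training cases, and any metric on such vectors gives a semantic distance.

```python
from itertools import product

def sampling_semantics(f, training_set):
    """Vector of a function's outputs over an ordered training set."""
    return tuple(f(*case) for case in training_set)

def semantic_distance(sem_a, sem_b):
    """Hamming distance between two sampling-semantics vectors."""
    return sum(a != b for a, b in zip(sem_a, sem_b))

# Hypothetical training set: all 16 cases of a 4-input boolean problem.
training_set = list(product([0, 1], repeat=4))

xor_fragment  = lambda v3, v2, v1, v0: v2 ^ v3
xnor_fragment = lambda v3, v2, v1, v0: 1 - (v2 ^ v3)

sem_xor  = sampling_semantics(xor_fragment,  training_set)
sem_xnor = sampling_semantics(xnor_fragment, training_set)

print(sem_xor)                                # the fragment's output on each of the 16 cases
print(semantic_distance(sem_xor, sem_xnor))   # 16: the two fragments disagree on every case
```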
What do we really want? • GP assigns fitness on the basis of counting how many fitness cases are solved by each program in the population, or by summing up the total error. • This is the wrong thing to measure. • We want to measure whether sub-programs add information/structure that will make it easier for later parts of the program to solve the problem.
The Semantics of Wrong Programs • Much of computer science is interested in reasoning about correct programs (or, reasoning about whether programs are correct). • But, most programs are wrong most of the time during development. • We need ideas that help us to reason about wrong programs, and their relationship to the target specification. • Can we measure how much problem-specific structure a program fragment is generating?
Similarity Measures • When are two things similar? For example, two programs, or the output from a program and the target value? • Clearly, bitwise difference is not the most important thing.
Information Distance • Instead of pointwise distance, consider information distance (Vitányi et al.). • An example information distance is the length of the shortest program required to transform one thing into the other. • The program that outputs 10101010101010101010 against the target 01010101010101010101 is “better” than 01000010110100010110, even though the latter is fitter on a conventional measure.
Information Distance Fitness • Combine the idea of information distance and sampling semantics to get a new notion of fitness. • The fitness of a program fragment is the length of the smallest program required to transform the sampling semantics vector into the target vector. • A computationally grounded notion of “wrong” should be grounded in how much computation is needed to make the program “right”.
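A minimal sketch of that fitness, assuming (as the later slides suggest) that the shortest transforming program is approximated with a general-purpose compressor: compress the bitwise difference between the sampling-semantics vector and the target vector and take its length. The bitstrings are those from the information-distance slide; compressor overhead dominates on strings this short, so the gap in byte counts is small, but the ordering should typically favour the regularly wrong program.

```python
import zlib

def diff_string(semantics, target):
    """Bitwise difference between a semantics vector and the target vector."""
    return "".join("1" if a != b else "0" for a, b in zip(semantics, target))

def info_distance_fitness(semantics, target):
    """Approximate information distance: compressed size of the difference (lower is better)."""
    return len(zlib.compress(diff_string(semantics, target).encode()))

target = "01010101010101010101"
prog_a = "10101010101010101010"   # wrong on every case, but wrong in a regular way
prog_b = "01000010110100010110"   # fewer raw errors, but wrong in an irregular way

print(info_distance_fitness(prog_a, target))  # regular difference: compresses well
print(info_distance_fitness(prog_b, target))  # irregular difference: compresses less well
```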
Programs by Accumulation • Rather than the GP notion of a population of complete programs, we will find it easier to work with a set of program fragments. • Let us call these fragments “theories”. • Good theories represent partial solutions to all (or many) training cases; not complete solutions to some training cases. (We “cut horizontally” rather than “cutting vertically”). • We can compare theories by their information distance to the target.
Assigning Quality to Program Fragments • Most GP research to date assigns fitness to programs. That is, we need a complete program before we can assign it a fitness, and we don’t assign fitness directly to substructures. • In machine learning (e.g. C4.5), we assign “fitness” to combinations of features by using ideas like information gain. That is, we assign a fitness to a partial “program”.
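For comparison, a minimal sketch (with a hypothetical toy dataset, not taken from the slides) of the information-gain score that C4.5 uses to rate a single feature before any complete classifier exists:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Reduction in label entropy obtained by splitting on one feature."""
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical dataset: two boolean features; the label simply copies feature 0.
rows   = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
print(information_gain(rows, labels, 0))  # 1.0: feature 0 determines the label
print(information_gain(rows, labels, 1))  # 0.0: feature 1 carries no information
```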
Compressibility • One way to describe strings or mappings is in terms of their algorithmic information complexity, such as the Kolmogorov complexity. • Roughly speaking, this is a measure of the shortest possible program required to compute the string. • So, for example, 1010101010101010 can be described by a shorter program than 1001010101111010. • Non-computable; but, we can approximate it by running a compression algorithm on the string.
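A quick illustration of that approximation (a sketch, with zlib standing in for any general-purpose compressor). The slide's 16-bit examples are too short for the effect to show clearly over compressor overhead, so longer strings of the same two kinds are used here:

```python
import random
import zlib

def approx_complexity(bits):
    """Crude upper bound on Kolmogorov complexity: compressed length in bytes."""
    return len(zlib.compress(bits.encode()))

random.seed(0)
regular   = "10" * 5000                                         # like 1010...10, scaled up
irregular = "".join(random.choice("01") for _ in range(10000))  # like 1001...10, scaled up

print(approx_complexity(regular))    # small: a short description exists
print(approx_complexity(irregular))  # much larger: little structure to exploit
```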
Which is best: f1 or f2? • The diff between f1 and parity is more compressible: we can find a shorter description of it.
Compression-based Program Synthesis (TDFcomp) • Choose a set of functions F. • Create a construction set C, initially containing all of the input variables. • LOOP: • create a number of sub-programs (500 per iteration) by applying functions from F to members of C • calculate the difference (Hamming) between the output of each sub-program and the target on all inputs • choose the most compressible difference (measured with gzip) and add the corresponding sub-program to C • UNTIL C contains a program that solves the whole problem
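A minimal sketch of this loop (Python, boolean-only, with zlib standing in for gzip and 4-bit even parity as the hypothetical target). Random pairing of construction-set members, the 500-candidate iteration size, and the iteration cap are assumptions for illustration; quality values will not match the gzip-based numbers in the runs below, and convergence within the cap is not guaranteed.

```python
import random
import zlib
from itertools import product

# Function set F: binary boolean operators (name, implementation).
F = [
    ("AND",  lambda a, b: a & b),
    ("OR",   lambda a, b: a | b),
    ("XOR",  lambda a, b: a ^ b),
    ("XNOR", lambda a, b: 1 - (a ^ b)),
]
FN = dict(F)

N_VARS = 4
CASES  = list(product([0, 1], repeat=N_VARS))          # all 16 input cases
TARGET = [int(sum(case) % 2 == 0) for case in CASES]   # 4-bit even parity

def semantics(expr):
    """Sampling semantics: the sub-program's output on every input case."""
    name, args = expr
    if name == "VAR":
        return [case[args[0]] for case in CASES]
    left, right = semantics(args[0]), semantics(args[1])
    return [FN[name](x, y) for x, y in zip(left, right)]

def quality(expr):
    """Compressed size (bytes) of the Hamming difference from the target."""
    diff = "".join(str(s ^ t) for s, t in zip(semantics(expr), TARGET))
    return len(zlib.compress(diff.encode()))

def show(expr):
    name, args = expr
    return f"v_{args[0]}" if name == "VAR" else f"({name} {show(args[0])} {show(args[1])})"

random.seed(1)
C = [("VAR", (i,)) for i in range(N_VARS)]             # construction set: the input variables

for iteration in range(1, 21):
    # Create 500 sub-programs by applying functions from F to members of C.
    candidates = [(random.choice(F)[0], (random.choice(C), random.choice(C)))
                  for _ in range(500)]
    best = min(candidates, key=quality)                # most compressible difference
    C.append(best)
    print(f"Iteration {iteration}: {show(best)} with quality {quality(best)}")
    if semantics(best) == TARGET:
        print("Perfect solution found:", show(best))
        break
```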
Example: 4-bit even parity • run: • 0: (XNOR v_3 v_2) with quality 26 • 1: (XNOR v_2 v_3) with quality 26 • 2: (XNOR v_2 v_3) with quality 26 • 3: (XOR v_2 v_3) with quality 26 • 4: (XOR v_3 v_2) with quality 26 • 5: (AND v_2 v_1) with quality 27 • 6: (XNOR v_1 v_2) with quality 27 • 7: (XOR v_2 v_1) with quality 27 • 8: (1st v_1 v_0) with quality 28 • 9: (OR v_2 v_0) with quality 28 • 10: (OR v_2 v_3) with quality 28 • ...
Typical Run (1) • run: • ***************** Iteration 1 • (XNOR v_2 v_3) with quality 26 • ***************** Iteration 2 • (XOR (XNOR v_2 v_3) v_1) with quality 24 • ***************** Iteration 3 • (XOR (XOR (XNOR v_2 v_3) v_1) v_0) with quality 23 • ###################################### • 1 perfect solution found, which is: • (XOR (XOR (XNOR v_2 v_3) v_1) v_0) with quality 23 • BUILD SUCCESSFUL (total time: 1 second)
Typical Run (2) • run: • ***************** Iteration 1 • (XNOR v_3 v_2) with quality 26 • ***************** Iteration 2 • (XNOR (XNOR v_3 v_2) v_1) with quality 24 • ***************** Iteration 3 • (XOR (XNOR (XNOR v_3 v_2) v_1) v_0) with quality 23 • ***************** Iteration 4 • (NOT2 v_0 (XOR (XNOR (XNOR v_3 v_2) v_1) v_0)) with quality 23 • ###################################### • 14 perfect solutions found, which are: • (NOT2 v_0 (XOR (XNOR (XNOR v_3 v_2) v_1) v_0)) with quality 23 • ....
...and traditional GP for contrast! • run: • 0 2.0 XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0))) XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0)) OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2))))) AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2)) XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3))) XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0)))))) • 1 2.0 XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0))) XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0)) OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2))))) AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2)) XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3))) XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0)))))) • 2 2.0 XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0))) XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0)) OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2))))) AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2)) XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3))) XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0)))))) • 3 1.0 XOR(XOR(XOR(XOR(XOR(OR(d2 d3) AND(d3 d3)) OR(NAND(d0 XOR(d2 d2)) AND(d0 d2))) OR(OR(NAND(d3 d3) OR(d2 d3)) AND(XOR(d2 d3) OR(d3 d2)))) XOR(OR(NAND(NAND(d1 d3) NAND(d3 d0)) NAND(AND(d1 d3) XOR(d1 d3))) NAND(OR(XOR(d1 d3) NAND(d0 d2)) OR(AND(d2 d0) XOR(d0 d1))))) NAND(XOR(AND(OR(OR(d2 d3) OR(d0 d1)) OR(AND(d1 d0) AND(d3 d3))) AND(AND(XOR(d0 d3) AND(d0 d1)) XOR(AND(d2 d1) OR(d3 d0)))) NAND(NAND(AND(d2 XOR(d3 d0)) XOR(OR(d1 d3) d2)) AND(OR(AND(d2 d2) AND(d1 d1)) XOR(AND(d3 d2) XOR(d3 d3)))))) • 4 0.0 XOR(XOR(XOR(XOR(XOR(OR(d2 d3) AND(d3 d3)) OR(NAND(d0 d2) AND(d0 d2))) OR(OR(NAND(d3 d3) OR(d2 d3)) AND(XOR(d2 d3) OR(d3 d2)))) XOR(OR(NAND(NAND(XOR(d3 d0) d3) NAND(d3 d0)) NAND(AND(d1 d3) XOR(d1 d3))) NAND(OR(XOR(d1 d3) NAND(d0 d2)) OR(AND(d2 d0) XOR(d0 d1))))) NAND(XOR(AND(OR(OR(d2 d3) OR(d0 d1)) OR(AND(OR(d1 d2) d0) AND(d3 d3))) AND(AND(XOR(d0 d3) AND(d0 d1)) XOR(AND(d2 d1) OR(d3 d0)))) NAND(NAND(d1 XOR(OR(d1 d3) d1)) AND(OR(AND(d2 d2) AND(d1 d1)) XOR(AND(d3 d2) XOR(d3 d3)))))) • BUILD SUCCESSFUL (total time: 3 seconds)
...is this a fair example? • Perhaps this isn’t the fairest of examples. • The parity problem has the advantage that, once you have combined two variables with the XOR or XNOR operator, you have extracted all of the information out of them. • In other problems, this is not the case; e.g. in the multiplexer problem you need to use the address bits more than once. • ...but, there are ways of dealing with this.
The Big Picture • GP is measuring the wrong thing: • we want to measure how (algorithmically) complex the “gap” is between the current program fragment and the target, not the error between the current program (fragment) and the target • We have shown a way to give a fitness value to small components of a program during program synthesis, rather than having to always evaluate full programs. • Can we do more to remove the “bio-inspired” from the methods and replace it with computational/informational concepts?