Lecture 2

As usually happens, the “discovery” of genetic algorithms led to the “discovery” of some theorems that could be interpreted as implying some kind of convergence for the processes involved (note all the qualifiers in this sentence). There was, as one might expect, a certain amount of “reinvention of the wheel”: after all, biologists had been doing mathematical modeling of evolutionary processes since the time of the First World War, while the first theoretical results on GAs started appearing in the 1970s.
The idea of a schema: how can we represent a subset of the population whose individuals all share some characteristics? When the chromosomes are represented by fixed-length strings over a finite alphabet, a schema is a pattern of the same length as a chromosome, some of whose positions are filled by elements of the alphabet and some by “don’t care” symbols. Over a binary alphabet, a schema could be represented as (* * * 1 * * 0 0 * * 1), indicating any string of 11 binary digits with ones and zeros in the positions indicated, and anything at all (zeros or ones) in the positions containing an asterisk.
What are the effects of the usual operators (cross-over and mutation) on a schema? The probability that a schema will mutate into a different schema grows with the number of fixed bits in the schema, each of which mutates independently with the bit-mutation probability. The position of those bits is immaterial, but the more “defined” a schema is, the more likely it is to be destroyed by mutation. The probability that a schema will cross over into a different schema (assume single-point cross-over) depends on the distance between the first fixed bit and the last fixed bit: the larger the distance, the greater the probability that the cross-over break will occur strictly between those bits, and so, unless both parents are representatives of the same schema, the descendant is very likely to belong to neither.
Definitions: the order of a schema is the number of defined bits in the schema; the defining length of a schema is the distance between its first and last defined bits (count from the bit after the first fixed bit and stop at the last, inclusive; equivalently, start counting at the first fixed bit, but start the count from 0). Intuitively at least, schemata with low order and short defining length have a higher probability of “surviving” from one generation to the next, all other conditions being equal. Before going on to the discussion in Langdon and Poli, we will introduce the somewhat simpler discussion in Mitchell.
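These two quantities are easy to compute directly from the pattern; a minimal sketch in Python (the helper names are ours, not from the text), writing a schema as a string with '*' as the don’t-care symbol:

```python
def schema_order(schema: str) -> int:
    """Order: the number of defined (non-'*') positions in the schema."""
    return sum(1 for c in schema if c != '*')

def defining_length(schema: str) -> int:
    """Defining length: distance between the first and last defined bits
    (0 for schemata with at most one defined bit)."""
    fixed = [i for i, c in enumerate(schema) if c != '*']
    return fixed[-1] - fixed[0] if len(fixed) > 1 else 0

# The 11-bit schema (* * * 1 * * 0 0 * * 1) from the earlier slide:
print(schema_order("***1**00**1"))     # 4
print(defining_length("***1**00**1"))  # 7  (defined bits at positions 3 and 10)
```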
Observation: not every subset of the set of binary strings of length l is representable by a schema. Proof. There are 2^l such strings, and so the power set of such a set has cardinality 2^(2^l). On the other hand, the set of schemata has cardinality only 3^l (each of the l positions holds 0, 1, or *). Observation: every bit string of length l is an instance of 2^l schemata. Proof. Consider the set of all schemata whose positions either are fixed by the character (0 or 1) in the same position of the given string or contain the don’t-care character: there are two choices at each of the l positions. Observation: any given population of n strings contains instances of between 2^l and n·2^l schemata. Proof. There are just as many schemata, times 2^l, as there are different strings.
Observation: at each generation, the Genetic Algorithm explicitly evaluates the fitness of n strings, while simultaneously (and implicitly) evaluating the fitness of all the schemata of which they are instances. Caveat. The problem with this is that each schema may have too few string instances to conclude that the sample average (over the subset of the n strings that match this schema) is anywhere close to the population average (over all strings that match this schema). Furthermore, as the algorithm progresses, one might expect many schemata, especially the ones of higher order, to have very little, if any, representation in the sample of the current generation. No conclusion could realistically be claimed about their fitness.
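The implicit evaluation can be made concrete: the observed fitness of a schema is just the sample mean over its instances in the current population. A sketch (helper names and the ones-counting fitness are ours, for illustration only):

```python
def matches(schema: str, x: str) -> bool:
    """True if string x is an instance of the schema ('*' = don't care)."""
    return all(s in ('*', c) for s, c in zip(schema, x))

def observed_schema_fitness(schema, population, fitness):
    """Sample-average fitness of a schema over its instances in the
    population; None when the schema has no instance at all (the caveat)."""
    inst = [x for x in population if matches(schema, x)]
    return sum(fitness(x) for x in inst) / len(inst) if inst else None

# Toy population; fitness = number of ones (a stand-in, not from the text).
pop = ["10110", "10011", "00000", "11111"]
ones = lambda x: x.count("1")
print(observed_schema_fitness("1****", pop, ones))  # mean over its 3 instances
```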
We are now ready to develop some machinery to state and prove Holland’s “schema theorem” (using the later notation in L&P).
• Let H be a schema with at least one instance in the population at time t.
• Let m(H, t) be the number of instances of H at time t.
• Let f(H, t) be the observed (based on the sample) average fitness of (the strings of) H at time t.
• Let f(x) be the fitness of a string x; let f̄(t) be the average fitness of the population at time t.
• We want to calculate E(m(H, t + 1)), the expected number of instances of H at time t + 1.
A crucial issue is the methodology chosen to select the “reproductive probability” of any individual in the current population. We will assume selection to be carried out so that the expected number of offspring of a string x at time t is equal to f(x)/f̄(t). This will be called fitness-proportionate reproduction. Without the effects of cross-over or mutation, we have

E(m(H, t + 1)) = Σ_{x ∈ H} f(x)/f̄(t) = m(H, t) f(H, t)/f̄(t),

where x ∈ H denotes “x is an instance of H”. This result lets us claim that, even though the mean sample fitness f(H, t) of a schema is never explicitly calculated, that value does appear in the development.
We now must re-introduce cross-over and mutation. Both can have destructive as well as constructive effects on the expectation for a schema. If we compute only the destructive effects, we will have a lower bound on the expectation for a schema.
• Let p_c denote the probability that single-point cross-over will be applied to a string.
• Let S_c(H) denote the probability that a schema H will survive single-point cross-over, i.e., that at least one of the descendants of an instance of H is also an instance of H.
• Let L(H) denote the defining length of H.
• Let l be the length of the bit-strings in the search space.
The probability that a schema will be destroyed can be no worse than p_c · L(H)/(l − 1), which implies that the probability of survival of a schema under the effects of cross-over satisfies the inequality

S_c(H) ≥ 1 − p_c · L(H)/(l − 1).

The disruptive effects of mutation (we are interested in worst-case scenarios) have to be added.
• Let p_m denote the probability of any bit being mutated.
• Let o(H) denote the order of H (= the number of defined bits in H).
• Let S_m(H) denote the probability that the schema H will survive under mutation of an instance of H. We have the inequality S_m(H) ≥ (1 − p_m)^o(H).

The final result, incorporating all the probabilities just derived (the probability that the schema is generated, survives cross-over, and survives mutation), is Holland’s Schema Theorem:

E(m(H, t + 1)) ≥ m(H, t) · (f(H, t)/f̄(t)) · (1 − p_c · L(H)/(l − 1)) · (1 − p_m)^o(H).
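The bound is straightforward to evaluate numerically. A sketch (parameter names ours): fitness-proportionate growth, discounted by the worst-case cross-over and mutation disruption terms just derived:

```python
def holland_lower_bound(m_H, f_H, f_bar, p_c, p_m, L_H, o_H, l):
    """Lower bound on E[m(H, t+1)]: proportionate growth m*(f_H/f_bar),
    times worst-case survival under cross-over and under mutation."""
    survive_xo = 1 - p_c * L_H / (l - 1)
    survive_mut = (1 - p_m) ** o_H
    return m_H * (f_H / f_bar) * survive_xo * survive_mut

# A schema at the population mean, untouched by the operators, just holds on:
print(holland_lower_bound(10, 1.0, 1.0, p_c=0.0, p_m=0.0, L_H=7, o_H=4, l=11))  # 10.0
```

Note how a long defining length L_H and a high order o_H both pull the bound down, matching the earlier intuition about “fragile” schemata.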
An optimistic interpretation of this result goes as follows: short, low-order schemata whose average fitness remains above the mean will receive an exponentially increasing number of samples (instances) over time, since the number of samples of those schemata that are not disrupted and remain above average in fitness increases by a factor of f(H, t)/f̄(t) at each generation. A problem with that interpretation is that in a potential population of, say, 2^50 ≈ 10^15 individuals, the actual populations for our evolutionary world are rarely larger than 50-100 individuals. Assuming that they are “representative” and do not become badly skewed as the algorithm progresses is not quite realistic.
Similar considerations of the effect of small populations on evolutionary progress hold for just about any result we can imagine. The next result we look at is an earlier one [Price, 1970, Nature], which was originally presented in the context of biological evolution, and which is claimed to imply Holland’s Schema Theorem. We start by reviewing a couple of definitions from elementary probability theory: variance and covariance [W. Feller, An Introduction to Probability Theory and Its Applications, Wiley, 1957]. Let X and Y be two random variables over the same sample space. Then X + Y and XY are again random variables, with distributions obtained in an obvious manner from the original ones.
Definition. The expectation of XY is given by E(XY) = Σ_{j,k} x_j y_k p(x_j, y_k), provided the series converges absolutely (replace summations by integrals for more general distributions). Claim: if E(X²) and E(Y²) exist, then E(XY) exists. Pf.: |x_j y_k| ≤ (x_j² + y_k²)/2. Definition. Let m_x = E(X), m_y = E(Y) denote the expectations of the two variables. Claim: the variables X − m_x and Y − m_y have mean 0. Claim: E((X − m_x)(Y − m_y)) = E(XY) − m_x E(Y) − m_y E(X) + m_x m_y = E(XY) − m_x m_y.
Definition. The covariance of X and Y is defined by Cov(X, Y) = E((X − m_x)(Y − m_y)) = E(XY) − m_x m_y. The definition is meaningful whenever X and Y have finite variances. Theorem. If X and Y are independent variables, Cov(X, Y) = 0. Proof. If X and Y are independent, E(XY) = E(X)E(Y). Caveat: the converse is not true (e.g., for X uniform on {−1, 0, 1} and Y = X², the variables are dependent, yet Cov(X, Y) = E(X³) − E(X)E(X²) = 0).
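A small sketch of these definitions on finite samples, including a dependent-but-uncorrelated pair that illustrates the caveat (the data are made up):

```python
def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance via the identity Cov(X, Y) = E(XY) - E(X)E(Y)."""
    exy = mean([x * y for x, y in zip(xs, ys)])
    return exy - mean(xs) * mean(ys)

xs = [-1, 0, 1]
ys = [x * x for x in xs]   # Y = X^2: fully determined by X, hence dependent
print(cov(xs, ys))         # 0.0 - zero covariance despite total dependence
```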
Theorem (Price’s Selection and Covariance Theorem). Let
• Q = frequency of a given gene (or linear combination of genes; think of a gene as a binary vector of length N) in the population;
• ΔQ = change in Q from one generation to the next;
• q_i = frequency of the gene in individual i;
• z_i = number of offspring produced by individual i;
• z̄ = mean number of children produced.
Then

ΔQ = Cov(z_i, q_i)/z̄.

Caveat: the claims in the textbook (quoting the literature) about the conditions under which it is true (in the real world) sound much too good to be true… In particular, it should hold for all our applications.
Note: the covariance is related to the correlation coefficient

ρ(X, Y) = Cov(X, Y)/(s_x s_y),

where s_x and s_y are the standard deviations (square roots of the variances) of the random variables X and Y, with s_x² = Var(X) = E(X²) − (E(X))². A positive covariance is equivalent to a positive correlation coefficient: the offspring population would be positively correlated with its parent population, which is something you might expect. Proof (of Price’s Theorem). We introduce some terminology to justify the set-up for the theorem, and to carry out the details of the proof and the simplifications involved.
Let:
• P1 = parent population;
• P2 = child population;
• M = size of the initial population;
• g_i = number of copies of the gene in individual i;
• q_i = g_i (why? see later);
• q̄ = arithmetic mean of the q_i in population P1;
• Q1 = frequency of the given gene (or linear combination of genes) in P1;
• Q2 = frequency of the gene in P2;
• z_i = number of offspring produced by individual i;
• z̄ = mean number of children produced.
• g′_i = number of copies of the gene in all code fragments in the next population produced by individual i;
• q′_i = frequency of the gene in the offspring produced by individual i, defined by q′_i = g′_i/z_i;
• Δq_i = q′_i − q_i.
The first step involves finding the frequency of the gene in the current population:

Q1 = (Σ_i q_i)/M = q̄.
The second step involves computing the frequency of the gene in the descendant population:

Q2 = (Σ_i z_i q′_i)/(Σ_i z_i) = (Σ_i z_i q′_i)/(M z̄).
Subtracting, and writing q′_i = q_i + Δq_i:

ΔQ = Q2 − Q1 = (Σ_i z_i q_i)/(M z̄) − q̄ + (Σ_i z_i Δq_i)/(M z̄) = Cov(z_i, q_i)/z̄ + (Σ_i z_i Δq_i)/(M z̄).

We are now left with justifying the vanishing of the second fraction. Its numerator is Σ_i z_i Δq_i, where the first factor, z_i, is just the number of offspring produced by i, while the second factor, Δq_i, is just the change in frequency of the gene from parent to offspring. The claim made is that “if fertilization and meiosis are random w.r.t. the gene, the summation will be zero, except for statistical sampling effects (random drift), and these will tend to average out over generations”; i.e., its expected value vanishes. Selection for reproduction depends on fitness (and thus on the presence of specific genes), while the action of cross-over and mutation is random and independent of the genes (which needs proof…).
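A quick numerical check of the identity on a toy parent population in which every Δq_i = 0 (each offspring simply inherits its parent’s gene frequency; the numbers are made up):

```python
q = [1, 0, 1, 1]   # gene frequency carried by each of the M = 4 parents
z = [3, 0, 1, 0]   # offspring counts; total stays at M = 4
M = len(q)
z_bar = sum(z) / M
Q1 = sum(q) / M                                      # frequency in the parents
Q2 = sum(zi * qi for zi, qi in zip(z, q)) / sum(z)   # frequency in the children
cov_zq = sum(zi * qi for zi, qi in zip(z, q)) / M - z_bar * Q1
print(Q2 - Q1, cov_zq / z_bar)  # both 0.25: Price's identity with zero drift
```

The gene-carrying parent with three offspring drags the frequency up, and the covariance term accounts for exactly that shift.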
And we finally have the result. Corollary. If the population size is unchanged from one generation to the next, and if two parents are required for each individual created by cross-over, then

ΔQ = Cov(z_i, q_i)/(1 + p_c),

where p_c is the fraction of children created by cross-over between two parents. Proof. Let p_r be the fraction of children identical to their parent, let p_m be the fraction that are mutated copies of their single parent, and let p_c be the fraction created by cross-over. We first observe that, if all children were created by cross-over, the mean number of children (per parent),
denoted by z̄, would be two (the generational size is assumed constant, and each cross-over child is counted once for each of its two parents). Taking into account the other two “methods” of reproduction, z̄ = p_r + p_m + 2p_c. Assuming they are all mutually exclusive, we have the constraint p_r + p_m + p_c = 1. This gives z̄ = 1 + p_c, and the corollary.
Tournament Selection. There are multiple ways of selecting the members of the population that will contribute to the next generation. As one would expect, there is no best way. In general, one would expect the fittest individuals to contribute proportionally more descendants to the next generations, while the less fit ones contribute fewer or even none. The problem is the determination of this proportionality, associated with the choice of who, specifically, will reproduce and who will be sterile. Tournament selection is a way of choosing the part of the original population that will reproduce, and of providing this part with a ranking. As it turns out, this ranking does not depend directly and exclusively on fitness.
Definition. Let M > 0 be the size of a population and let 2 ≤ T ≤ M be the size of the tournament. Select T individuals at random from the original population. The fittest individual among the T selected is copied into a “reproducing population”. The process is repeated, with replacement, M times. The individuals in the reproducing population are ranked by the number of times they have been selected (i.e., they are sorted by score). If the reproducing population has size M, the best individual has rank r = M, the next best has rank r = M − 1, and so on, all the way down to rank r = 1. Ties may be broken arbitrarily or by some secondary criterion. The lowest ranks will be filled by those individuals who were not chosen through the tournaments. This would permit the fittest individual to end up at the very bottom of the reproducing-population rankings: unlikely, but possible. Why do this? Ostensibly, to avoid premature convergence…
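The selection step of the definition can be sketched directly (the seeding and helper names are ours):

```python
import random

def tournament(population, fitness, T, rng):
    """One tournament: draw T individuals uniformly, with replacement,
    and return the fittest of them."""
    return max((rng.choice(population) for _ in range(T)), key=fitness)

def reproducing_population(population, fitness, T, rng):
    """Run M independent tournaments; one winner is copied per tournament."""
    M = len(population)
    return [tournament(population, fitness, T, rng) for _ in range(M)]

rng = random.Random(0)
pop = list(range(1, 11))       # individuals 1..10, fitness = the value itself
winners = reproducing_population(pop, lambda x: x, 2, rng)
print(len(winners))            # 10: same size as the original population
```

Ranking by score is then just a matter of counting how often each individual appears in `winners`.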
We need to estimate the expected number of children, to use in a variant of Price’s Theorem: we try to show that, under some reasonable assumptions, E(z_i) ≈ t (i/M)^(t−1), where t is the size of the tournament, M the size of the population, and i the index (and rank) of the individual whose offspring we wish to estimate. Our own textbook is, at best, telegraphic in its derivation of the estimate.
The description in Mitchell’s book is somewhat different from what we will use, which is based on a discussion by Goldberg. In many instances, the tournament size is just 2. The potential for confusion is a common problem in many discussions: the terms used are not quite identical in meaning, and many texts discuss the idea. We now estimate the number of individuals in the tournament-selected population that came from each fitness value in the original one. One interpretation, if one simply applies tournament selection, is that of the expected number of descendants that each member of the original population will leave. We modify the discussion in [Blickle and Thiele, 1996b], as giving a more usable (= better than L&P’s) interpretation and derivation.
Definition (Fitness Distribution). The function s assigns to each fitness value f ∈ R the number s(f) of individuals in a population P of size M, taken from the universe J, carrying the given fitness value; s is called the fitness distribution of P. Definition (Cumulative Fitness Distribution). Let n ≤ M be the number of distinct fitness values, and f_1 < f_2 < … < f_{n−1} < f_n the ordering of the fitness values. S(f_i) denotes the number of individuals in P with fitness value f_i or worse and is called the cumulative fitness distribution:

S(f_i) = Σ_{j=1..i} s(f_j), with S(f_0) = 0.
Theorem. Let s* denote the expected fitness distribution of the population obtained from P by tournament selection of size t ≥ 2. Then s*(f_i) = M ((S(f_i)/M)^t − (S(f_{i−1})/M)^t). Proof. We first calculate the expected number of individuals with fitness f_i or worse who will appear in the tournament-selected population; denote this number by S*(f_i). Observe that an individual with fitness f_i or worse can win a tournament only if all other individuals in the tournament have fitness f_i or worse. The probability that one individual has fitness f_i or worse is given by S(f_i)/M; the probability that all t of them have fitness f_i or worse is (S(f_i)/M)^t (we are doing selection with replacement). Since we are running M tournaments (picking one individual per tournament; tournaments are independent), S*(f_i) = M (S(f_i)/M)^t.
The expected number of individuals with fitness exactly f_i is then s*(f_i) = S*(f_i) − S*(f_{i−1}) = M ((S(f_i)/M)^t − (S(f_{i−1})/M)^t). This is somewhat less direct than the statement on p. 31 of our text. We can recover that statement as follows: with a population where no two individuals have the same fitness score, we have now computed the expected number of tournament wins (= children?) for individual i, E(z_i). Furthermore, if individuals in the original population are indexed by increasing fitness, the expected rank of individual i is exactly i, since the expected number of times it is chosen by tournament selection is M (i/M)^t − M ((i − 1)/M)^t > M ((i − 1)/M)^t − M ((i − 2)/M)^t > … (proof?). Thus: E(z_i) = M ((i/M)^t − ((i − 1)/M)^t).
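The theorem gives the expected post-selection distribution directly from the cumulative counts; a sketch (the toy counts are made up):

```python
def expected_tournament_distribution(s, t):
    """s: counts per fitness value, ordered worst to best; returns the
    expected counts s* after M size-t tournaments (with replacement)."""
    M = sum(s)
    out, prev = [], 0
    for c in s:
        cur = prev + c                                # cumulative count S(f_i)
        out.append(M * ((cur / M) ** t - (prev / M) ** t))
        prev = cur
    return out

star = expected_tournament_distribution([2, 3, 5], t=2)
print(star)       # approx. [0.4, 2.1, 7.5]: mass shifts toward high fitness
print(sum(star))  # approx. 10: the expected counts still sum to M
```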
The problem is that, in the absence of cross-over and mutation, the new population may well have multiple individuals with the same fitness score, so the construction cannot be repeated through another generation. If we add the assumption that M is large (M >> 1) and t is small compared to M (reasonable), we can simplify:

E(z_i) = M ((i/M)^t − ((i − 1)/M)^t) = M (i/M − (i − 1)/M) Σ_{j=0..t−1} (i/M)^j ((i − 1)/M)^(t−1−j) = (1/M)^(t−1) Σ_{j=0..t−1} i^j (i − 1)^(t−1−j),

and hence t ((i − 1)/M)^(t−1) ≤ E(z_i) ≤ t (i/M)^(t−1). At least for highly ranked individuals, who are going to contribute most to the next generation, the ratio (i − 1)/i is close to 1, and the small power to which it is raised leaves it close to 1.
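A quick check that the exact expectation really sits between the two bounds, and close to the upper one for highly ranked individuals:

```python
def exact_E_z(i, M, t):
    """Exact expected number of tournament wins for the rank-i individual."""
    return M * ((i / M) ** t - ((i - 1) / M) ** t)

def approx_E_z(i, M, t):
    """The simplified estimate t * (i/M)^(t-1)."""
    return t * (i / M) ** (t - 1)

M, t = 100, 3
for i in (90, 95, 100):
    lower = t * ((i - 1) / M) ** (t - 1)
    assert lower <= exact_E_z(i, M, t) <= approx_E_z(i, M, t)
print(exact_E_z(100, M, t), approx_E_z(100, M, t))  # approx. 2.9701 vs 3.0
```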
This attempts to justify the approximation E(z_i) ≈ t (i/M)^(t−1). The effect of tournament selection on gene frequency in a population can then be modeled by Price’s Theorem, under the assumption that the expected number of children of an individual, with mutation and cross-over, is the same as that predicted by the formula above: it depends on the original ranking of the individuals by fitness. This may all be perfectly justifiable; it just feels a little vague, at best.
One of the assumptions in Price’s Theorem is that the term Σ_i z_i Δq_i vanishes: the genetic drift is 0. This may be reasonable for large populations; it is less reasonable for small ones. It is also less reasonable under conditions where randomness in cross-over or mutation leads to a large number of non-viable offspring, requiring repair algorithms that introduce a further correlation between the current generation and the next one (you could throw the unviable ones away and retry: how expensive is determining whether something is unviable? how high is the probability of generating unviables?).
The contributions of schema creation to Holland’s Theorem. We are now ready for a slightly tighter form of Holland’s Schema Theorem. We will use the original notation.
Theorem.

E(m(H, t + 1)) ≥ m(H, t) · (f(H, t)/f̄(t)) · (1 − p_m)^o(H) · [1 − p_c · (L(H)/(l − 1)) · (1 − m(H, t) f(H, t)/(M f̄(t)))].

Proof. The first three terms in the product have already been explained. The fourth term appears more complex: it contains the terms accounting for the probability of disruption due to cross-over, but further modified. Why? The term L(H)/(l − 1) (the fragility of the schema) is simply the probability that the cross-over breaks the schema. On the other hand, if both parents match the schema, all their children will, regardless of where the break occurs.
We first compute the probability that, given one parent, the other parent will also match the schema H (at time t): m(H, t) f(H, t)/(M f̄(t)) is the probability that a string from H will be selected as the second parent (fitness-proportionate selection with replacement). Subtracting it from 1 gives the probability that we will select a second parent from a different schema. This improves the original estimate. The theorem extends without further work to any selection-with-replacement mechanism.
How can we extend these results to Genetic Programming? In GP we don’t usually have a bit-linear, fixed-length representation of a population of programs, so patterns have to be represented in somewhat different ways. In bit-linear fixed-length representations it is customary to define a schema in such a way that the position of a specified substring is also specified: #01#1 specifies that the string 01 starts at position 2, and that the string 1 starts at position 5. One could easily find equivalent representations (e.g., [(11, 1), (111, 4), (1, 8)] would correspond to 11#111#1). Even in bit-linear fixed-length representations, the absence of positional information would alter the meaning of a schema: [11, 111, 1] over binary strings of length 8 would “match” many more strings than [(11, 1), (111, 4), (1, 8)].
If we move to trees, where the notion of order is much less clear, we may have to deal with position-less schemata, and interpreting them becomes more complex (if it even remains meaningful). The 1990s saw several attempts at formalizing the Genetic Programming field, moving from the “just do it” of the s-expression-based LISPers to the more complex formalisms of languages with more complex syntax… (and LISP, too). Koza (a LISPer, 1992) started with simple non-positional schemata: lists of s-expressions that could be matched anywhere in the program tree. Ex.: H = [(+ 1 x), (* x y)] matches all those programs that have at least one occurrence of each expression (multiple occurrences are OK). Cross-over moves subtrees between parents and children.
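With programs encoded as nested tuples (our encoding: ('+', 1, 'x') for (+ 1 x)), Koza-style matching is a simple containment check over all subtrees:

```python
def subtrees(tree):
    """Yield every subtree of a program tree (nested tuples; leaves are atoms)."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:        # tree[0] is the operator
            yield from subtrees(child)

def matches_koza(schema, program):
    """True if every expression in the schema occurs somewhere in the program
    (multiple occurrences are OK, per Koza's definition)."""
    subs = list(subtrees(program))
    return all(expr in subs for expr in schema)

prog = ('*', ('+', 1, 'x'), ('*', 'x', 'y'))
H = [('+', 1, 'x'), ('*', 'x', 'y')]
print(matches_koza(H, prog))                # True
print(matches_koza([('+', 2, 'x')], prog))  # False
```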
Altenberg’s GP Schema Theory. In a 1994 paper, Altenberg used the idea schema = expression. Earlier, Koza, following the ideas from Genetic Algorithms, had introduced a schema as a collection of (possibly) multiple subexpressions (as we just saw), without being able to provide any theorems. Altenberg also assumed that no mutation would take place, that selection was fitness-proportionate, and that the population was large (infinite).
Some notation:
• f(i) = fitness of program i.
• P = the population; cardinality = M.
• m(j, t)/M = frequency of program j at generation t.
• f̄(t) = mean fitness of the programs in generation t, given by f̄(t) = Σ_j f(j) m(j, t)/M.
Note:
• S = space of all possible subexpressions extractable from P.
• C(j, s → i) = probability that inserting expression s in program j produces program i.
• P(s, k) = probability that cross-over picks up expression s in program k.
One of the interesting changes from Holland’s Schema Theorem is that the result is an equation: the work performed is somewhat harder (all ways in which a program can be created must be considered), but the result is sharper (still an expected value). Now the formula:

E(m(i, t + 1)/M) = (1 − p_c) (f(i)/f̄(t)) (m(i, t)/M) + p_c Σ_j Σ_k (f(j)/f̄(t)) (m(j, t)/M) (f(k)/f̄(t)) (m(k, t)/M) Σ_{s ∈ S} P(s, k) C(j, s → i).

Explanation. The first part provides the frequency of program i in the case when cross-over was not called for. The first term is the probability of no cross-over; the second term is the contribution of fitness-proportionate selection; the third term represents the frequency of program i in the current generation.
The second part, beyond containing the probability that cross-over occurs, must take into account all the other ways in which program i can be assembled. Given any two programs, k and j, from the population, with k being the “donor” and j being the “recipient” (note that the pairs (j, k) and (k, j) will both be looked at), the formula contains: in the first summation, the contribution of fitness-proportionate selection; in the second summation, the probability that program k contains an expression s that can alter j to make up i. We have really computed the “expected frequency” of program i in the next generation under cross-over.
If we replace the selection methodology, from fitness-proportionate to just p(h, t) (= the probability of selecting program h at generation t), we can rewrite the formula (as a proper expectation):

E(m(i, t + 1)/M) = (1 − p_c) p(i, t) + p_c Σ_j Σ_k p(j, t) p(k, t) Σ_{s ∈ S} P(s, k) C(j, s → i).

The textbook works its way through an example; the only problem is that the population of possible programs is very small: although all the individual computations are correct, the results may not be meaningful except as an example of what is happening, and an attempted elucidation of what would be necessary to implement such an estimate.
We have just computed the expected propagation of a program from one generation to the next. What about schemata? In Altenberg’s theory, a schema is just an expression; it need not be a full program. What do we do? First throw the random number generator to determine whether you have a cross-over or not. If you do not have a cross-over, then schema s will survive (with a certain frequency: number of copies over total population size) only in so far as a randomly chosen program i from the population contains s as a subexpression, and a random choice actually extracts s from i. The contribution from the non-cross-over part (fitness-proportionate reproduction) is

(1 − p_c) Σ_i C(i, s) (f(i)/f̄(t)) (m(i, t)/M).
If we define

q(s, t) = Σ_i C(i, s) m(i, t)/M,

where C(i, s) denotes the probability that we will extract s from i, we now have a formula for the expected probability that the “random dip” associated with a cross-over operation will result in the schema s being selected, i.e., for the frequency with which schema s is present in P at generation t. Note that C(i, s) was originally defined as the probability that cross-over will extract s from i, but this is nothing but a “random extraction”. We also have a formula for the schema fitness in this generation:

f(s, t) = Σ_i C(i, s) (m(i, t)/M) f(i) / q(s, t).

Joining these formulae, we have the non-cross-over term

(1 − p_c) (f(s, t)/f̄(t)) q(s, t).
The cross-over term must now be

p_c Σ_i C(i, s) Σ_j Σ_k (f(j)/f̄(t)) (m(j, t)/M) (f(k)/f̄(t)) (m(k, t)/M) Σ_{s′ ∈ S} P(s′, k) C(j, s′ → i),

since we are not concerned only with how a program i comes into being in the next generation, but with how schema s can appear in any program in the next generation. Adding the two terms,

E(q(s, t + 1)) = (1 − p_c) (f(s, t)/f̄(t)) q(s, t) + p_c Σ_i C(i, s) Σ_j Σ_k (f(j)/f̄(t)) (m(j, t)/M) (f(k)/f̄(t)) (m(k, t)/M) Σ_{s′ ∈ S} P(s′, k) C(j, s′ → i).
When we generalize the formula to other selection methods (and to other, finite, populations):

E(q(s, t + 1)) = (1 − p_c) Σ_i C(i, s) p(i, t) + p_c Σ_i C(i, s) Σ_j Σ_k p(j, t) p(k, t) Σ_{s′ ∈ S} P(s′, k) C(j, s′ → i).

Note: no notions of order or of defining length have been introduced or used. There are no “simple” notions to attach our theory to.
O’Reilly’s Schema Theory. This was an attempt at re-introducing the notions of order and defining length. O’Reilly defined a schema as a multiset (= bag, an unordered collection with repetitions) of subtrees and tree fragments. Tree fragments are trees with at least one leaf replaced by the “don’t care” symbol, which can be matched by any subtree, including those with just one node. The schema [(+ # x), (* x y), (* x y)] represents all the programs including at least one occurrence of (+ # x), a tree fragment, and at least two occurrences of (* x y), a tree. Note that this definition gives us only the defining components of a schema, but not their positions: there are many ways in which a schema can be instantiated by the same program.
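O’Reilly-style matching adds two ingredients to the Koza check: '#' leaves that match any subtree, and multiset counting. A sketch (our encoding again: nested tuples), under the simplifying assumption that each distinct fragment’s occurrences are counted independently:

```python
from collections import Counter

def subtrees(tree):
    """Yield every subtree of a program tree (nested tuples; leaves are atoms)."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)

def fragment_matches(frag, tree):
    """'#' matches any subtree; tuples must match node by node."""
    if frag == '#':
        return True
    if isinstance(frag, tuple):
        return (isinstance(tree, tuple) and len(frag) == len(tree)
                and all(fragment_matches(f, t) for f, t in zip(frag, tree)))
    return frag == tree

def matches_oreilly(schema, program):
    """schema: a multiset (list with repetitions) of fragments; each fragment
    needs at least as many matching subtrees as its multiplicity."""
    subs = list(subtrees(program))
    need = Counter(schema)
    return all(sum(fragment_matches(f, u) for u in subs) >= k
               for f, k in need.items())

# The schema [(+ # x), (* x y), (* x y)] from the text:
H = [('+', '#', 'x'), ('*', 'x', 'y'), ('*', 'x', 'y')]
prog = ('+', ('*', ('*', 'x', 'y'), ('*', 'x', 'y')), 'x')
print(matches_oreilly(H, prog))                         # True
print(matches_oreilly(H, ('+', ('*', 'x', 'y'), 'x')))  # False: one (* x y) only
```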