Coevolving Solutions to the Shortest Common Superstring Problem

Coevolving Solutions to the Shortest Common Superstring Problem Assaf Zaritsky & Moshe Sipper Ben-Gurion University, Israel www.cs.bgu.ac.il/~assafza

Outline • The “Shortest Common Superstring” problem. • DNA sequencing and the input domain. • Standard and cooperative coevolutionary genetic algorithm (GA). • The Puzzle approach. • Conclusions and future work. • Messy Puzzle.

The Shortest Common Superstring Problem (SCS) • Let S = {s1,…,sn} be a set of strings (blocks) over some alphabet Σ. A superstring of S is a string x such that each si in S is a substring of x. • Problem: Find shortest (common) superstring. • NP-Complete. • MAX-SNP hard. • Motivation: DNA sequencing, data compression.

SCS: Example • S = {ate, half, lethal, alpha, alfalfa} • A trivial superstring is “atehalflethalalphaalfalfa” of length 25 (a simple concatenation of all blocks). • A shortest common superstring is “lethalphalfalfate” of length 17. • Note that a “compressed” permutation of the blocks is actually a superstring.

Approximation Algorithms • Several linear approximations for SCS have been proposed, most of which rely on greedy approaches. • GREEDY The most widely heuristic used in DNA sequencing. • Conjecture [Blum 1994, Sweedyk 1999]: Superstring produced by GREEDY is of length at most two times the optimal. • We are not aware of any previous evolutionary approach to the SCS problem.

DNA Sequencing The most common usage of the SCS problem.

DNA Sequencing (cont’d) • The problem: “read” a string of DNA. • Short DNA strands can be read in laboratory. • To sequence a long DNA strand: (The DNA sequence appears in many copies) • Cut the DNA to short fragments using restriction enzymes. • Sequence each of the resulting fragments. • Order those fragments using a SCS algorithm.

The Input Domain The input strings used in the experiments were inspired by DNA sequencing:

Input Generation Setup: Parameters NB: increasing number of blocks results in exponential growth of the problem’s complexity.

Simple Genetic Algorithm produce an initial population of individuals evaluate fitness of all individuals while termination condition not met do select fitter individuals for reproduction recombine individuals mutate individuals evaluate fitness of modified individuals generate a new population end while

Simple GA for the SCS Problem • Given a set of strings as input, generate initial population of random candidate solutions. • The fitness of each individual depends on its length and accuracy. • The GA uses selection, recombination, and mutation to create the next generation, each individual of which is then evaluated. • Theses steps are repeated a predefined number of times or until the solution is deemed satisfactory.

Simple GA for the SCS Problem (cont’d) • Blocks of the input set are atomic components. • Representation: An individual’s genome is represented as a sequence of blocks. • An individual may have missing blocks or contain duplicate copies of the same block. • Permutation Representation: Good or Bad?

Simple GA for the SCS Problem (cont’d) • Evaluation: fitness of an individual is the length of it’s compressed genome + the total length of the blocks that are not covered by the individual. • Genetic operators: • Fitness proportionate selection. • Two-points recombination. Allows growth and reduction in genome’s length. • Block-change mutation.

Simple GA for the SCS Problem (example) • S = {s1,s2,s3,s4}; s1 = 0011, s2 = 1100, s3 = 1001, s4 = 111. • Fitness (< s2,s1>) = |110011| + |111| = 6 + 3 = 9. • Fitness (< s4,s2,s1,s4>) = |11100111| = 8. • Recombination: • p1 = <s1,|s2,s3|,s4> • p2 = <s4,|s1,s3,s2|> • p3 = recombine1(p1,p2) = <s1,s1,s3,s2,s4> • p4 = recombine2(p1,p2) = <s4,s2,s3> • mutate (<s1,s2,s2>) = <s1,s4,s2>

Coevolution • Simultaneous evolution of two or more species with coupled fitness. • Coevolving species either compete or cooperate. • Competitive coevolution: Fitness of individual based on direct competition with individuals of other species, which in turn evolve separately in their own populations (“prey-predator”).

Cooperative Coevolution

Cooperative Coevolution (cont’d) • Cooperative Coevolution involves a number of independently evolving species. • Interaction between species occurs via fitness function only. • The fitness of an individual depends on its ability to collaborate with individuals from other species.

Cooperative Coevolution (cont’d) Source: Potter & DeJong (1997)

Cooperative Coevolutionary Algorithm for the SCS Problem • Two species evolve simultaneously. • First species contains prefixes of candidate solutions to the SCS problem at hand. • Second species contains candidate suffixes. • Fitness of an individual in each species depends on how good it interacts with representatives from other species to construct a global solution.

Suffix Representative Individual Prefixes population Suffixes population Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process) Merge

Fitness Prefixes population Suffixes population Cooperative Coevolutionary Algorithm for the SCS Problem (evaluation process) Evaluate

Experiments Compare: GREEDY, Standard GA, Cooperative Coevolution

Experimental Setup Each type of GA was executed twice on each problem instance; the better run of the two was used for statistical purposes.

Results: Experiment I (~50 blocks)

Results: Experiment II (~80 blocks)

Results: Summary Average of the best superstring lengths Algorithm Problem size GREEDY Genetic Cooperative 50 blocks 80 blocks

Conclusion: The collaboration between the two populations results in a good decomposition of the problem into two smaller sub-problems, each is solved using a standard GA.

The Puzzle Algorithm

The Schema Theorem “Short, low-order, above-average schemata receive exponentially increasing trials in subsequent generations of a genetic algorithm.” Holland (1975)

Building Blocks Hypothesis “A genetic algorithm seeks near-optimal performance through the juxtaposition of short, low-order, high-performance schemata, called the building blocks.”

Our Interpretation “The success of GAs stems from their ability to combine quality sub-solutions (building blocks) from separate individuals in order to form better global solutions.”

The Main Assumption Problems in nature have an inherent structural design. Even when the structure is not known explicitly GAs detect it implicitly and gradually enhance good building blocks.

A Problem Recombination may destroy quality building blocks found by the GA.

Brain Appearance 0010101010101010101000011110100010000 Example

1. Smart (assumable) 2. Blond But not very beautiful… Example (con’t) Brain Appearance 0010101010101010101000011110100010000

The Preservation of Favoured Building Blocks in the Struggle for Fitness: The Puzzle Algorithm

Puzzle Algorithm: The Idea • Improve Recombination Operator. • Preserve good building blocks discovered by GA using selection of recombination loci that do not destroy good building blocks. • Result: Assembly of good building blocks to construct better solutions (as in a puzzle).

Building blocks population Candidate solutions population Puzzle Algorithm (cont’d) • Two populations: 1. Candidate solutions: As in simple GA. 2. Building blocks: Each individual is a sequence of blocks contained in at least one candidate solution.

Building blocks population Candidate solutions population Puzzle Algorithm (cont’d) • Interaction between candidate solutions and building blocks is through fitness function. • Interaction between building blocks and candidate solutions is through constraints on recombination points. Fitness evaluation Crossover location

Candidate solutions population Building blocks population Fitness evaluation each individual is a sequence of blocks Crossover location Puzzle Algorithm: Zoom In

Candidate solutions population Building blocks population Fitness evaluation each building block is contained in at least one individual in the solutions population Crossover location overlapping building blocks Puzzle Algorithm: Zoom In

Fitness evaluation Building blocks population Candidate solutions population Crossover location The Candidate Solutions Population • Representation, fitness evaluation, selection, and mutation are identical to the simple GA. • Recombination-aid vector aids in selecting the recombination loci. • Recombination-aid vector is updated by building blocks individuals.

Fitness evaluation Building blocks population Candidate solutions population Crossover location The Building Blocks Population • An individual is represented as a sequence of blocks, contained in at least one candidate solution. • Fitness of an individual is the average of the fitness of candidate solutions containing it. • Fitness-proportionate selection.

Fitness evaluation Building blocks population Candidate solutions population Crossover location The Building Blocks Population (con’t) • “Unisex” individuals. • Two modification operators: • Expansion: Increase it’s genome by one block. Occurs with high probability. • Exploration: “Die”, and start over as a new 2-block individual. Occurs with low probability.

Candidate solutions population Building blocks population Building Blocks – Candidate Solutions Fitness evaluation f1 f2 f3 f4

Candidate solutions population Building blocks population f3 f2 f1 f1 f2 f3 f4 Building Blocks – Candidate Solutions Fitness evaluation f1 f2 f3 f4 Update “recombination-aid” vector

Recombination-aid vector 0 0 0 0 0 0 0 Solution’s genome building block #1 fitness = 0.3 building block #2 fitness = 0.4 building block #3 fitness = 0.6 Update Recombination-aid vector

Coevolving Solutions to the Shortest Common Superstring Problem