BioInformatics Consultation Practice 4 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666 Pogány, Hungary Tel: +36-309-015-488 E-mail: pauler@t-online.hu
Content of the Practice
• Synergistic effects between Molecular Genetics and Decision Theory:
  • Effects of Molecular Genetics on Decision Theory: evolution is more than biology
    • Optimization problems: example OP
    • Solution options: trial and error
    • Grouping of optimization methods:
      • Global Optimization (GO)
        • Analytic optimization
        • Gradient algorithms
        • Simulated annealing
      • Mathematical programming
        • Linear programming (LP)
      • Problems of traditional optimization methods
    • How Molecular Genetics helps: Simple Genetic Algorithm (SGA):
      • Chromosome encoding
      • Generation cycles: degree of fitness, mating, recombination, mutation, reproduction
      • Termination
      • Evaluation of SGA
  • How Decision Theory supports Molecular Genetics:
    • New insights into "Why does evolution work?"
      • Proof against the mainstay of anti-evolutionism
    • New insights into "Why does evolution work this way?"
      • Why a triplet of nucleotides codes an amino acid: Schema Theory
      • The rationale behind apoptosis: pros and cons of elitism
      • The rationale behind degeneration: sensitivity analysis
      • Role of mutation and recombination points
      • Role of genders, species and diploid organisms
      • Role of transposons and gene/exon amplification
    • Supporting bioinformatic methods with effective optimization
• Home Assignment 4: Play with GA Playground
• References
Synergistic effects between Molecular Genetics and Decision Theory
[Figure: goal function plotted over two decision variables, Temperature and Pressure]
• Effects of Molecular Genetics on Decision Theory: evolution is more than biology
  • Applying the basic concepts of biological evolution and DNA replication led to a breakthrough in a seemingly distant field of science: Genetic Optimization Algorithms (Genetikus Optimalizáló Algoritmusok) were created in the Optimization (Optimalizáció) area of Mathematical Decision Theory (Matematikai döntéselmélet). Nice, but why does it matter to us?
  • Because the more complex bioinformatic problems described later, e.g.:
    • Gene Search (Génkeresés)
    • Phylogenetic Trees (Filogenikus fák)
    • Protein Structure Analysis (Fehérje térszerkezet elemzés)
    all rely heavily on mathematical optimization with huge computational requirements, which stretch classic optimization methods to their limits
• Optimization Problems, OP (Optimalizációs Problémák):
  • Basic definitions:
    • The goal of an activity is described by a Goal function (Célfüggvény) of Continuous (Folytonos) (e.g. temperature, pressure) or Discrete (Diszkrét) (e.g. nucleotide position) Decision Variables (Döntési változók)
    • Not every possible combination of these variable values, called a Solution (Megoldás), is allowed: there can be Constraints (Korlátozó feltételek) based on physical, technical, social, etc. restrictions in the problem (e.g. a truck cannot be split in half at a road crossing, a bank account balance cannot go negative, an exon cannot precede the promoter of a gene)
    • Only a tiny subset of all solutions, called Feasible solutions (Megvalósítható megoldás), complies with the constraints
    • Only a tiny subset of the feasible solutions is an Optimal Solution (Optimális megoldás), where the value of the goal function is minimal (or maximal)
    • So it is all about finding a needle in a haystack!
Effect of Molecular Genetics on Decision Theory: Optimization problems: Examples
• This may look like incomprehensible math jargon that makes biologists as queasy as drinking beer on top of vodka
• However, we all solve complex mathematical optimization problems every day; we just do not know it:
  • Which items should I leave at home before an excursion if my knapsack (Hátizsák) has limited capacity and I want the load with maximal utility? (Discrete binary decision variables: put in / leave out)
  • Should I give a premature baby (Koraszülött) more oxygen, risking blindness, or less, risking suffocation? (Continuous decision variable: oxygen level.) Optimization is NOT a perverse hobby of mathematicians: life and death depend on it!
  • In Gene Search (Génkeresés) we look for the optimal Alignment (Illeszkedés) of Expressed Sequence Tags (EST) in a Contig sequence, such that the alignment also complies with Gene Structure (Gene Struktúra). The contig is riddled with Single Nucleotide Polymorphisms (SNP) as well as insertions and deletions, and it can additionally contain inversions and transpositions. Therefore it is NOT simple sequence matching, but an optimization problem with discrete variables (start/end nucleotide positions of the ESTs)!
Sequence fragment shown on the slide: TCCTTTAAATCCCTTACATGATCTGAGTTCAGACCGGCGTGAGCCAGGTCGGTTTCT
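The knapsack example can be sketched by brute force to show how solutions, feasible solutions and the optimal solution relate. The item names, weights, utility scores and capacity below are invented purely for illustration; they are not from the practice material.

```python
from itertools import product

# Hypothetical items: (name, weight in kg, utility score); values are illustrative only.
ITEMS = [("tent", 4, 10), ("camera", 1, 6), ("stove", 2, 5), ("books", 3, 2)]
CAPACITY = 6  # assumed knapsack capacity in kg


def evaluate(solution):
    """Return (total weight, total utility) of a 0/1 packing decision."""
    weight = sum(w for bit, (_, w, _) in zip(solution, ITEMS) if bit)
    utility = sum(u for bit, (_, _, u) in zip(solution, ITEMS) if bit)
    return weight, utility


# Every combination of the binary decision variables is a solution.
solutions = list(product([0, 1], repeat=len(ITEMS)))
# Feasible solutions comply with the capacity constraint.
feasible = [s for s in solutions if evaluate(s)[0] <= CAPACITY]
# The optimal solution maximizes the goal function among the feasible ones.
best = max(feasible, key=lambda s: evaluate(s)[1])

print(len(solutions), "solutions,", len(feasible), "feasible")
print("best packing:", [name for bit, (name, _, _) in zip(best, ITEMS) if bit])
```

Full enumeration works here only because four binary variables give just 16 combinations; each additional item doubles that number.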
Effect of Molecular Genetics: Optimization problems: Solution Options
• The man in the street (Az utca embere) usually solves optimization problems by trial and error (Próbálgatás):
  • It can be quite effective: mathematical optimization methods will not give some "extraterrestrially better" solution than the one a motivated, intelligent expert reaches after many tries.
  • The main factors are time and cost: if the number of decision variables is high and the goal function is complex (as in reality), the number of possible solutions undergoes Combinatorial Explosion (Kom.robbanás)
• Here is a simple but tricky example:
  • On a wooden drawing board
  • We hammer in 15 nails at random positions
  • And try to connect them into the shortest possible closed loop with a rope. How long would it take to find the optimal solution?
  • The nails can be ordered in 15×14×13×…×3×2×1 = 15! ways (counting every starting nail and both directions separately)
  • If we try one version every 30 seconds, it takes about 1,243,982 years and 5 months
  • A fast computer could check one solution in 1/10000 of a second, so it would need only about 4 years and 2 months
• This game is called the Traveling Salesman Problem (TSP). It models the optimal routing of trucks delivering goods from a warehouse to stores, and it has to be solved for 30-50 stores by the next day…
• But we are biologists, so who cares about routing trucks? Here is the bad news: TSP is analogous to gene search, where nails = ESTs, board = contig, rope = gene structure
• You have no chance of solving it by trial and error
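The arithmetic behind the nail example can be reproduced in a few lines. This is only a back-of-the-envelope sketch; it assumes a 365-day year, which is what the slide's figures appear to imply. It also computes the number of genuinely distinct closed loops, (n-1)!/2, as a side note that is not on the slide.

```python
import math

orderings = math.factorial(15)        # 15! orderings of the nails
SECONDS_PER_YEAR = 365 * 24 * 3600    # assumption: 365-day year

# One try every 30 seconds by hand, or 10,000 tries per second by computer.
by_hand_years = orderings * 30 / SECONDS_PER_YEAR
by_computer_years = orderings / 10_000 / SECONDS_PER_YEAR

# Fixing the starting nail and ignoring direction leaves (n-1)!/2 distinct loops.
distinct_tours = math.factorial(14) // 2

print(f"{orderings:,} orderings")
print(f"by hand: about {by_hand_years:,.0f} years")
print(f"by computer: about {by_computer_years:.2f} years")
print(f"{distinct_tours:,} distinct closed tours")
```

Even after removing the 30-fold double counting, tens of billions of tours remain, and every extra nail multiplies the count again, so the conclusion of the slide is unchanged.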
Effect of Molecular Genetics: Optimization problems: Grouping of Optimization Methods
• Global Optimization (GO):
  • These methods handle only a single goal function and no constraints. However, violated constraints can be sanctioned with a weighted Penalty (punishment) term (Büntetés tag) that is subtracted from the goal function value, which forces the algorithm to search among feasible solutions
  • So GO methods can be used for a wide range of optimization problems, and they are used extensively in bioinformatics. We show them in order of discovery:
• Analytic Optimization (Analitikus optimalizáció): it computes the optimal solution directly:
  • It is based on some really nasty math: Partial Derivatives of Multivariate Functions (Több változós függvények parciális deriválása) and computing the Eigenvectors/eigenvalues of the Hessian Matrix of second-order partial derivatives (A másodrendű parciális deriváltak Hesse-mátrixának saját vektorai és sajátértékei). We do not see any of this, because the software handles it automatically. This is generally true in optimization: the nastier the math background, the simpler the software is to use. A simple, understandable math background results in difficult software with many manual settings to experiment with.
  • It has a relatively low Computational Requirement (Számolásigény): it is very fast
  • It has exact Termination Criteria (Terminációs kritérium): we can calculate in advance how long it will run and when it reaches the optimal solution
  • It is Optimal (Optimális): there is a mathematical proof that it finds the optimal solution
  • Its major disadvantage is that it works only with Continuous functions (Folytonos függvények) without Breakpoints (Töréspontok):
    • Therefore it has difficulty incorporating a penalty that jumps up suddenly when a constraint is violated
    • As nucleotide positions are discrete variables, it is useless in bioinformatics
Optimization problems: Grouping of Optimization Methods: Global Optimization (GO) 1
[Figure: alignment goal function plotted over the Start Pos and Stop Pos decision variables, with a cartoon climber]
• Hill Climbing / Gradient Ascent Algorithm (Hegymászó algoritmus):
  • It works like a blind man climbing a hill: he cannot see the peak, but he can feel around himself to sense in which direction the terrain slopes upward, move a little in that direction, and feel around again:
    • It starts from a random solution, which is probably bad and may not even be feasible
    • It increases and decreases every decision variable by a small unit and observes the direction in which the goal function (with the constraint-violation penalty incorporated) improves most steeply. This direction is called the Gradient Vector
    • It jumps in the gradient direction and probes again, cycle after cycle, until it cannot improve the goal function for a number of cycles; then it stops
    • The jumps shrink from cycle to cycle to avoid jumping over the optimal solution
  • Simple, with a relatively low computational requirement
  • It can get stuck on secondary peaks of Multimodal (Többcsúcsú) goal functions
  • It does not work well with discrete variables, which produce "stepped" goal functions: it slows down on the "plateaus"
  • No exact termination: it wastes many cycles just to figure out that it is time to stop
  • Non-optimal Heuristic (Heurisztika): the optimum is found only with a certain probability; there is no mathematical proof that it will be found
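The "blind climber" can be sketched as follows. This is a probing variant of hill climbing (it tests the four axis directions instead of computing an exact gradient), and the single-peak goal function and the starting point are assumptions chosen only so that the behaviour is easy to follow.

```python
import math


def goal(x, y):
    # Hypothetical smooth goal function with a single peak at (10, 10).
    return x * math.exp(-x / 10) + y * math.exp(-y / 10)


def hill_climb(x, y, step=4.0, min_step=1e-4):
    """Probe the four axis directions; move to the best one, else halve the step."""
    while step > min_step:
        candidates = [(x + step, y), (x - step, y), (x, y + step), (x, y - step)]
        best = max(candidates, key=lambda p: goal(*p))
        if goal(*best) > goal(x, y):
            x, y = best      # move in the steepest improving direction found
        else:
            step /= 2        # shrink the jumps to avoid jumping over the peak
    return x, y


x, y = hill_climb(1.0, 25.0)
print(round(x, 2), round(y, 2), round(goal(x, y), 3))
```

On a goal function with several peaks the same routine would stop on whichever peak is uphill from the starting point, which is the weakness listed above.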
Optimization problems: Grouping of Optimization Methods: Global Optimization (GO) 2
• Multistart Gradient (Többszörös indítású):
  • It is the same as the previous method, but it launches several Search points (Keresőpont), simultaneously or one after another, from different random start positions. One of them may turn out well…
• Simulated Annealing (Szimulált hűtés):
  • The scientific-sounding name should not mislead anybody: it is basically the same as hill climbing, except that in each cycle it adds a continuously decreasing random component to the gradient vector, applying the folk rule "if it gets stuck, shake it".
  • It is like a drunken blind man climbing the hill: at the beginning he drifts randomly, which prevents him from getting stuck on minor sub-peaks
  • As he sobers up and moves more slowly and more purposefully, there is a considerable chance that he finds the highest peak and does not fall off it
  • The method is named after annealing (Kihűlő): in a slowly cooling material the random motion of the particles decreases and they settle into low-energy positions
  • Simple and relatively fast
  • It is extremely hard to find the best compromise for the Annealing rate (Hűlési ráta), one that prevents getting stuck but still allows peaks to be captured
  • Often this is impossible, because in real problems the surface of the goal function is broken up by valleys created by the penalties for violated constraints: if the random component is strong enough to jump across to another peak, it is too strong to settle on that peak
  • No exact termination
  • Non-optimal heuristic
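A minimal sketch of the "shake it" idea as the slide describes it: a gradient step plus a random component whose strength decays every cycle. Note that this is not the textbook acceptance-probability formulation of simulated annealing. The two-peak goal function, the parameter values and the seed are all assumptions made for illustration.

```python
import math
import random


def goal(x):
    # Hypothetical two-peak goal: minor peak near x=2, major peak near x=8.
    return math.exp(-(x - 2) ** 2) + 2 * math.exp(-((x - 8) ** 2) / 4)


def slope(x, h=1e-5):
    """Numerical derivative of the goal function (central difference)."""
    return (goal(x + h) - goal(x - h)) / (2 * h)


def noisy_climb(x, sigma=3.0, cooling=0.97, rate=0.5, cycles=400, seed=1):
    """Gradient step plus a decaying random shake; returns the best point seen."""
    rng = random.Random(seed)
    best_x = x
    for _ in range(cycles):
        x += rate * slope(x) + rng.gauss(0, sigma)  # climb and shake
        x = min(12.0, max(0.0, x))                  # stay inside the search range
        sigma *= cooling                            # annealing rate: shake less each cycle
        if goal(x) > goal(best_x):
            best_x = x
    return best_x


start = 1.0
best = noisy_climb(start)
print(round(best, 2), round(goal(best), 3))
```

Changing `sigma` and `cooling` shows the tuning difficulty mentioned above: too little shaking leaves the point on the minor peak, while shaking that decays too slowly keeps throwing it off the major one.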
Optimization problems: Grouping of Optimization Methods: Mathematical Programming
[Figure: LP feasible region over x1 and x2, split into branched sub-problems]
• To avoid the bumpy crash landing in the Grand Canyon seen with simulated annealing, this group of methods handles both the goal function and the constraints directly, without a penalty
• They have nothing to do with computer programming; they received their name from computing the optimal production program of factories
• Linear Programming, LP (Lineáris programozás):
  • Basically it can handle a linear goal function and linear constraints with continuous variables, which are infrequent in real problems
  • It is based on a nasty piece of math called the Simplex Algorithm (Szimplex algoritmus)
  • Relatively fast, Optimal, with exact termination!
  • Most types of non-linear functions can be linearized, mostly with the help of discrete variables
  • It can handle discrete variables with an auxiliary method called Branch and Bound, B&B (Korlátozás-szétválasztás)
  • The really big problem is that with B&B the computational requirement explodes, roughly 20 times or more (e.g. major LP software can handle 100K-1M continuous variables, but only 300-5000 discrete ones). So it could solve a real-sized TSP with a mathematical proof of optimality, just by next week instead of tomorrow morning!
• Summary of traditional optimization methods: we are in serious trouble with them when 3 problems occur together:
  • Discrete decision variables (e.g. nucleotide positions)
  • Non-linear, non-linearizable goal function and constraints (e.g. primer melting, primer complementarity rate)
  • A highly multimodal goal function (e.g. in fragment assembly the end of one fragment matches the beginning of several other fragments)
  Unfortunately, all three are present in most real problems!
Effect of Molecular Genetics on Decision Theory: How does the Genetic Algorithm help?
• Definition of the Simple Genetic Algorithm, SGA (Egyszerű Genetikus algoritmus):
  • A Global Optimization heuristic based on simplified principles of biological evolution and DNA replication
• To give an example of its efficiency, let us use SGA to solve a TSP connecting the capitals of the 48 states of the continental USA by the shortest route:
  • The optimal solution, 3189 miles (cross-checked by other methods), was found after evaluating only 90572 route variations, which took 180 seconds on an old 1 GHz laptop…
• How on earth is this possible?
  • Surprise: we can breed not only nice cats, dogs or wheat. A "nice" hull (hajótest) form for a tanker ship, the optimal route of a truck, or the optimal form of a gas-turbine blade (Gázturbina-lapát) can also be bred!
  • This does not mean that a handsome male turbine blade seduces a pretty female turbine blade and soon they have baby turbine blades mixing the forms of their parents. Of course, this is impossible in reality…
  • But it is exactly what SGA simulates!
• We need the following things to do it:
Effects of Molecular Genetics: Simple Genetic Algorithm: Basic terms
[Figure: turbine blade, 25 cm long with a 31° angle of incidence]
• Example of a genetic algorithm:
  • We will use SGASimulator.xls as a mini-example of how a genetic algorithm works. It is about our favourite turbine blades, described by 2 variables:
    • Lng: length, cm (continuous variable)
    • Deg: angle of incidence, degrees (continuous variable)
  • Let us assume that we have to maximize the efficiency of the blades, described by a nasty non-linear goal function: $\mathrm{Efficiency} = Lng \cdot e^{-Lng/10} + Deg \cdot e^{-Deg/10} - 4 \rightarrow \max$
• Coding the variable values of the search population into chromosomes:
  • The optimum is searched by a population (Populáció) of 10 search points (Keresőpont). They are possible solutions, represented as Binary Chromosomes (Bináris kromoszóma) consisting of Genes (Gén) that hold binary-coded variable values; the bit positions are called Loci (Lokuszok). E.g. a possible turbine blade with a length of 25 cm and an angle of incidence of 31° is represented in the search population by one chromosome holding the two binary-coded genes (shown in a figure on the original slide)
• There are considerable differences from biology:
  • The size of the search population is fixed: roughly 5..50 times the number of decision variables
  • Coding is binary instead of 4-valued nucleotide positions (A, T, C, G). This is not just because computers use binary data storage; it has a more important reason, described later
  • Instead of triplet codons, variables can be coded with different numbers of bits, depending on their Domain (Értékkészlet) and Range (Értékhatár):
    • Binary variables are coded in 1 bit (e.g. 0: leave out of the knapsack, 1: put in)
    • Discrete-valued variables are coded in several bits (16, 8, 4, 2, 1) (e.g. number of trucks)
    • Originally continuous variables are also coded as discrete ones, using fraction-valued bits (1/2, 1/4, 1/8, 1/16…) to represent values with a suitably fine Resolution (Felbontás) (e.g. temperature in 1/16 °C)
  • The binary chromosome contains only coding parts, like cDNA
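The chromosome encoding can be sketched as below. The slide does not state the gene width; 5 bits per variable (values 0..31 at 1-unit resolution) is an assumption, and the function names are hypothetical rather than taken from SGASimulator.xls. Only the goal function itself comes from the slide.

```python
import math

BITS = 5  # assumption: 5 bits per variable, i.e. integer values 0..31


def encode(lng, deg):
    """Concatenate the two binary-coded genes into one chromosome string."""
    return format(lng, f"0{BITS}b") + format(deg, f"0{BITS}b")


def decode(chromosome):
    """Split the chromosome into its genes and translate them back to integers."""
    return int(chromosome[:BITS], 2), int(chromosome[BITS:], 2)


def efficiency(lng, deg):
    """Goal function from the slide: Lng*e^(-Lng/10) + Deg*e^(-Deg/10) - 4."""
    return lng * math.exp(-lng / 10) + deg * math.exp(-deg / 10) - 4


chromosome = encode(25, 31)
print(chromosome)                      # 1100111111
print(decode(chromosome))              # (25, 31)
print(round(efficiency(*decode(chromosome)), 3))
```

Under this assumed width the blade of 25 cm and 31° becomes the 10-bit string 11001 followed by 11111. Each term of the form x·e^(−x/10) peaks at x = 10, so the best possible blade in this toy problem is Lng = 10, Deg = 10.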
Simple Genetic Algorithm: Steps of Generation Cycle 1
• Initialization: the search population now contains 10 search points, whose variables are Initialized (Kezdőérték) with uniformly distributed (Egyenletes eloszlású) random values within their range. Then the following operations are repeated over several Generations (Generáció):
• Degree of Fit (Élőhelyhez illeszkedés): the goal function value of every search point in the population is computed by translating the binary chromosomes into variable values and feeding them into the goal function. This takes 99.9% of the computational requirement.
• Computing the Mating Probabilities (Párosodási valószínűség) of the search points: they are proportional to the degree of fit. The search point with the lowest degree of fit gets a mating probability of 0, while the probabilities of the whole population sum to 1
• Mating (Párosodás): the 10 search points are mated into 5 pairs according to their mating probabilities: the higher the probability, the more matings a search point takes part in. It can even be Self-mated (Önpárzás), as there are no genders yet, so the members of a pair are simply called A and B.
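One way to read the mating rule is sketched below: shift the fitness values so the worst point gets probability 0, normalize them to sum to 1, then draw the pairs with replacement. The function names and the fitness values are hypothetical, and the uniform fallback for a population of equally fit points is an added assumption that the slide does not discuss.

```python
import random


def mating_probabilities(fitness):
    """Shift so the worst point gets 0, then normalize so the sum is 1."""
    worst = min(fitness)
    shifted = [f - worst for f in fitness]
    total = sum(shifted)
    if total == 0:  # assumption: all points equally fit, fall back to uniform
        return [1 / len(fitness)] * len(fitness)
    return [s / total for s in shifted]


def select_pairs(population, probabilities, rng):
    """Draw len(population)//2 pairs (A, B); a point may pair with itself."""
    n_pairs = len(population) // 2
    picks = rng.choices(population, weights=probabilities, k=2 * n_pairs)
    return list(zip(picks[0::2], picks[1::2]))


fitness = [3.0, 1.0, 2.0, 5.0]  # illustrative degree-of-fit values
probs = mating_probabilities(fitness)
pairs = select_pairs(["p0", "p1", "p2", "p3"], probs, random.Random(0))
print(probs)
print(pairs)
```

Because the draw is with replacement, a very fit point can appear in several pairs and even on both sides of one pair, which corresponds to the self-mating mentioned above.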
Simple Genetic Algorithm: Steps of Generation Cycle 2
• Recombination/Crossover (Kromoszómák rekombinációja):
  • In every pair we cut the homologous (Homolog) chromosomes of the 2 members at 1 randomly selected Recombination/Crossover point (Rekombinációs pont), and the chromosome content after the cut point is swapped between the 2 members
  • The cut point is different for every pair, and a cut can be made not only at the border of genes: every position has an equal chance of becoming the cut point
  • So the genetic material of the mated pairs is randomly mixed
• Replication (Replikáció):
  • From every mating, 2 descendant search points with mixed genetic content are born into the next generation
  • The parents are destroyed to keep the population size (and memory usage) fixed
• Mutation (Mutáció):
  • In the genome of the new generation, 0.1..0.5% of the bits are reset to a value selected randomly from {0,1} with uniform distribution
• Then the whole cycle repeats in the new generation
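One-point crossover and mutation can be sketched on chromosome strings as follows. The parent strings, the seed and the function names are illustrative assumptions; the 0.5% mutation rate is the upper end of the range given on the slide.

```python
import random


def crossover(a, b, rng):
    """One-point crossover: swap everything after a random cut point."""
    cut = rng.randrange(1, len(a))  # any interior position, not only gene borders
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(chromosome, rate, rng):
    """With probability `rate`, reset a bit to a uniformly random 0/1 value."""
    return "".join(
        rng.choice("01") if rng.random() < rate else bit for bit in chromosome
    )


rng = random.Random(42)
parent_a, parent_b = "1100111111", "0000100010"
child_a, child_b = crossover(parent_a, parent_b, rng)
print(child_a, child_b)
print(mutate(child_a, 0.005, rng))
```

Note that a "mutated" bit is redrawn rather than flipped, as on the slide, so about half of the selected bits keep their old value.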
Simple Genetic Algorithm: Steps of Generation Cycle 3
[Figure: search points converging on the goal function surface; the marked point is 12 cm, 10°]
• What happens? Viewed graphically over numerous generations, the search points converge towards the peak(s) of the goal function with an interesting zig-zag, jumping, non-continuous movement
• Termination: the algorithm is terminated if:
  • The average goal function value
  • Of the best 10% of the population members
  • Does not improve by 1%
  • During 20 generations
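The termination rule can be sketched as a check on the history of elite averages. Reading "does not improve by 1%" as a relative gain compared with the value 20 generations earlier is an interpretation, not something the slide spells out; the function names and the sample history are hypothetical.

```python
def elite_average(fitness, share=0.10):
    """Average goal value of the best `share` of the population (at least one)."""
    k = max(1, int(len(fitness) * share))
    return sum(sorted(fitness, reverse=True)[:k]) / k


def should_terminate(history, window=20, min_gain=0.01):
    """history: elite averages, one per generation, oldest first."""
    if len(history) <= window:
        return False  # not enough generations observed yet
    old, new = history[-window - 1], history[-1]
    # Guard against a zero reference value when computing the relative gain.
    return new - old < min_gain * max(abs(old), 1e-12)


# Illustrative run: the elite improves for 5 generations, then stalls.
history = [3.0 + 0.1 * min(generation, 5) for generation in range(30)]
print(should_terminate(history[:10]), should_terminate(history))  # False True
```

The stalled run is stopped only after 20 further generations without progress, which is the wasted computation that comes with a non-exact termination criterion.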
Simple Genetic Algorithm: Evaluation, Effects on Molecular Genetics
• Evaluation of SGA:
  • Non-optimal heuristic: the optimum is found only with a given probability
  • It has a higher computational requirement than other GO heuristics
  • Although the underlying math is dead simple, it has many manually set parameters (population size, coding resolutions, termination criteria, etc.) that require experimenting
  • Non-exact termination criteria: it wastes computations
  • Efficient handling of highly non-linear, multimodal, discrete-valued goal functions
  • It places only 1 requirement on the goal function: it must be computable from the decision variables
• How Decision Theory supports Molecular Genetics through the discovery of SGA:
  • New insights in the debate "Why does evolution work?"
    • Proof against the mainstay of anti-evolutionism: at the beginning of the new millennium, denying evolution became very fashionable, backed by religious groups and by neo-conservative politics in power, and fed by the disillusionment of the masses that science could not solve major problems.
    • The main argument of these groups is that biological evolution, a purely random process, cannot create such perfect "machines" as a cat's leg or the human brain
    • So there is supposedly a need to assume a transcendent Intelligent designer (Intelligens tervező) behind it, who designed the cat as a cat and man as man, while all the processes of DNA replication can cause only insignificant changes during evolution. Really?
    • In SGA almost everything is uniformly distributed random (initialization, mating, recombination points, mutation) except one thing: the mating probability is strictly proportional to the degree of fitness. Thanks to that alone, it works pretty well at solving highly complex engineering problems.
    • The same holds for biological evolution, with the addition that the multi-level control network of gene expression, which we are only beginning to understand, is also very far from random
  • An annoying example for conservative prophets and preachers: how does a 400 KB piece of software written in Java become an "intelligent, creative designer"?
    • We re-run SGA on a TSP in which 19 points are arranged around a circle
    • Starting randomly, SGA quickly (in 2582 trials) evolves the one and only, trivial optimal solution, and it seems as if it had "creatively invented the circle"
    • Of course it did not invent anything: it just found the optimal solution in a special problem environment, knowing nothing about geometry…
New insights in "Why does evolution work this way?": Schema Theory 1
• The more you know about Molecular Genetics, the more "overcomplicated" it seems. Although SGA is a drastic simplification of biological evolution, the abstract mathematical simulation can give new insights into why these "overcomplications" are necessary to keep the system balanced:
• "Why do 3 nucleotides code 1 amino acid?": Schema Theory (Sémaelmélet) and the role of Partial sequence alignment (Részleges Szekvencia Illesztés):
  • As biologists have enough problems with 64-valued triplet codons, they rarely ask "Why is it not 4 or 5?" or "Would more than 4 possible nucleotide values in one position be more or less effective?". The results suggested by SGA simulation are quite surprising:
  • SGA is more effective than the Gradient Algorithm, and not because it uses more search points. It can be shown that SGA with a population of 100 is still better than a 100× Multistart Gradient. Why?
    • While Gradient can pull search points only along a continuous line,
    • And Simulated Annealing can make only small random jumps around this line (or big ones, losing the target),
    • SGA can move search points in big, well-targeted jumps among far distant peaks of a highly multimodal goal function
  • How is that possible?
New insights in "Why does evolution work this way?": Schema Theory 2
[Figure: Efficiency plotted over the Lng and Deg axes (values 0 to 31), overlaid with example schemas built from 0, 1 and *]
• It is because in SGA the optima are searched not by search points but by Schemas (Sémák):
  • Schemas are sequence masks for binary chromosomes, using 3 possible values in each position: 1, 0 and *, where * is a wildcard that can denote both 0 and 1. In biology this is called a Partially Aligned Sequence (Részlegesen illeszkedő szekvencia)
  • Graphically, a schema creates a grid with a special pattern, called a Binary Space Wave (Bináris térhullám), in the coordinate system of the decision variables, called the Decision Space (Döntési tér)
  • The Compatibility of a schema (Séma kompatibilitás) with a sequence is the percentage of matching positions, where a * counts as a half match to both 0 and 1
  • Of course, schemas exist only virtually, when there are many highly compatible search points sharing a common schema. These points appear at the peaks of the schema's space wave
  • Recombination of 2 highly compatible search points does not destroy their common schema: their children are also compatible with the schema and simply show up at different peaks of its wave
  • This is how search points can make big, targeted jumps by random recombination
  • There is a much bigger variety of schemas ($3^n$) than of search points ($2^n$ in the case of n bits)
  • Therefore, if a schema becomes the Winner (Győztes séma), supported by many search points with high goal function values, it kills all other schemas
  • Schema interference (Séma interferencia): the peaks of the winner schema cover the peaks of the goal function well, so the jumping search points quickly map all optima of a multimodal goal function
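Schema compatibility and the points a schema covers can be sketched as below, following the slide's rule that a wildcard counts as a half match. The function names and the example schemas are hypothetical and are not read off the original figure.

```python
def compatibility(schema, chromosome):
    """Share of matching positions; a '*' wildcard counts as a half match."""
    score = 0.0
    for s, c in zip(schema, chromosome):
        if s == "*":
            score += 0.5
        elif s == c:
            score += 1.0
    return score / len(schema)


def instances(schema):
    """All chromosomes the schema covers (each '*' expands to both 0 and 1)."""
    results = [""]
    for s in schema:
        results = [r + b for r in results for b in ("01" if s == "*" else s)]
    return results


n = 5
print(3 ** n, "schemas versus", 2 ** n, "chromosomes of", n, "bits")
print(compatibility("1*0**", "11001"))             # 0.7
print([int(c, 2) for c in instances("00***")])     # [0, 1, 2, 3, 4, 5, 6, 7]
print([int(c, 2) for c in instances("***00")])     # every 4th value: the "wave"
```

Wildcards in the high-order bits, as in `***00`, select regularly spaced values across the whole range, which gives a feel for the grid pattern the slide calls a binary space wave.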
New insights in "Why does evolution work this way?": Schema Reduction, Elitism
• Schema reduction (Séma redukció): schema theory argues that binary coding of chromosomes is the most effective, as:
  • Recombination can then mix the smallest possible units of genomic information
  • The space grid structures of binary schemas can be the most complex, able to follow the peaks of complex multimodal goal functions
• Implications for biology:
  • Longer codons built from nucleotides with fewer possible values would theoretically be more effective, if we ignore biochemical stability issues
  • Direct coding of amino acids into 1 position with 20 possible values would be totally ineffective, as schemas could not form any space grid this way
  • Exact sequence matching has negligible importance compared with partial sequence alignment when searching for complex control mechanisms in the genome
• "Why is the pre-programmed death of cells (Apoptosis) important in evolution?": Elitism (Elitizmus) and Convergence (Konvergencia):
  • Both SGA and biological evolution are slowed down because well-performing units, called the Elite (Elit), can with a small chance mate with ill-performing ones, creating Degenerated (Degenerált) units in the next generation
  • If we allow the elite (the best 5..10% of the population) to be copied into the next generation, the convergence of SGA towards the optima accelerates considerably
  • However, there is a nasty Stalling effect (Beragadás) if elitism is enforced too strongly. At an early stage of evolution, the elite that only SEEMS to be the best spreads its schema very rapidly within a few generations. Genetic diversity (Genetikai változatosság) is thus exterminated, and the real optima can never be reached
  • Mutation can counterbalance this by increasing diversity, but it acts randomly, producing 10000 under-performing mutants for every over-performing one.
  • So it is not a real counterbalance to over-elitism, just a last remedy with slight hope
  • Apoptosis is one possible effective counterbalance, limiting the direct flow of genetic information across several generations
New insights in "Why does evolution work this way?": Degeneration, Mutation, Cut points
• "Why is the elite not explicitly forbidden to mate with degenerated elements?": Sensitivity Analysis (Érzékenységi teszt):
  • As mentioned, such mating slows convergence considerably. Then why is it still there?
  • Because an optimal solution is judged not only by its goal function value but also by its Sensitivity (Érzékenység): how badly is it affected when the decision variables change a little, randomly and uncontrollably?
  • An annoying example for wannabe-broker biologists:
    • In FOREX trading, your investment (and your profit from a favourable currency price change…) is leveraged 100×, 250× or even 400× by the FOREX provider. However, if the price moves just 1/100, 1/250 or 1/400 of a unit in the unfavourable direction, you lose everything!
    • With a bank deposit, you get 6..7% of your investment as profit, as long as the bank does not go bankrupt
    • Which one would you choose?
  • To make things even nastier, optimality and stability work against each other:
    • A very precisely optimized car manufacturing line is inflexible: it gets stuck if any of the key suppliers drops out
    • The Hungarian (un)health system is incredibly wasteful and highly ineffective, but it can flexibly resist any reform, as hospitals try to build hidden last reserves
  • Degenerate mating performs a sensitivity analysis of the elite: if the elite is stable and well founded, then after degeneration it develops again within just a few generations
• "Is mutation or recombination more important?":
  • At the dawn of Molecular Genetics, mutations were thought to be much more important than site-specific and general recombination, but this view changed rapidly
  • SGA works well without any mutation; leaving it out only slightly increases the chance of getting stuck. But SGA is absolutely useless without recombination: mutation alone is an inferior random search
• "Are denser recombination points more effective or not?":
  • In SGA there is an empirical rule of thumb that in most problems about 1 cut per 1000 bits of chromosome length is the most effective
  • Denser multiple cuts may mix the genetic material better and faster, but they also cut vital information into parts too short to be effective
New insights in "Why does evolution work this way?": Genders, Species, Diploidy
• "Why are there males and females?": Role of Genders (Nemek):
  • In biology most higher-order organisms have 2 genders
  • If we halve the population into 2 subgroups whose members cannot mate within their own group, only across groups, it provides an effective counterbalance against the stalling effect of over-elitism
  • A much higher elite rate can be maintained safely than in unisex SGA, with an only slightly more complicated algorithm (mating probabilities have to be calculated separately for males and females)
• "Why can't a hedgehog and an anaconda have a baby barbed wire?": Role of Species (Fajok):
  • If a mating limit is imposed among search points whose chromosomes have very low compatibility, it enhances the discovery of distant alternative optima of a highly multimodal goal function
  • Just like in biology, each species tries to occupy a separate niche
• "Why do higher-order organisms have multiple sets of chromosomes?": Role of Diploidy (Diploiditás):
  • Its use in SGA pays off if the goal function shifts among some alternative versions over time faster than the optimal solution can be computed
  • The chromosomes are doubled and not all of their bits are coding: usually one half of the bits control the dominance of the {0, 1} values in the gene-coding half, so the memory requirement is quadrupled
  • Variable values are translated from the dominant chromosome values
  • With this solution, dominance control is also maintained within the framework of SGA, and the control evolves together with the controlled system. (We can see a similar recursion of evolution in the biology of eukaryotes, when general recombination also cuts within introns or promoters!)
  • Recessive values can hide the genetic code of alternative search points from selection for as long as a version of the goal function that is unfavourable to them prevails
  • When the goal function shifts rapidly, the recessive code becomes dominant, enabling a very rapid transformation of the phenotype for survival
New insights in „Why Evolution works this way”: Transposons, Amplification • „Are transposons just genomic parasites?”: Sequence optimizing problems: • When solving the Travelling Salesman Problem (TSP) and similar problems, SGA using transposition mutation is far more effective, because it can shift the sequence of genes while leaving their content relatively untouched • „Are Gene/Exon amplifications just wasted space in the genome?”: Redundant coding: • Storing multiple working copies of a gene and averaging their stored values after translation can be an effective counterbalance against getting stuck through over-elitism • The final and most important effect of Decision Theory on Molecular Genetics: • Supporting Bioinformatic methods with effective SGA-optimization: The most difficult analyses in bioinformatics, e.g.: • Gene Search (Génkeresés), • Phylogenetic Trees (Filogenikus fák), • Protein Structure (Fehérje térszerkezet) are all nasty optimization problems with highly nonlinear, multimodal goal functions and a lot of discrete variables
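A transposition mutation for sequence problems such as TSP can be sketched as follows. This is a minimal illustration, assuming a tour is stored as a list of city labels in visiting order; the function name `transpose_segment` and the parameter `max_segment` are hypothetical and not taken from any specific GA package.

```python
import random


def transpose_segment(tour, max_segment=3, rng=random):
    """Cut a short contiguous segment out of the tour and reinsert it elsewhere.

    The result contains exactly the same cities as the input, so it is
    always a valid tour; only the visiting order changes.
    """
    tour = list(tour)
    n = len(tour)
    if n < 2:
        return tour

    length = rng.randint(1, min(max_segment, n - 1))   # segment size
    start = rng.randint(0, n - length)                 # where to cut it out
    segment = tour[start:start + length]
    rest = tour[:start] + tour[start + length:]

    # The new position may coincide with the old one, in which case the
    # tour is returned unchanged.
    insert_at = rng.randint(0, len(rest))
    return rest[:insert_at] + segment + rest[insert_at:]


if __name__ == "__main__":
    tour = ["A", "B", "C", "D", "E", "F", "G"]
    print(transpose_segment(tour))
```

Unlike flipping individual bits of an encoded tour, this operator can never create a duplicated or missing city, which is one way to read the claim that transposition moves genes while leaving their content untouched.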
Home Assignment 4: Play with GA Playground • Download & Install GA Playground! (3pts) • Download GAPlayGround from the link: http://www.aridolan.com/ga/gaa/gaa.html#Download and extract it into a directory named C:/GAPlGr • Download Java Expression Library from http://galaxy.fzu.cz/JEL/ The file jel-0_9_11.zip should be extracted into the C:/JEL directory • Copy the file JEL.JAR from the C:/JEL/Lib directory into C:/GAPlGr and OVERWRITE the JEL.JAR file of the same name already there, otherwise it will not work! • Download Java Runtime Environment v1.1 or newer from: http://www.java.com/en/download/index.jsp • Save the following two lines of text into a START.BAT file in the directory C:/GAPlGr: • set CLASSPATH=.;C:\GaPlGr\gaa.jar;C:\GaPlGr\ScsGrid.jar;C:\GaPlGr\tabsplitter.jar;C:\GaPlGr\jel.jar; • java GaaApplet • Launch START.BAT. As with all similar JRE-based applications, don’t close the DOS window that appears while GAPlayGround is running, otherwise it will not work! • The Graphic User Interface will show up in 5-10 seconds • Try out its built-in demo samples! (2pts) • Simulations can be launched by clicking the lower dropdown
References • Analytic Optimization: • http://en.wikipedia.org/wiki/Optimization_(mathematics) • Gradient Algorithms: • http://en.wikipedia.org/wiki/Gradient_descent • Simulated Annealing: • http://en.wikipedia.org/wiki/Simulated_annealing • Linear programming (LP): • Gams Inc.: http://www.gams.com/modlib/libhtml/subindx.htm • Optimax Inc.: http://www.maximal-usa.com/ • Frontline Inc.: http://www.solver.com/sdkplatform.htm • Brunel: http://people.brunel.ac.uk/~mastjjb/jeb/or/lp.html • Simple Genetic Algorithm (SGA): • http://cs.felk.cvut.cz/~xobitko/ga/ • http://lancet.mit.edu/~mbwall/presentations/IntroToGAs/ • http://www.geneticprogramming.com/ • Anti-evolutionism, creationism: • http://www.creationism.org/ • Schema Theory: • http://www.springerlink.com/content/g0w8dvy3dupqdhej/ • Apoptosis: • http://en.wikipedia.org/wiki/Apoptosis • Sensitivity Analysis: • http://www.mba.bme.hu/data/jegyzet/koltaitamas/tterv_mba.pdf • SGA software: • http://www.hao.ucar.edu/Public/models/pikaia/pikaia.html • http://kal-el.ugr.es/~jmerelo/GAJS.html • http://www.rennard.org/alife/english/gavgb.html • http://www.sambee.co.th/MazeSolver/mazega.htm • http://ai.bpa.arizona.edu/~mramsey/ga.html