Models and Methods of Optimization for Computational Biology and Medicine. Giuseppe Lancia, Università di Udine.
Human Genome Project (1990): read and understand the human (and other species') genome (DNA). Scientific and practical applications (medicine, agriculture, forensics, ...). A multi-million project involving several countries. MOLECULAR BIOLOGY
COMPUTER SCIENCE
To meet the goal we need computers and programs to deal with problems such as:
• Huge data sets (billions of pieces of information to be analyzed)
• Data interpretation (contradictory, erroneous or inconsistent data)
• Data sharing and networks (online genomic data banks)
Computational (molecular) Biology
“Optimization problems arising in the analysis, interpretation and management of large sets of genomic data”
• Combinatorics, Discrete Math
• Combinatorial Optimization
• Integer Programming
• Complexity theory (Approximations and Hardness)
• Graph theory
• (but also Stringology, Databases, Neural Networks, ...)
FIRST PHASE COMPLETION: HUMAN GENOME SEQUENCE - 2001
• World consortium of universities and labs
• Celera Genomics (Craig Venter)
Computational Biology was born around the 1980s-1990s
• Algorithmic approaches (e.g. Dynamic Programming for alignment)
• Computational complexity (e.g. NP-hardness of folding)
• String-related problems, information retrieval, genomic databases, ...
• ...
• The field was mostly dominated by computer scientists
Some NP-hard problems in C.B. are OPTIMIZATION PROBLEMS
• These can be solved via mathematical programming techniques:
  - INTEGER LINEAR PROGRAMMING
  - LAGRANGIAN RELAXATION
  - SEMIDEFINITE PROGRAMMING
  - QUADRATIC PROGRAMMING
• O.R. people (and O.R. techniques) entered the field
The most important approach of this type is probably Integer Linear Programming. It allows the solution of NP-hard problems via specialized Branch and Bound algorithms. The lower bound comes from the Linear Programming relaxation of the model (and can be computed in polynomial time).
The I.P. approach
1. Define integer (usually binary) variables.
   There can be an exponential number of variables. We need to make them implicit: BRANCH-AND-PRICE
2. Define linear constraints that feasible solutions must satisfy.
   There can be an exponential number of constraints. We need to make them implicit: BRANCH-AND-CUT
3. Define a linear objective function that the optimal solution must minimize (or maximize).
4. Solve by Branch and Bound. The LP relaxation (remove the integrality requirements on the variables) gives the bound.
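The four steps of the I.P. approach can be tried out on a toy model. The following is a minimal sketch, assuming a 0/1 knapsack as the model (it is not one of the biology models of this talk, and the function name `knapsack_branch_and_bound` is hypothetical). The knapsack is convenient because its LP relaxation can be solved greedily, so no LP solver is needed. Since the sketch maximizes, the relaxation gives an upper bound rather than a lower bound.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Maximize sum(values[i] * x[i]) subject to sum(weights[i] * x[i]) <= capacity,
    with binary x[i]. Weights are assumed to be positive."""
    # Items sorted by value density: the LP relaxation is solved by taking items
    # in this order and splitting the first one that does not fit.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    best_value = 0
    best_items = []

    def lp_bound(k, value, room):
        # Relax x[i] in {0, 1} to 0 <= x[i] <= 1 for the undecided items order[k:].
        bound = value
        for i in order[k:]:
            if weights[i] <= room:
                room -= weights[i]
                bound += values[i]
            else:
                bound += values[i] * room / weights[i]
                break
        return bound

    def branch(k, value, room, chosen):
        nonlocal best_value, best_items
        if value > best_value:
            best_value, best_items = value, sorted(chosen)
        # Prune when the relaxation cannot beat the incumbent solution.
        if k == len(order) or lp_bound(k, value, room) <= best_value:
            return
        i = order[k]
        if weights[i] <= room:
            branch(k + 1, value + values[i], room - weights[i], chosen + [i])  # x[i] = 1
        branch(k + 1, value, room, chosen)                                     # x[i] = 0

    branch(0, 0, capacity, [])
    return best_value, best_items


print(knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50))  # (220, [1, 2])
```

In the models of the talk the relaxation would be a general linear program, and Branch-and-Price or Branch-and-Cut would generate variables or constraints on demand. This sketch shows only the plain Branch and Bound skeleton.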
Some Integer Programming models in C.B.
• Haplotyping
  - Clark's rule (Gusfield)
  - Parsimony (Gusfield, Lancia+Pinotti+Rizzi, Brown+Harrower)
  - Fragment assembly (Lancia+Bafna+Schwartz+Istrail)
• Protein folding
  - Energy potentials (Wagner+Meller+Elber)
  - Folding ab initio (Carr+Hart+Greenberg)
  - Threading (RAPTOR, Xu+Li+Kim+Xu)
  - Docking (Doye+Leary+Locatelli+Schoen, Althaus+Kohlbacher+Lenhof+Muller)
  - Fold comparison (Carr+Lancia+Istrail+Walenz, Caprara+Lancia)
• Sequence alignment and consensus
  - Lenhof+Vingron+Reinert
  - Althaus+Caprara+Lenhof+Reinert
  - Fischetti+Lancia+Serafini
  - Meneses+Lu+Oliveira+Pardalos
• Physical mapping
  - Alizadeh+Karp+Weisser+Zweig
• Genome rearrangements
  - Caprara+Lancia
We'll see some examples from the list above.
A genome is a long string over the DNA alphabet {A,C,G,T}. In humans it is some 3,000,000,000 letters long. DNA is responsible for our diversity as well as our similarity. Small changes in a genome can make a big difference, like from... to...
Eukaryotic diploid organisms
[Figure: cell, nucleus, chromosomes (in pairs), and a stretch of double-stranded DNA: TCATCGA paired with AGTAGCT]
THE CENTRAL DOGMA OF MOLECULAR BIOLOGY
A GENE (introns and exons):
attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta
exons: attag|ctatatcgttgatg|tatatgcta|cga|aatta
codon triplets: att agc tat atc gtt gat gta tat gct acg aaa tta
amino acids (a PROTEIN): R N C A S S F C W Y Q V
The protein folds to a 3D shape to perform its function.
CENTRAL DOGMA: 1 gene → 1 protein
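The path from exons to codon triplets to amino acids can be reproduced in a few lines. This is a minimal sketch, assuming the standard genetic code (the slide does not say which table it uses) and skipping the transcription step, so codons are written with `t` rather than `u`. The names `splice` and `translate` are hypothetical. Under this assumption the twelve triplets of the slide translate to `ISYIVDVYATKL`, so the amino-acid letters printed on the slide are best read as schematic.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order t, c, a, g.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def splice(exons):
    """Join the exons (introns already removed) into one coding string."""
    return "".join(exons)


def translate(coding):
    """Translate a coding string codon by codon; '*' marks a stop codon.
    A trailing incomplete codon is ignored."""
    coding = coding.lower()
    return "".join(
        CODON_TABLE[coding[i:i + 3]] for i in range(0, len(coding) - 2, 3)
    )


exons = ["attag", "ctatatcgttgatg", "tatatgcta", "cga", "aatta"]
coding = splice(exons)
print(" ".join(coding[i:i + 3] for i in range(0, len(coding), 3)))
# att agc tat atc gtt gat gta tat gct acg aaa tta
print(translate(coding))  # ISYIVDVYATKL
```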
From DNA to strings (i.e. reading our genome): shotgun sequencing
ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
amplification: identical copies of the sequence (four are shown on the slide)
fragmentation: ACTGA, GATTT, GCCTAG, CTATCTT, ATAGATA, GAGATTTC, TAGAAATC, TGAGCCTAG, TAGAGATTTC, TCCTAAAGAT, CGCATAGATA
sequencing: TGAGCCTAG, GATTT, GCCTAG, CTATCTT, ATAGATA, GAGATTTC, TAGAAATC, ACTGA, TAGAGATTTC, TCCTAAAGAT, CGCATAGATA
assembly
Assembly
ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG (shown four times on the slide)
We need to merge the fragments in order to retrieve the original DNA sequence
- 50,000,000 fragments
- 1000 characters each...
We'd better use computers!
Understanding the problem
- Take 10 copies of «Corriere della Sera»
- Cut each into tiny pieces (1 cm²)
- Put the pieces in a bag and shuffle
- Grab five handfuls of pieces and throw them away
Problem: retrieve the newspaper from the remaining pieces
Difficulties: repeated words (e.g., «ha», «dopo», «quando», «governo»...)
• Let's make the problem harder:
- Ultra-tiny pieces (1 mm²)
- It is not the «Corriere della Sera», but the Treccani encyclopedia (20 volumes)
- It is written in Chinese!!
Still, the problem would be easier than sequencing the human genome
Assembly
• Repeated words and missing words create problems that must be solved by sophisticated programs, based on statistics and mathematical models.
• The basic underlying problem (leaving aside the above complications) is called the Shortest Superstring Problem (SSP)
Shortest superstring:
• Given a set of strings s_1, ..., s_n, find a shortest string s that contains each s_i as a substring.
Shortest superstring:
Strings: cane, nerone, onere, rene, neon
reneroneoncaneonere -> 19
caneronereneonere -> 17
caneroneonerene -> 15
Strings: acct, cattgt, gtgcca, cctg
cattgtgccacctg
• The space of potential (non-redundant) solutions has size O(n!)
• The problem is NP-hard
• There is an effective greedy algorithm
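The O(n!) solution space corresponds to the possible orderings of the strings: for a fixed order, consecutive strings are merged on their longest overlap. Below is a minimal brute-force sketch that enumerates every order (usable only for a handful of strings). The function names are hypothetical, and the split of the Italian example into `cane, nerone, onere, rene, neon` is an assumption, since the slide text runs the words together.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(strings):
    """Superstring obtained by appending the strings in the given order,
    reusing the longest overlap at every step."""
    result = ""
    for s in strings:
        if s in result:  # already covered, nothing to append
            continue
        result += s[overlap(result, s):]
    return result


def shortest_superstring(strings):
    """Exhaustive search over all n! orders (exponential time)."""
    return min((merge_in_order(p) for p in permutations(strings)), key=len)


dna = ["acct", "cattgt", "gtgcca", "cctg"]
words = ["cane", "nerone", "onere", "rene", "neon"]  # assumed split
print(len(shortest_superstring(dna)))    # 14, the length of cattgtgccacctg
print(len(shortest_superstring(words)))  # 14
```

Several orders can reach the same minimum length, so the sketch reports lengths rather than one particular string.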
Greedy algorithm:
• Merge the two strings with maximum overlap. Replace them with their fusion.
• Repeat until only one string is left.
Strings: cane, nerone, onere, rene, neon
neronere
neronerene
neronereneon
caneronereneon
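The two greedy steps can be written down directly. This is a minimal sketch under stated assumptions: the function names are hypothetical, ties between equally good overlaps are broken by scan order (a choice of this sketch, not something the slide specifies), and the input words follow an assumed split of the slide's example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair with maximum overlap until one string is left."""
    pool = list(dict.fromkeys(strings))  # drop duplicates, keep order
    # Strings contained in another string never need to be merged explicitly.
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]
    while len(pool) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left string, index of right string)
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:  # strict: the first maximum found wins ties
                        best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]
        pool[i] = merged
        del pool[j]
        # Drop strings that the fusion already contains.
        pool = [s for s in pool if s == merged or s not in merged]
    return pool[0] if pool else ""


print(greedy_superstring(["nerone", "onere", "rene", "neon", "cane"]))
# caneronereneon
print(greedy_superstring(["acct", "cattgt", "gtgcca", "cctg"]))
# cattgtgccacctg
```

With this input order the merges are neronere, neronerene, neronereneon and finally caneronereneon, the same sequence as on the slide. A different tie-breaking rule may produce different intermediate merges.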
Greedy algorithm:
• In this case it found the optimum, but there is no guarantee
• There is, however, a proven worst-case guarantee: v(GREEDY) <= 3.5 v(OPT) (Kaplan and Shafrir). A different, non-greedy algorithm is a 2.5-APPROX ALGORITHM (Sweedyk)
• OPEN PROBLEM (a conjecture standing for more than 20 years): prove that GREEDY is a 2-APPROX ALGORITHM
For more info see http://fileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/superstring.pdf
Sequence alignments
• Sequences evolve and change
• E.g.: deletion, insertion, mutation
attcgattgat / attcggat (deletion)
attcgattgat / attcggat (insertion)
attcgatgcg (mutation)
Given two genomic sequences (e.g., human and mouse) we want to compare them
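One simple way to compare two sequences that differ by deletions, insertions and mutations is to count the minimum number of such single-letter operations that turn one into the other. The following is a minimal dynamic-programming sketch, assuming unit cost for each operation (real alignment scoring schemes are richer). The function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-letter deletions, insertions and substitutions
    (mutations) needed to turn a into b, computed row by row."""
    previous = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match or mutate
            ))
        previous = current
    return previous[-1]


print(edit_distance("attcgattgat", "attcggat"))  # 3: delete the letters "att"
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the running time is proportional to the product of the two lengths. Only two rows are kept in memory at a time.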