
CS 394C Algorithms for Computational Biology

CS 394C Algorithms for Computational Biology. Tandy Warnow Fall 2009. Biology: 21st century’s Science.


Presentation Transcript


  1. CS 394C Algorithms for Computational Biology. Tandy Warnow, Fall 2009

  2. Biology: 21st century’s Science “We can safely say that the 20th century has been the century of physics…. And, according to the prominent Dutch physicist, Frans Saris, the hegemony of physics in the scientific world is over. Saris proclaims the 21st century the century of biology.” http://www.freedomlab.org/2009/01/27/the-hegemony-ended-already/

  3. Biology: 21st Century Science! “When the human genome was sequenced seven years ago, scientists knew that most of the major scientific discoveries of the 21st century would be in biology.” January 1, 2008, guardian.co.uk

  4. Computational Biology: Just about anything goes!
      Biological topics: Function and structure prediction • Gene clustering • Metagenomics • Microarray analysis • Molecular modelling • Multiple sequence alignment • Neuroscience • Phylogenetic estimation • Protein docking • Protein-protein interactions • Sequence assembly
      Computer science topics: Approximation algorithms • Combinatorial optimization • Computational analysis • Computational complexity • Computational geometry • Computational image processing • Computational topology • Databases • Data mining • Graph theory and algorithms • Machine learning • Neural networks • Probability theory • Scientific visualization

  5. Genome Sequencing Projects: Started with the Human Genome Project

  6. Whole Genome Shotgun Sequencing: Graph Algorithms and Combinatorial Optimization!

  7. Where did humans come from, and how did they move throughout the globe? The 1000 Genomes Project: using human genetic variation to better treat diseases

  8. Other Genome Projects! (Neandertals, Woolly Mammoths, and more ordinary creatures…)

  9. Metagenomics: C. Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

  10. How did life evolve on earth? Current methods often take months to estimate trees on 1000 DNA sequences. Our objective: more accurate trees and alignments on 500,000 sequences in under a week. We prove theorems using graph theory and probability theory, and our algorithms are studied on real and simulated data. (Image courtesy of the Tree of Life project. Warnow and Linder Lab.)

  11. This course • Fundamental mathematics of phylogeny and alignment estimation • Applied research problems: • Metagenomics • Supertrees • Simultaneous estimation of alignments and trees • Reticulate evolution • Historical linguistics

  12. Phylogenetic trees can be based upon morphology

  13. But some estimations need DNA! [Tree figure with leaves: Orangutan, Gorilla, Chimpanzee, Human]

  14. Phylogenetic reconstruction methods
      1. Polynomial time distance-based methods: Neighbor Joining, FastME, Weighbor, etc.
      2. Hill-climbing heuristics for NP-hard optimization criteria (Maximum Parsimony and Maximum Likelihood)
      3. Bayesian methods
      [Figure: cost over the space of phylogenetic trees, with a local optimum and the global optimum marked]

  15. But solving this problem exactly is … unlikely

  16. “Boosting” MP heuristics • We use “Disk-covering methods” (DCMs) to improve heuristic searches for MP and ML [Diagram: base method M combined with a DCM gives DCM-M]

  17. Rec-I-DCM3 significantly improves performance (Roshan et al.). Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. [Plot legend: current best techniques; DCM-boosted version of best techniques]

  18. DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] • Theorem: DCM1-NJ converges to the true tree from polynomial length sequences [Plot: error rate (0 to 0.8) vs. number of taxa (0 to 1600) for NJ and DCM1-NJ]

  19. Indels and substitutions at the DNA level Deletion Mutation …ACGGTGCAGTTACCA…

  20. Indels and substitutions at the DNA level Deletion Mutation …ACGGTGCAGTTACCA…

  21. Indels and substitutions at the DNA level Deletion Mutation …ACGGTGCAGTTACCA… …ACCAGTCACCA…

  22. Deletion and mutation: …ACGGTGCAGTTACCA… evolves into …ACCAGTCACCA…
      The true pairwise alignment is:
        …ACGGTGCAGTTACCA…
        …AC----CAGTCACCA…
      The true multiple alignment on a set of homologous sequences is obtained by tracing their evolutionary history, and extending the pairwise alignments on the edges to a multiple alignment on the leaf sequences.
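The relationship between the evolutionary events and the true pairwise alignment can be sketched in a few lines of Python. This is only an illustration: the function name `evolve` and the event coordinates (a deletion of ancestor positions 2 to 5 and a T-to-C substitution at position 10, counting from 0) are assumptions chosen so that the result matches the sequences on the slide.

```python
def evolve(ancestor, deletion, substitution):
    """Apply one deletion and one substitution to `ancestor`.

    deletion: (start, end) half-open range of ancestor positions to delete.
    substitution: (position, new_base) in ancestor coordinates.
    Returns (descendant, aligned_row), where aligned_row is the descendant
    written with gaps so that it lines up column by column with the ancestor.
    """
    start, end = deletion
    pos, new_base = substitution
    aligned = []
    for i, base in enumerate(ancestor):
        if start <= i < end:
            aligned.append("-")        # deleted site becomes a gap
        elif i == pos:
            aligned.append(new_base)   # substituted site
        else:
            aligned.append(base)       # unchanged site
    aligned_row = "".join(aligned)
    return aligned_row.replace("-", ""), aligned_row


ancestor = "ACGGTGCAGTTACCA"
descendant, aligned_row = evolve(ancestor, deletion=(2, 6), substitution=(10, "C"))
print(ancestor)     # top row of the true alignment
print(aligned_row)  # bottom row of the true alignment
print(descendant)   # the sequence actually observed
```

Because the events are recorded as they happen, the alignment is known exactly. Alignment estimation has to recover the gapped row from the observed, ungapped sequences alone.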

  23. [Figure: a tree with leaves X, U, Y, V, W, estimated for the taxa U, V, W, X, Y from the sequences AGTGGAT, TATGCCCA, TATGACTT, AGCCCTA, AGCCCGCTT]

  24. SATé Algorithm (Liu et al., Warnow-Linder lab). SATé = Simultaneous Alignment and Tree Estimation, Science 2009.
      • Obtain initial alignment and estimated ML tree T
      • Use new tree (T) to compute new alignment (A)
      • Estimate ML tree on new alignment A, then repeat from the previous step
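The alternation between alignment and tree estimation can be sketched as a plain loop. This is not the SATé implementation: `iterate_alignment_and_tree`, `initial_align`, `realign`, and `estimate_tree` are hypothetical names, and the toy stand-ins below only pad sequences and sort taxon names so that the control flow can be run end to end. A real analysis would plug in an alignment method and an ML tree estimator.

```python
def iterate_alignment_and_tree(seqs, initial_align, realign, estimate_tree, rounds=3):
    """Alternate between tree-guided realignment and tree estimation.

    All three callables are supplied by the caller (hypothetical interfaces):
      initial_align(seqs) -> alignment
      realign(seqs, tree) -> alignment computed with `tree` as a guide
      estimate_tree(alignment) -> tree
    Returns the list of (alignment, tree) pairs, one per iteration.
    """
    alignment = initial_align(seqs)
    tree = estimate_tree(alignment)
    history = [(alignment, tree)]
    for _ in range(rounds):
        alignment = realign(seqs, tree)   # use the current tree to realign
        tree = estimate_tree(alignment)   # re-estimate the tree on the new alignment
        history.append((alignment, tree))
    return history


# Toy stand-ins, for demonstration only (no real alignment or ML estimation).
def pad_align(seqs):
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}

def toy_realign(seqs, tree):
    return pad_align(seqs)                # ignores the guide tree

def toy_tree(alignment):
    return tuple(sorted(alignment))       # a "tree" that is just the sorted taxon names


seqs = {"s1": "ACGT", "s2": "ACGGT", "s3": "AGT"}
history = iterate_alignment_and_tree(seqs, pad_align, toy_realign, toy_tree, rounds=2)
final_alignment, final_tree = history[-1]
```

Keeping the whole history rather than only the last pair makes it easy to select the best iteration afterwards, for example by a likelihood score.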

  25. SATé Results on 1000 taxon datasets • 24 hour SATé analysis • Other simultaneous estimation methods cannot run on large datasets

  26. Reconstructing the Tree of Life Challenges: - millions of species - lots of missing data Two possible approaches: - Combined Analysis - Supertree Methods

  27. Combined Analysis Methods. Source alignments, one per gene:
      gene 1: S1 TCTAATGGAA, S2 GCTAAGGGAA, S3 TCTAAGGGAA, S4 TCTAACGGAA, S7 TCTAATGGAC, S8 TATAACGGAA
      gene 2: S4 GGTAACCCTC, S5 GCTAAACCTC, S6 GGTGACCATC, S7 GCTAAACCTC
      gene 3: S1 TATTGATACA, S3 TCTTGATACC, S4 TAGTGATGCA, S7 TAGTGATGCA, S8 CATTCATACC

  28. Combined Analysis
          gene 1      gene 2      gene 3
      S1  TCTAATGGAA  ??????????  TATTGATACA
      S2  GCTAAGGGAA  ??????????  ??????????
      S3  TCTAAGGGAA  ??????????  TCTTGATACC
      S4  TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
      S5  ??????????  GCTAAACCTC  ??????????
      S6  ??????????  GGTGACCATC  ??????????
      S7  TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
      S8  TATAACGGAA  ??????????  CATTCATACC
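The combined-analysis matrix above can be built mechanically from the per-gene alignments: concatenate the genes and fill every species missing from a gene with "?". The sketch below assumes each gene alignment is a dictionary from species name to aligned sequence; the function name `concatenate` is hypothetical, and the assignment of sequences to genes is read off the two slides.

```python
# Per-gene alignments, keyed by species (data taken from the slides).
genes = {
    "gene 1": {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
               "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"},
    "gene 2": {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
               "S7": "GCTAAACCTC"},
    "gene 3": {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
               "S7": "TAGTGATGCA", "S8": "CATTCATACC"},
}


def concatenate(genes, missing="?"):
    """Concatenate gene alignments into one matrix, padding absent species."""
    species = sorted({sp for alignment in genes.values() for sp in alignment})
    matrix = {sp: "" for sp in species}
    for alignment in genes.values():
        width = len(next(iter(alignment.values())))  # all rows of a gene share one width
        for sp in species:
            matrix[sp] += alignment.get(sp, missing * width)
    return matrix


matrix = concatenate(genes)
for sp, row in matrix.items():
    print(sp, row)
```

Only S4 and S7 have data for all three genes, which illustrates the "lots of missing data" challenge: most of the matrix for S2, S5, and S6 is "?".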

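  29. Two competing approaches for a species-by-genes dataset (gene 1, gene 2, . . . , gene k): (1) analyze each gene separately, then combine the resulting trees with a supertree method; (2) combined analysis of all genes together.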

  30. Many Supertree Methods: MRP (Matrix Representation with Parsimony, most commonly used) • weighted MRP • Min-Cut • Modified Min-Cut • Semi-strict Supertree • MRF • MRD • QILI • SDM • Q-imputation • PhySIC • Majority-Rule Supertrees • Maximum Likelihood Supertrees • and many more ...

  31. SuperFine: a new supertree method (Swenson et al., Warnow-Linder Lab) Input: unrooted source trees Output: unrooted tree on the entire set of taxa Not yet submitted for publication

  32. Algorithmic strategy of SuperFine • First, construct a supertree with low false positives (the Strict Consensus Merger) • Then, refine the tree to reduce false negatives by resolving each polytomy (Quartet Max Cut)

  33. [Plot: false negative rate vs. scaffold density (%)]

  34. [Plot: false negative rate vs. scaffold density (%)]

  35. Running time, SuperFine vs. MRP: MRP 8-12 sec., SuperFine 2-3 sec. [Plots: running time vs. scaffold density (%), three panels]

  36. An open problem in supertree estimation • Input: a collection of source trees on subsets of a set S of taxa • Output: tree T on the full set of taxa that minimizes the sum of the topological distances to the source trees This is NP-hard. How can we solve this effectively?
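The objective in this problem can be written down concretely once a topological distance is fixed. The sketch below assumes the Robinson-Foulds distance (the number of bipartitions found in one tree but not the other), computed between each source tree and the candidate supertree restricted to that source tree's taxa. The slide does not say which distance is meant, so this choice is an assumption, as are the nested-tuple tree encoding and all function names.

```python
def leaves(tree):
    """Leaf set of a tree written as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree, taxa):
    """Nontrivial bipartitions of `tree` restricted to `taxa`.

    Each bipartition is stored as the side that does not contain a fixed
    reference taxon, so the arbitrary rooting of the tuples does not matter.
    """
    taxa = frozenset(taxa)
    ref = min(taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node]) & taxa
        clade = frozenset()
        for child in node:
            clade |= visit(child)
        side = clade if ref not in clade else taxa - clade
        if 2 <= len(side) <= len(taxa) - 2:   # keep only nontrivial splits
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b, taxa):
    """Robinson-Foulds distance between two trees restricted to `taxa`."""
    return len(bipartitions(tree_a, taxa) ^ bipartitions(tree_b, taxa))


def supertree_cost(supertree, source_trees):
    """Sum of RF distances from the supertree to every source tree."""
    return sum(rf_distance(supertree, s, leaves(s)) for s in source_trees)


candidate = ((("A", "B"), "C"), ("D", "E"))
sources = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "E"))]
cost = supertree_cost(candidate, sources)
```

Evaluating one candidate is cheap. The difficulty is that the number of candidate trees grows super-exponentially with the number of taxa, which is why heuristic search over tree space is the usual approach.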

  37. Historical linguistics • Languages evolve, just like biological species. • How can we determine how languages evolve? • How can we use information on language evolution, to determine how human populations moved across the globe?

  38. Questions about Indo-European (IE) • How did the IE family of languages evolve? • Where is the IE homeland? • When did Proto-IE “end”? • What was life like for the speakers of proto-Indo-European (PIE)?

  39. The Anatolian hypothesis (from wikipedia.org) Date for PIE ~7000 BCE

  40. The Kurgan Expansion • Date of PIE ~4000 BCE. • Map of Indo-European migrations from ca. 4000 to 1000 BC according to the Kurgan model • From http://indo-european.eu/wiki

  41. Estimating the date and homeland of the proto-Indo-Europeans • Step 1: Estimate the phylogeny • Step 2: Reconstruct words for proto-Indo-European (and for intermediate proto-languages) • Step 3: Use archaeological evidence to constrain dates and geographic locations of the proto-languages

  42. “Perfect Phylogenetic Network” (Nakhleh et al., Language)

  43. Reticulate evolution • Not all evolution is tree-like: • Horizontal gene transfer • Hybrid speciation • How can we detect reticulate evolution?

  44. Metagenomics • Input: set of sequences • Output: a tree on the set of sequences, indicating the species identification of each sequence • Issue: the sequences are not globally alignable, and there are often thousands (or more) of the sequences

  45. Course Details • Phylogeny and multiple sequence alignment are the basis of almost everything in the course • The first 1/3 of the class will provide the basics of the material • The next 2/3 will go into depth on selected topics

  46. Course details • There is no textbook • I will provide notes (online) • We will read papers from the scientific literature Your course project will most likely be a literature survey. If you prefer, you can engage in research!

  47. Grading • Homeworks 50% • Final exam 25% • Class participation 10% • Class project 15%
