Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study

Ming-Yang Kao, Department of Computer Science, Northwestern University, Evanston, Illinois, U.S.A.

Perspectives • Biology → computer science: use biology ideas to solve computer science problems • Computer science → biology: use computer science tools to solve biology problems (this talk)

Use Biology to Solve CS Problems • DNA Computing • DNA Self-Assembly • Genetic Algorithms • Neural Networks • Others

Use CS to Solve Biology Problems • Bioinformatics or Computational Biology: data mining (this talk) • Related fields: computational neuroscience, computational ecology, medical informatics, … many more ...

Example Research Areas of Bioinformatics • DNA sequencing • DNA microarray analysis • DNA self-assembly for nano-structures • DNA word design • RNA secondary structure prediction • Protein sequencing (my talk #4) • Proteomics • Protein database search • Protein sequence design (my talk #3) • Protein landscape analysis • Phylogeny reconstruction (this talk) • Phylogeny comparison (my talk #1)

Evolutionary Trees • Definition: a tree with distinct labels at its leaves. • Leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc. [Figure: a tree with ancestral species at the internal nodes and present-day species at the leaves: wheat, rice, peach, plum, bird (just a joke!)]

Evolutionary Trees • Leaf labels: DNA sequences. [Figure: the same tree with leaves wheat (CGGC), rice (CGGG), peach (CCAT), plum (CCAG), bird (AAGT) (just a joke!)]

Problem Formulation • Input: DNA sequences of present-day species • Output: the true evolutionary tree • Question: What is “true”? Need a model! [Figure: the example tree with leaves wheat, rice, peach, plum, bird (just a joke!) and leaf sequences CGGC, CGGG, CCAT, AAGT, CCAG]

A Fundamental Problem of Biology • Problem (open since the time of Charles Darwin): reconstruct the evolutionary history of all known species. • Importance: intellectually fascinating; practical benefits – medicine, food … • Charles Robert Darwin --- 1809-1882 • Origin of Species --- 1859

Main Difficulties • Availability of data • Hundreds of millions of species --- unlikely to be all available any time soon or ever. • But DNA sequences of more and more species are becoming available. • Extracting information from data • focus of this talk

Today’s Technical Focus • Input: DNA sequences of present-day species • Output: the true evolutionary tree • Question: What is “true”? Need a model! • Collaborators: Csuros & Kim [Figure: the example tree with leaves wheat, rice, peach, plum, bird and leaf sequences CGGC, CGGG, CCAT, AAGT, CCAG]

Main Result An algorithm that constructs an evolutionary tree from biomolecular sequences • Provable high accuracy • Short sequence length • Optimal running time • Optimal memory space

Outline of Technical Discussion • Define the model of evolution. • Formulate the computational problem. • Discuss the theoretical performance of our algorithm. • Discuss the empirical performance. • Describe and analyze the algorithm. • Further research.

Outline of Technical Discussion (1) • Define the model of evolution. • Formulate the computational problem. • Discuss the theoretical performance of our algorithm. • Discuss the empirical performance. • Describe and analyze the algorithm. • Further research.

Model of Evolution: Intuitions • Mutation occurs probabilistically. • edge length ~ time • edge length ~ mutation probability • edge length ~ dissimilarity (or distance) [Figure: a tree whose nodes carry the sequences ACGTACT, AGTTCCT, AGGAGAA, CAGGAGTTTTAA]

Jukes-Cantor Model of Evolution (1): Edge Mutation Probability • Example: an edge from parent character A to child character X with mutation probability 0.6. • No insertion or deletion. • X = A with probability 1 - 0.6 = 0.4 • X = C, G, or T, each with probability 0.6/3 = 0.2
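The single-edge rule above can be sketched in a few lines of Python. This is only an illustration of the Jukes-Cantor substitution step, not code from the talk; the names `mutate_base` and `substitution_probabilities` and the fixed random seed are assumptions made for the example.

```python
import random

BASES = "ACGT"

def mutate_base(base, p, rng):
    """Sample the child character across one edge with mutation probability p."""
    if rng.random() < p:
        # A mutation occurs: pick one of the three other bases uniformly.
        return rng.choice([b for b in BASES if b != base])
    return base

def substitution_probabilities(base, p):
    """Exact distribution of the child character given the parent character."""
    return {b: (1 - p) if b == base else p / 3 for b in BASES}

# The slide's example: parent A, edge mutation probability 0.6.
probs = substitution_probabilities("A", 0.6)

# Empirical check with a seeded generator (seed chosen arbitrarily).
rng = random.Random(0)
samples = [mutate_base("A", 0.6, rng) for _ in range(100_000)]
freq_A = samples.count("A") / len(samples)
```

Here `probs` holds the exact probabilities from the slide (0.4 for A, 0.2 for each of C, G, T), and `freq_A` is a simulated estimate of the first of these.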

Jukes-Cantor Model of Evolution (2): Independent Mutations along All Edges [Figure: a tree with one character per node (A, A, C, G, G) and edge mutation probabilities 0.2, 0.6, 0.65, 0.7]

Jukes-Cantor Model of Evolution (3): i.i.d. Mutations at Every Character [Figure: the same tree with one sequence per node (AAGT, AGTT, CAGG, GGTG, GTTG) and edge mutation probabilities 0.2, 0.6, 0.65, 0.7]

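Problem Formulation • True tree (not known to the algorithm): pick any sequence for the root (also unknown to the algorithm), then generate the other sequences along the edges. • Input: the sequences at the leaves, but not the other sequences, nor the tree. • Output: the unrooted tree. [Figure: a true tree rooted at CAGGT with edge mutation probabilities 0.3, 0.2, 0.2, 0.5, 0.7, 0.6, 0.7, 0.1 and node sequences CGTTT, AGTGT, CGTGT, ATCGT, CAGGT, GTACT, GGTAC, TGGAC]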
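The generation process (pick a root sequence, then mutate every character independently along every edge) might be simulated as follows. This is a minimal sketch under the Jukes-Cantor model; the tree topology, edge probabilities, sequence length, and the helper names `mutate_sequence` and `evolve` are all hypothetical and are not taken from the slide's figure.

```python
import random

BASES = "ACGT"

def mutate_sequence(seq, p, rng):
    """Mutate each character independently with probability p (no indels)."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def evolve(tree, root, root_seq, rng):
    """Generate a sequence for every node of a rooted tree.

    tree maps each internal node to a list of (child, edge mutation probability).
    """
    seqs = {root: root_seq}
    stack = [root]
    while stack:
        node = stack.pop()
        for child, p in tree.get(node, []):
            seqs[child] = mutate_sequence(seqs[node], p, rng)
            stack.append(child)
    return seqs

# Hypothetical 4-leaf model tree; values are illustrative only.
tree = {
    "root": [("u", 0.2), ("v", 0.3)],
    "u": [("wheat", 0.1), ("rice", 0.1)],
    "v": [("peach", 0.2), ("plum", 0.2)],
}
rng = random.Random(1)
root_seq = "".join(rng.choice(BASES) for _ in range(200))
all_seqs = evolve(tree, "root", root_seq, rng)

# Only the leaf sequences would be handed to a reconstruction algorithm.
leaf_seqs = {name: s for name, s in all_seqs.items() if name not in tree}
```

The reconstruction algorithm sees only `leaf_seqs`; the root sequence, the internal sequences, and `tree` itself stay hidden.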

Computational Objectives • Input: DNA sequences • Output: the evolutionary tree • Minimize: running time; memory space; probability of incorrect output; sample size, i.e., length of the input sequences

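Triplets • A triplet is formed by three leaves. • P is the center of triplet XYZ, i.e., the internal node where the paths between X, Y, and Z meet. [Figure: leaves X, Y, Z joined at center P]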

G-depth of Triplet [Figure: triplet XYZ annotated with the # of edges between X and Y; values shown: 5, 8, 7]

G-depth of a Tree • Definition: the smallest d such that the triplets of g-depth at most d cover the entire tree. • Best case: g-depth = 4.

G-depth of a Tree • Definition: the smallest d such that the triplets of g-depth at most d cover the entire tree. • Worst case: g-depth = 2 log n.

G-depth of a Tree • Definition: the smallest d such that the triplets of g-depth at most d cover the entire tree. • at most 2 log n • can be O(1)

Our New Result (1)

Our New Result (2) • polynomial sample size

Our New Result (3) • polynomial sample size • provable high accuracy

Our New Result (4) • polynomial sample size • provable high accuracy • optimal time & space

Comparison with Previous Results this talk

Experimental Study Design • Step 1 -- Pick a model tree T. • Step 2 -- Use T to generate sequences. • Step 3 -- Use an algorithm to reconstruct a tree T’ from the sequences (without knowing T). • Step 4 -- Compare T’ and T.

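Wrong and Right Edges [Figure: a true tree and a reconstructed tree on leaves X1 to X5; an edge of the reconstructed tree is marked good if it also appears in the true tree and bad otherwise]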
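One common way to decide whether a reconstructed edge is right or wrong is to compare the leaf bipartitions (splits) that the internal edges induce. The sketch below assumes that interpretation; the nested-tuple tree encoding, the two 5-leaf example trees, and the function names are hypothetical and are not the figure from the slide.

```python
def leaf_set(tree):
    """All leaf names under a node; a leaf is a string, an internal node a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Nontrivial bipartitions induced by the internal edges of the unrooted tree."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)  # reference leaf used to pick one side of each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        # Normalise: always store the side that does not contain the reference leaf.
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def wrong_edge_fraction(true_tree, reconstructed):
    """Fraction of internal edges of the reconstructed tree absent from the true tree."""
    rec = splits(reconstructed)
    if not rec:
        return 0.0
    return len(rec - splits(true_tree)) / len(rec)

# Hypothetical example trees on five leaves.
true_tree = (("X1", "X2"), ("X3", ("X4", "X5")))
reconstructed = (("X1", "X3"), ("X2", ("X4", "X5")))
fraction = wrong_edge_fraction(true_tree, reconstructed)
```

In this example the reconstructed edge separating X4 and X5 from the rest is right, while the edge grouping X1 with X3 is wrong, so half of the internal edges are wrong.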

Experiment #1 • the 135-taxon African-Eve tree (courtesy of Huson and Maddison) • algorithms compared: HGT and bioNJ (Olivier Gascuel) • parameters: sequence length and percentage of wrong edges • edge mutation probabilities: between 0.088 and 0.47 • # of simulations = 20 per sequence length • more experiments in progress

135-taxon African Eve Tree

Results of Experiment #1

Experiment #2 • a 1892-taxon tree of eukaryotes • algorithms compared: HGT and bioNJ • parameters: sequence length and percentage of wrong edges • edge mutation probabilities: between 0.088 and 0.47 • # of simulations = 20 per sequence length • more experiments in progress • several variants of the basic HGT

Results of Experiment #2

Our New Result (4) • polynomial sample size • provable high accuracy • optimal time & space

Outline of Technical Discussion (5) • Describe the HGT algorithm. • Prove the sample size bound (and high probability for accuracy). • Prove the optimal time & space.

Outline of Technical Discussion (5/1) • Describe the HGT algorithm. • Prove the sample size bound (and high probability for accuracy). • Prove the optimal time & space.

Closeness and Distance of Two Leaves • Closeness is multiplicative. • Distance is additive! • The larger the closeness, the more accurately we can estimate the distance. [Figure: a tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG, edge mutation probabilities 0.2, 0.65, 0.7, and two leaves marked X and Y]
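The relationship between the two quantities can be checked numerically. The sketch below assumes the Jukes-Cantor model, where a path with end-to-end mutation probability p has closeness 1 - 4p/3, and it assumes that distance is defined as the negative logarithm of closeness (the usual way to turn a multiplicative quantity into an additive one; the talk's exact normalisation may differ). All function names and the two edge probabilities are illustrative.

```python
import math

def closeness(p):
    """Closeness of a path with end-to-end mutation probability p (Jukes-Cantor)."""
    return 1.0 - 4.0 * p / 3.0

def distance(p):
    """Assumed definition: distance = -ln(closeness); requires p < 0.75."""
    return -math.log(closeness(p))

def compose(p1, p2):
    """End-to-end mutation probability across two consecutive edges."""
    return p1 + p2 - 4.0 * p1 * p2 / 3.0

def estimated_distance(seq_x, seq_y):
    """Plug-in estimate from the fraction of sites where two leaf sequences differ."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must have equal length")
    diff = sum(a != b for a, b in zip(seq_x, seq_y)) / len(seq_x)
    return distance(diff)

# Two consecutive edges with hypothetical mutation probabilities.
p1, p2 = 0.2, 0.6
p12 = compose(p1, p2)

# Closeness multiplies along the path; distance adds.
closeness_gap = closeness(p12) - closeness(p1) * closeness(p2)
distance_gap = distance(p12) - (distance(p1) + distance(p2))
```

Both `closeness_gap` and `distance_gap` are zero up to rounding. When closeness is near zero, small errors in the observed difference fraction change the logarithm a lot, which is one way to read the slide's remark about estimation accuracy.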

Closeness = Cube Root of Determinant [Figure: a 4×4 matrix indexed by A, C, G, T for the sequence pair AAGT, CAGG]
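For the Jukes-Cantor model this definition can be verified directly, assuming the determinant in question is that of the 4×4 substitution matrix of an edge or path (the slide does not say which matrix is meant, so this is an assumption). The helpers below are written in plain Python for self-containment; their names are illustrative.

```python
def jc_matrix(p):
    """4x4 Jukes-Cantor substitution matrix for mutation probability p."""
    return [[1 - p if i == j else p / 3 for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def closeness_from_matrix(m):
    """Cube root of the determinant; assumes a positive determinant (p < 0.75)."""
    return determinant(m) ** (1.0 / 3.0)

edge = closeness_from_matrix(jc_matrix(0.6))
path = closeness_from_matrix(matmul(jc_matrix(0.2), jc_matrix(0.6)))
product = closeness_from_matrix(jc_matrix(0.2)) * edge
```

For mutation probability 0.6 the determinant is 0.2³, so `edge` equals 1 - 4(0.6)/3 = 0.2. Because the determinant of a matrix product is the product of the determinants, `path` and `product` agree, which is the multiplicativity of closeness.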

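Closeness of Triplet • The larger the closeness, the more accurately we can estimate the three pairwise distances. [Figure: a tree with node sequences AAGT, AGTT, CAGG, GGTG, GTTG, edge mutation probabilities 0.2, 0.65, 0.7, and three leaves marked X, Y, and Z]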

Assemble Triplets Into Tree via Distance Additivity (I) [Figure: node P with distances a, b, c, drawn with nodes X, A, Y; a numeric version of the same picture with values 6, 25, 3]
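Distance additivity is what lets a triplet's center be located from pairwise leaf distances alone: if distances add along paths, the distance from each leaf to the center follows from the three pairwise distances. The sketch below shows only that arithmetic, not the HGT assembly procedure itself; the function name and the example edge lengths are hypothetical.

```python
def center_distances(d_xy, d_xz, d_yz):
    """Distances from leaves X, Y, Z to the triplet center P.

    Relies on additivity: d(X,Y) = d(X,P) + d(P,Y), and likewise for the other pairs.
    """
    d_xp = (d_xy + d_xz - d_yz) / 2
    d_yp = (d_xy + d_yz - d_xz) / 2
    d_zp = (d_xz + d_yz - d_xy) / 2
    return d_xp, d_yp, d_zp

# Hypothetical star with d(X,P) = 2, d(Y,P) = 5, d(Z,P) = 4.
d_xy, d_xz, d_yz = 2 + 5, 2 + 4, 5 + 4
recovered = center_distances(d_xy, d_xz, d_yz)
```

With exact pairwise distances the three center distances are recovered exactly. With distances estimated from sequences they are only approximate, which is why triplets with large closeness are preferable.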
