Phylogenetic Trees

Phylogenetic Trees Presenter: Michael Tung mtung@cs.stanford.edu

Overview • Common definitions • Motivation • Phylogenetic Inference • Probabilistic model of evolution • The EM framework • Simultaneous Alignment and Phylogeny • Improvements to SATCHMO

Def: Phylogenetic Tree • N current day (species || sequences || letters) form the leaves of a (binary) tree • N-1 internal nodes correspond to events of divergence (~ past-time species) • A phylogenetic tree (T,t) is parameterized by a topology T (simply the set of edges) and a vector t (edge lengths)

Motivation • Lots of sequence data! • Cost of collecting additional data is decreasing • Little understanding of evolutionary divergence • Inferring the Tree of Life (species tree)

Probabilistic Model of Evolution Evolution of a single position • Standard Assumptions • Lack of Memory • Transitions can be described by a single matrix of conditionals • Reversibility • Assume a prior distribution over states

What’s the probability of an observation given the tree? Probability of a complete assignment of nodes: But, we only see the leaves. Probabilistic Interpretation

Probabilistic Interpretation • Notice that the tree topology enforces local probabilistic influence • A set of sequences are composed of the individual letters • Problem: MSA required? • We can perform inference on the graphical model by dynamic programming

DP on trees • We can exploit the tree structure to compute these probabilities in linear time

Maximum Likelihood • We want the most likely tree (in the probabilistic sense) • The most likely tree is the one that maximizes the probability of your data occurring • Assumption: each observation is independently drawn from this distribution (true?)

Maximum Likelihood • Now, simply find the parameters (T,t) that maximize this likelihood. • Well, not that simple. • Iterative algorithm: Expectation Maximization(EM)

EM • Expectation-Maximization is a framework for optimizing a model. • E-step: estimate the posterior probability of the missing data using the current model • M-step: maximize the expected log-likelihood using the posterior probabilities

Expected Log-likelihood • Rewrite the likelihood Key Insight: the most likely tree is the Maximum Spanning Tree!

Edge case:Transforming a tree into an equivalent bifurcating tree • Maximum spanning tree could return non-phylogenetic trees • Simple transformation preserves likelihood

Avoiding local optima • Greedy optimization (hill-climbing) doesn’t guarantee a global optima • One solution is Simulated Annealing • Temperature parameter • Perturbed edge weights W

Summary of the structural EM • Start with a “good” tree topology (perhaps NJ) • Optimize edge lengths • Optimize Tree • Make T a binary tree • Perturb weights • Iterate

SATCHMO = Simultaneous Alignment and Tree Construction using Hidden Markov mOdels! {set of sequences} to {tree with MSA at each internal node} What's SATCHMO?

...its too slow. SATCHMO is computationally prohibitive. For ~200 seqs: ClustalW takes around 3 minutes SATCHMO takes 1 hour and 30 minutes Solution: Pre-cluster sequences to jumpstart the tree construction Parallelization However...

In a binary tree the work increases exponentially near the leaves. If we can cut even a small # of levels, we have done almost all of the work. How? Use a computationally palatable cluster Build down using BETE Build up using SATCHMO Giving SATCHMO a jumpstart

Idea: Let's utilize the whole cluster instead of just one CPU on one machine. Make interactive mode feasible. Computationand data are distributed across machines. What is the parallelization architecture to optimize complexity/latency/caching behavior/bandwidth/paging/load-balancing ? Parallel caveats: Non-determinism Fragility Data type Parallelizing SATCHMO

An all-against-all computation takes place when we are computing the initial distance matrix. This is the most compute- intensive portion of SATCHMO. Each MSA needs to be scored against each HMM. Distribute the HMMs and MSAs. Rotate the MSAs. MSAs MSAs MSAs scores HMMs HMMs HMMs W1 M W2 W3 Parallelizing the all-against-all: v1

Implementing blocked communications... Parallelizing the all-against-all: v2 MSAs MSAs MSAs scores HMMs HMMs HMMs W1 M W2 W3

Implementing Load-balancing.. MSAs MSAs work request indices scores HMMs HMMs M W2 W1 Parallelizing the all-against-all: v3

Combination of jumpstarting and parallelization can do 200 seqs in 29.90 seconds. v3 performs the best (as expected) More perf testing remains to be done Perf testing being conducted on the NERSC Seaborg. Speedup plot! Performance Results

Phylogenetic Trees