1 / 24

Phylogenetic Trees

Phylogenetic Trees. Presenter: Michael Tung mtung@cs.stanford.edu. Overview. Common definitions Motivation Phylogenetic Inference Probabilistic model of evolution The EM framework Simultaneous Alignment and Phylogeny Improvements to SATCHMO. Def: Phylogenetic Tree.

kieu
Download Presentation

Phylogenetic Trees

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Phylogenetic Trees Presenter: Michael Tung mtung@cs.stanford.edu

  2. Overview • Common definitions • Motivation • Phylogenetic Inference • Probabilistic model of evolution • The EM framework • Simultaneous Alignment and Phylogeny • Improvements to SATCHMO

  3. Def: Phylogenetic Tree • N current day (species || sequences || letters) form the leaves of a (binary) tree • N-1 internal nodes correspond to events of divergence (~ past-time species) • A phylogenetic tree (T,t) is parameterized by a topology T (simply the set of edges) and a vector t (edge lengths)

  4. Motivation • Lots of sequence data! • Cost of collecting additional data is decreasing • Little understanding of evolutionary divergence • Inferring the Tree of Life (species tree)

  5. Probabilistic Model of Evolution Evolution of a single position • Standard Assumptions • Lack of Memory • Transitions can be described by a single matrix of conditionals • Reversibility • Assume a prior distribution over states

  6. What’s the probability of an observation given the tree? Probability of a complete assignment of nodes: But, we only see the leaves. Probabilistic Interpretation

  7. Probabilistic Interpretation • Notice that the tree topology enforces local probabilistic influence • A set of sequences are composed of the individual letters • Problem: MSA required? • We can perform inference on the graphical model by dynamic programming

  8. DP on trees • We can exploit the tree structure to compute these probabilities in linear time

  9. Maximum Likelihood • We want the most likely tree (in the probabilistic sense) • The most likely tree is the one that maximizes the probability of your data occurring • Assumption: each observation is independently drawn from this distribution (true?)

  10. Maximum Likelihood • Now, simply find the parameters (T,t) that maximize this likelihood. • Well, not that simple. • Iterative algorithm: Expectation Maximization(EM)

  11. EM • Expectation-Maximization is a framework for optimizing a model. • E-step: estimate the posterior probability of the missing data using the current model • M-step: maximize the expected log-likelihood using the posterior probabilities

  12. Expected Log-likelihood • Rewrite the likelihood Key Insight: the most likely tree is the Maximum Spanning Tree!

  13. Edge case:Transforming a tree into an equivalent bifurcating tree • Maximum spanning tree could return non-phylogenetic trees • Simple transformation preserves likelihood

  14. Avoiding local optima • Greedy optimization (hill-climbing) doesn’t guarantee a global optima • One solution is Simulated Annealing • Temperature parameter • Perturbed edge weights W

  15. Summary of the structural EM • Start with a “good” tree topology (perhaps NJ) • Optimize edge lengths • Optimize Tree • Make T a binary tree • Perturb weights • Iterate

  16. SATCHMO = Simultaneous Alignment and Tree Construction using Hidden Markov mOdels! {set of sequences} to {tree with MSA at each internal node} What's SATCHMO?

  17. ...its too slow. SATCHMO is computationally prohibitive. For ~200 seqs: ClustalW takes around 3 minutes SATCHMO takes 1 hour and 30 minutes Solution: Pre-cluster sequences to jumpstart the tree construction Parallelization However...

  18. In a binary tree the work increases exponentially near the leaves. If we can cut even a small # of levels, we have done almost all of the work. How? Use a computationally palatable cluster Build down using BETE Build up using SATCHMO Giving SATCHMO a jumpstart

  19. Idea: Let's utilize the whole cluster instead of just one CPU on one machine. Make interactive mode feasible. Computationand data are distributed across machines. What is the parallelization architecture to optimize complexity/latency/caching behavior/bandwidth/paging/load-balancing ? Parallel caveats: Non-determinism Fragility Data type Parallelizing SATCHMO

  20. An all-against-all computation takes place when we are computing the initial distance matrix. This is the most compute- intensive portion of SATCHMO. Each MSA needs to be scored against each HMM. Distribute the HMMs and MSAs. Rotate the MSAs. MSAs MSAs MSAs scores HMMs HMMs HMMs W1 M W2 W3 Parallelizing the all-against-all: v1

  21. Implementing blocked communications... Parallelizing the all-against-all: v2 MSAs MSAs MSAs scores HMMs HMMs HMMs W1 M W2 W3

  22. Implementing Load-balancing.. MSAs MSAs work request indices scores HMMs HMMs M W2 W1 Parallelizing the all-against-all: v3

  23. Combination of jumpstarting and parallelization can do 200 seqs in 29.90 seconds. v3 performs the best (as expected) More perf testing remains to be done Perf testing being conducted on the NERSC Seaborg. Speedup plot! Performance Results

More Related