Presentation Transcript


  1. Parallel & Distributed Systems and Algorithms for Inference of Large Phylogenetic Trees with Maximum Likelihood Alexandros Stamatakis LRR TU München Contact: stamatak@cs.tum.edu

  2. Outline • Motivation • Introduction to phylogenetic tree inference • Statistical inference methods • Maximum Likelihood & associated problems • Solutions: • 2 simple heuristics • parallel & distributed implementation • Results • Conclusion • Availability & Future Work Alexandros Stamatakis: Phylogenetic Inference with RAxML2

  3-4. Motivation: Towards a „Tree of Life“ • 30,000 organisms available, current trees <= 1000 • Where we are: [figure] • Where we want to get: [figure]

  5. Phylogenetic Tree Inference • Input: „good“ multiple alignment of a distinguished, highly conserved part of DNA sequences • Output: unrooted binary tree with the sequences at its leaves (all nodes: either degree 1 or 3) • Various methods for phylogenetic tree inference • Differ in computational complexity and quality of trees • Most accurate methods: Maximum Likelihood Method (ML) and Bayesian Phylogenetic Inference: + most sound and flexible methods + other methods not suited for large/complex trees -- most computationally intensive methods

  6. ML and Bayesian methods • T. Williams et al. (March 2003): comparative analysis with simulated data shows MrBayes is the best program • Guindon et al. (May 2003): PHYML is a very fast & accurate ML program for real & simulated data, faster than MrBayes • ML (PHYML, RAxML2): + Significantly faster than MrBayes + Reference/starting trees for Bayesian methods -- Less powerful statistical model • Bayesian Inference (MrBayes): + Powerful statistical model -- MCMC convergence problem • Memory requirements for 1000/10000-taxon alignment: • RAxML: 200MB/750MB • PHYML: 900MB/8.8GB • MrBayes: 1150MB/unknown

  7. MCMC Convergence Problem

  8. What does ML compute? • Maximum Likelihood calculates: • Topologies • Branch lengths v[i] • Likelihood of the tree [Figure: unrooted tree with leaves S1-S5 and branch lengths v1-v7] Goal: Find the tree topology which maximizes the likelihood Problem I: Number of possible topologies is exponential in n Problem II: Computation of likelihood value + branch length optimization is expensive Solution: Algorithmic Optimizations (previous work) + New heuristics + HPC
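
To make Problem I concrete: the number of distinct unrooted binary topologies on n labeled taxa is the double factorial (2n-5)!!, and each such tree has 2n-3 branches (7 for the 5-leaf tree on this slide). The short Python sketch below computes both; the function names are illustrative only and are not taken from RAxML.

```python
def num_unrooted_topologies(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies on n labeled taxa: (2n-5)!!."""
    if n_taxa < 3:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def num_branches(n_taxa: int) -> int:
    """Number of branches in an unrooted binary tree with n leaves: 2n-3."""
    return 2 * n_taxa - 3


if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, num_branches(n), num_unrooted_topologies(n))
```

Already at 10 taxa there are more than two million topologies, which is why exhaustive search is not an option for the tree sizes discussed in this talk.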

  9. New Heuristics for RAxML • Two common methods to build a tree: • Progressive addition of organisms e.g. stepwise addition algorithm • Use a (random, simple) starting tree containing all organisms and optimize likelihood by application of topological changes • RAxML (Randomized Axelerated Maximum Likelihood) computes parsimony starting tree with dnapars -> fast and relatively „good“ initial likelihood • dnapars uses stepwise addition -> randomized sequence input order to obtain distinct starting trees • Optimize starting tree by application of rearrangements • Accelerate rearrangements by two simple ideas
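
The randomized input order can be illustrated with a small Python sketch. It does not call dnapars; it only generates distinct taxon orders and counts how many insertion positions a plain stepwise addition would try (a tree with k-1 leaves has 2(k-1)-3 branches into which taxon k can be inserted). All names here are hypothetical.

```python
import random


def randomized_orders(taxa, n_orders, seed=0):
    """Return n_orders shuffled copies of the taxon list (one per starting tree)."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_orders):
        order = list(taxa)
        rng.shuffle(order)
        orders.append(order)
    return orders


def stepwise_insertions(n_taxa):
    """Insertion positions tried when taxa 4..n are added one at a time."""
    return sum(2 * (k - 1) - 3 for k in range(4, n_taxa + 1))


if __name__ == "__main__":
    for order in randomized_orders(["S1", "S2", "S3", "S4", "S5"], 3):
        print(order)
    print(stepwise_insertions(1000))
```

The number of insertions grows only quadratically in the number of taxa, which is consistent with the slide's point that a parsimony starting tree is fast to obtain.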

  10-19. Subtree Rearrangements [Figure sequence: tree with subtrees ST1-ST6; a subtree is moved to branches at distance +1 and +2 from its original position] • Optimize all branches • Need to optimize all branches?
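
A minimal Python sketch of how the candidate insertion branches for one pruned subtree could be enumerated. It assumes that the "+1" and "+2" labels in the figures denote the distance, counted in branches, from the branch where the subtree was removed; the adjacency-dict representation and the function name are illustrative assumptions, not RAxML's data structures.

```python
from collections import deque


def candidate_branches(adj, joined_edge, max_dist):
    """Map each branch within max_dist of joined_edge to its distance.

    adj: dict node -> set of neighbor nodes (tree after pruning the subtree).
    joined_edge: the branch left behind where the subtree was removed.
    """
    start = frozenset(joined_edge)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        edge = queue.popleft()
        if dist[edge] == max_dist:
            continue  # do not expand beyond the rearrangement setting
        for node in edge:
            for neighbor in adj[node]:
                nxt = frozenset((node, neighbor))
                if nxt not in dist:
                    dist[nxt] = dist[edge] + 1
                    queue.append(nxt)
    del dist[start]  # re-inserting at the original position changes nothing
    return dist


if __name__ == "__main__":
    # Remaining tree after pruning: leaves A-E, inner nodes x, y, z.
    edges = [("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"),
             ("y", "z"), ("D", "z"), ("E", "z")]
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for branch, d in candidate_branches(adj, ("x", "y"), 2).items():
        print(sorted(branch), "+%d" % d)
```

A larger `max_dist` corresponds to a higher rearrangement setting: more topologies are examined per subtree, at a higher cost per step.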

  20-21. Idea 1: Local Optimization of Branch Length [Figure: tree with subtrees ST1-ST6]

  22. Why is Idea 1 useful? • Local optimization of branch lengths: • Update fewer likelihood vectors -> significantly faster • Allows higher rearrangement settings -> better trees • Likelihood depends strongly on topology • Fast exploration of large number of topologies • Straight-forward parallelization • Store best 20 trees from each rearrangement step • Branch length optimization of best 20 trees only • Experimental results justify this mechanism
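
The "store best 20 trees" mechanism can be sketched in Python as a bounded list backed by a min-heap: every rearranged topology is scored cheaply (local branch length optimization only), and just the 20 survivors would later receive the expensive full branch length optimization. The class name and the string topologies are placeholders, not RAxML code.

```python
import heapq


class BestTreeList:
    """Keep the `capacity` topologies with the highest log likelihood."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._heap = []  # min-heap: the worst retained tree is at index 0

    def offer(self, log_likelihood, topology):
        item = (log_likelihood, topology)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # Better than the current worst entry: replace it.
            heapq.heapreplace(self._heap, item)

    def best_first(self):
        return sorted(self._heap, reverse=True)


if __name__ == "__main__":
    best = BestTreeList(capacity=20)
    for i in range(100):
        # Placeholder scores; in practice these come from the fast local scoring.
        best.offer(-1000.0 - i, "topology_%d" % i)
    print(best.best_first()[:3])
```

Each `offer` costs O(log 20), so maintaining the list is negligible compared with a single likelihood evaluation.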

  23-26. Idea 2: Subsequent Application of Topological Changes [Figure sequence: tree with subtrees ST1-ST6; successive rearranged topologies]

  27. Why is Idea 2 useful? • During the initial 5-10 rearrangement steps many improved topologies are encountered • Acceleration of likelihood improvement in the initial optimization phase • Enables fast optimization of random starting trees
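
A toy Python sketch of the difference Idea 2 makes, under the assumption that "subsequent application" means an improved topology is adopted immediately and the remaining subtrees of the same step are rearranged on it. The "tree" here is just a tuple of integers and the score is a stand-in for the log likelihood; only the control flow is meant to carry over.

```python
def rearrangement_step(tree, subtrees, moves, score, subsequent=True):
    """One rearrangement step over all subtrees.

    subsequent=True: later subtrees are rearranged on the best tree so far.
    subsequent=False: all candidates are derived from the unchanged input tree.
    """
    best, best_score = tree, score(tree)
    for subtree in subtrees:
        source = best if subsequent else tree
        for candidate in moves(source, subtree):
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
    return best, best_score


def toy_moves(tree, i):
    """Hypothetical move set: change position i by +1 or -1."""
    for delta in (1, -1):
        yield tree[:i] + (tree[i] + delta,) + tree[i + 1:]


def toy_score(tree):
    """Hypothetical score: higher is better, optimum at all zeros."""
    return -sum(abs(x) for x in tree)


if __name__ == "__main__":
    start = (2, 2, 2)
    print(rearrangement_step(start, range(3), toy_moves, toy_score, subsequent=False))
    print(rearrangement_step(start, range(3), toy_moves, toy_score, subsequent=True))
```

In the toy example one step with immediate application collects an improvement from every subtree, whereas the conventional step keeps only the single best move.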

  28. Remainder of this Talk • Motivation • Introduction to phylogenetic tree inference • Statistical inference methods • Maximum Likelihood & associated problems • Solutions: • 2 simple heuristics • parallel & distributed implementation • Results • Conclusion • Availability & Future Work

  29-31. Basic Parallel & Distributed Algorithm • Basic idea: Distribute work by subtrees instead of topologies (e.g. parallel fastDNAml) • Simple Master-Worker architecture • Subsequent application of topological changes introduces non-determinism [Figure sequence: tree with subtrees ST1-ST6; the master issues MPI_Send(ST3_ID, tree) and MPI_Send(ST2_ID, tree) to workers]
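
The master-worker scheme can be sketched in Python with a thread pool standing in for MPI: the master hands each worker a (subtree ID, tree) job, analogous to the MPI_Send(ST3_ID, tree) calls in the figures, and collects one scored topology per job. `rearrange_and_score` is a hypothetical callback supplied by the caller; this is not the RAxML implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def master(tree, subtree_ids, rearrange_and_score, n_workers=4):
    """Distribute one job per subtree and return the best (lnL, topology)."""
    jobs = [(subtree_id, tree) for subtree_id in subtree_ids]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker rearranges one subtree of the same input tree.
        results = list(pool.map(lambda job: rearrange_and_score(*job), jobs))
    return max(results)


def toy_worker(subtree_id, tree):
    """Placeholder worker: pretends that moving ST3 gives the best likelihood."""
    return (-abs(subtree_id - 3), "%s+ST%d" % (tree, subtree_id))


if __name__ == "__main__":
    print(master("T0", range(1, 7), toy_worker))
```

This sketch is deterministic because every worker starts from the same tree. Once improved topologies are applied as soon as a worker reports them (Idea 2), the outcome depends on the order in which workers finish, which is the non-determinism the slide refers to.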

  32. Differences between Parallel & Distributed Algorithm • Parallel: best-tree list of size max(20, #workers) maintained and merged at the master • Parallel: Master distributes the max(20, #workers) best trees as topology strings to workers for branch length optimization • Distributed: Each worker maintains a local best list of 20 trees • Distributed: Worker performs fast branch length optimizations locally on all 20 trees -> returns only the best topology to the master
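
The two bookkeeping schemes can be contrasted in a few lines of Python. Trees are represented as (log likelihood, topology string) pairs, and `optimize` is a hypothetical stand-in for branch length optimization; the function names are illustrative.

```python
import heapq


def merge_at_master(worker_lists, n_workers, base=20):
    """Parallel variant: merge all worker results, keep max(20, #workers) trees."""
    merged = [tree for trees in worker_lists for tree in trees]
    return heapq.nlargest(max(base, n_workers), merged)


def distributed_reply(local_trees, optimize, keep=20):
    """Distributed variant: optimize the local best 20, return only the winner."""
    local_best = heapq.nlargest(keep, local_trees)
    return max(optimize(tree) for tree in local_best)


if __name__ == "__main__":
    lists = [[(-100.0 - w - 10 * i, "w%d_t%d" % (w, i)) for i in range(30)]
             for w in range(4)]
    print(len(merge_at_master(lists, n_workers=4)))
    # Toy optimization: improves every log likelihood by a fixed amount.
    print(distributed_reply(lists[0], lambda t: (t[0] + 1.0, t[1])))
```

The distributed variant sends a single topology back per worker, which keeps communication low when workers sit behind slow links.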

  33. Sequential Results • 50 distinct simulated 100-taxon alignments • Measured average execution times & topological distance (RF-rate) from „true“ tree • PHYML: 35.21 seconds, RF-rate: 0.0796 • MrBayes: 945.32 seconds, RF-rate: 0.0741 • RAxML: 29.27 seconds, RF-rate: 0.0818 • 9 distinct real alignments containing 101-1000 taxa • Measured execution times & final likelihood values • RAxML yields best-known likelihood for all data sets • RAxML faster than PHYML & MrBayes
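
For readers unfamiliar with the RF-rate: the Robinson-Foulds distance counts the bipartitions (splits) that occur in exactly one of two trees. The Python sketch below computes it for trees written as nested tuples and normalizes by 2(n-3), the maximum for unrooted binary trees on n taxa. That this is the exact normalization behind the numbers above is an assumption; the tuple representation and function names are illustrative.

```python
def _collect_clades(tree, out):
    """Return the leaf set below `tree`, appending inner-node leaf sets to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(_collect_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions, each stored as the side without a reference taxon."""
    clades = []
    all_leaves = _collect_clades(tree, clades)
    reference = min(all_leaves)
    result = set()
    for clade in clades:
        side = all_leaves - clade if reference in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result, all_leaves


def rf_rate(tree_a, tree_b):
    """Normalized Robinson-Foulds distance between two unrooted binary trees."""
    splits_a, leaves_a = splits(tree_a)
    splits_b, leaves_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same taxa")
    return len(splits_a ^ splits_b) / (2 * (len(leaves_a) - 3))


if __name__ == "__main__":
    true_tree = (("A", "B"), "C", ("D", "E"))
    inferred = (("A", "C"), "B", ("D", "E"))
    print(rf_rate(true_tree, inferred))
```

A rate of 0 means the inferred topology matches the reference tree exactly; the values around 0.08 reported on this slide would then correspond to roughly 8% of the splits differing.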

  34-38. Sequential Results: Real Data [figures not included in transcript]

  39. Parallel Results: Speedup 1000_ARB

  40. Distributed Results: First Tests • Platforms: • Infiniband-Cluster: 10 Intel Xeon 2.4 GHz • Sunhalle: 50 Sun-Workstations for CS students • Alignments: • 1000_ARB • 2025_ARB • Larger trees to come • Results: • Program executed correctly & terminated • RAxML@home yielded best-known tree for 2025_ARB

  41. Biological Results: 1st ML 10,000-taxon tree • Calculated 5 parsimony starting trees + 3-4 initial rearrangement steps sequentially on Xeon 2.4GHz • Further rearrangements of those 5 trees in parallel on 32 or 64 Xeon 2.66GHz at RRZE • Accumulated CPU hours/tree ~3200 hours • Best ln likelihood: -949539, worst: -950026 • Problems: • Quality assessment? Bootstrap not feasible • Consense crashes for > 5 trees • MrBayes/PHYML crash on 32-bit/4GB • MrBayes crashed on Itanium • Visualization?

  43. Conclusion • RAxML not able to handle protein data • RAxML not able to perform model parameter optimization • BUT: • RAxML easy to parallelize/distribute • Accurate & fast for large trees • Significantly lower memory requirements than MrBayes/PHYML • Conclusion: Implement model parameter optimization & protein data in RAxML

  44. Availability & Future Work • Further development & distribution of RAxML@home • Big production runs with RAxML@home • Survey: ML supertrees vs. integral trees • Alignment split-up methods for ML supertrees • RAxML implementation on GPUs • RAxML2 download, benchmark, code: wwwbode.in.tum.de/~stamatak • RAxML@home development: www.sourceforge.com/projects/axml
