SuperFine, Enabling Large-Scale Phylogenetic Estimation

Shel Swenson
University of Southern California and Georgia Institute of Technology

Phylogeny (evolutionary tree)
[Figure: evolutionary tree of Orangutan, Gorilla, Chimpanzee and Human]
“Nothing in Biology makes sense except in the light of evolution” – Dobzhansky
Images (1-3) from the Tree of Life website, University of Arizona

Tree of Life, Importance to Biology
• Biomedical applications
• Mechanisms of evolution
• Tracking ancient migrations
• Protein structure and function
• Drug design
[Figure: Tree of Life with a "We are here" marker]
Image sources: 1) Nature Reviews (Genetics) 2) Howard Hughes Medical Institute (BioInteractive) 3) 1000 Genomes Project

DNA sequence evolution (idealized)
-3 million yrs: AAGACTT
-2 million yrs: AAGGCCT, TGGACTT
-1 million yrs: AGGGCAT, TAGCCCT, AGCACTT
today: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT

Phylogeny Problem
[Figure: input sequences AGACTA, TGGACA, TGCGACT, AGGTCA, AGATTA for species U, V, W, X, Y; output is a tree on the leaves U, V, W, X, Y]

Two basic approaches for tree estimation on multi-gene datasets
• Apply phylogeny estimation methods to concatenated (“combined”) sequence alignments for different genes
• Compute trees on individual genes and apply a supertree method
This talk: SuperFine boosts supertree methods, enabling faster, more accurate estimation for large-scale problems

Using multiple genes
gene 1            gene 2            gene 3
S1 TCTAATGGAA     S4 GGTAACCCTC     S1 TATTGATACA
S2 GCTAAGGGAA     S5 GCTAAACCTC     S3 TCTTGATACC
S3 TCTAAGGGAA     S6 GGTGACCATC     S4 TAGTGATGCA
S4 TCTAACGGAA     S7 GCTAAACCTC     S7 TAGTGATGCA
S7 TCTAATGGAC                       S8 CATTCATACC
S8 TATAACGGAA

Concatenation
     gene 1      gene 2      gene 3
S1   TCTAATGGAA  ??????????  TATTGATACA
S2   GCTAAGGGAA  ??????????  ??????????
S3   TCTAAGGGAA  ??????????  TCTTGATACC
S4   TCTAACGGAA  GGTAACCCTC  TAGTGATGCA
S5   ??????????  GCTAAACCTC  ??????????
S6   ??????????  GGTGACCATC  ??????????
S7   TCTAATGGAC  GCTAAACCTC  TAGTGATGCA
S8   TATAACGGAA  ??????????  CATTCATACC
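The concatenated matrix on this slide can be assembled mechanically: every species gets one row, and a gene with no sequence for that species is padded with "?" to the gene's alignment length. The following is a minimal Python sketch of that idea; the function name `concatenate` and the dictionary layout are assumptions made for illustration, not something taken from the talk.

```python
def concatenate(gene_alignments, missing="?"):
    """Concatenate per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping species name -> aligned sequence.
    Species absent from a gene are padded with `missing` characters.
    (Hypothetical helper, written for illustration only.)
    """
    species = sorted({name for aln in gene_alignments for name in aln})
    rows = {name: [] for name in species}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        width = lengths.pop()
        for name in species:
            rows[name].append(aln.get(name, missing * width))
    return {name: "".join(parts) for name, parts in rows.items()}


# The three gene alignments shown on the "Using multiple genes" slide.
gene1 = {"S1": "TCTAATGGAA", "S2": "GCTAAGGGAA", "S3": "TCTAAGGGAA",
         "S4": "TCTAACGGAA", "S7": "TCTAATGGAC", "S8": "TATAACGGAA"}
gene2 = {"S4": "GGTAACCCTC", "S5": "GCTAAACCTC", "S6": "GGTGACCATC",
         "S7": "GCTAAACCTC"}
gene3 = {"S1": "TATTGATACA", "S3": "TCTTGATACC", "S4": "TAGTGATGCA",
         "S7": "TAGTGATGCA", "S8": "CATTCATACC"}

supermatrix = concatenate([gene1, gene2, gene3])
for name, row in supermatrix.items():
    print(name, row)
```

Only S4 and S7 have data for all three genes; every other row carries at least ten "?" characters, which is the missing-data problem that motivates supertree methods.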

Two competing approaches
[Figure: a species-by-gene dataset (gene 1, gene 2, ..., gene k) is either analyzed separately per gene and combined with a supertree method, or combined by concatenation]

Why use supertree methods?
• Missing data
• Large dataset sizes
• Incompatible data types (e.g., morphological features, biomolecular sequences, gene orders, even distances based upon biochemistry)
• Unavailable sequence data (only trees)

Many Supertree Methods
MRP, weighted MRP, Min-Cut, Modified Min-Cut, Semi-strict Supertree, MRF, MRD, QILI, SDM, Q-imputation, PhySIC, Majority-Rule Supertrees, Maximum Likelihood Supertrees, and many more ...
MRP: Matrix Representation with Parsimony (most commonly used and among the most accurate)

Quantifying Error
• FN: false negative (missing edge)
• FP: false positive (incorrect edge)
[Figure: a true tree and an estimated tree with FN and FP edges marked; 50% error rate]
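One way to make these error measures concrete is to represent each tree by its set of non-trivial bipartitions and compare the sets. The sketch below assumes the common convention of dividing FN by the number of internal edges in the true tree and FP by the number in the estimated tree; the slide does not state its denominators. The two five-leaf trees are invented for illustration and are not the trees pictured.

```python
def bip(side_a, side_b):
    """Build an orientation-free bipartition such as bip("ab", "cde")."""
    return frozenset([frozenset(side_a), frozenset(side_b)])


def error_rates(true_bips, est_bips):
    """Return (FN rate, FP rate) for an estimated tree.

    Both arguments are sets of bipartitions built with bip().
    Assumed normalisation: FN over true edges, FP over estimated edges.
    """
    false_neg = true_bips - est_bips   # true edges missing from the estimate
    false_pos = est_bips - true_bips   # estimated edges absent from the truth
    fn_rate = len(false_neg) / len(true_bips) if true_bips else 0.0
    fp_rate = len(false_pos) / len(est_bips) if est_bips else 0.0
    return fn_rate, fp_rate


# Hypothetical trees: ((a,b),c,(d,e)) as truth, ((a,b),d,(c,e)) as estimate.
true_tree = {bip("ab", "cde"), bip("abc", "de")}
estimate = {bip("ab", "cde"), bip("abd", "ce")}

fn, fp = error_rates(true_tree, estimate)
print(fn, fp)
```

Here one of the two true edges is missing and one of the two estimated edges is wrong, so both rates come out at one half.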

FN rate: MRP vs. Concatenation
[Figure: FN Rate (%) against Scaffold Density (%) for MRP and Concatenation]
Concatenation is not always an option
We need better supertree methods

FN Rate: SuperFine vs. MRP and Concatenation
[Figure: FN Rate (%) against Scaffold Density (%) for MRP, SuperFine and Concatenation]

Running Time: SuperFine vs. MRP (Concatenation is much slower)
[Figure: three panels of running time in minutes against Scaffold Density (%) for MRP and SuperFine; panel annotation: MRP 8-12 sec., SuperFine 2-3 sec.]

Idea behind SuperFine
• Construct a supertree with a low false positive rate
• Reduce false negatives by resolving areas of uncertainty using a supertree method (e.g., Quartet Max Cut)
(Swenson et al., Systematic Biology, 2011)

Bipartitions and refinement
Let B(T) denote the set of (non-trivial) bipartitions induced by the edges of T.
T refines T’ (T’ ≤ T) if B(T) ⊇ B(T’).
Example: B(T) = {ab|cdef, abc|def, abcd|ef} and B(T’) = {ab|cdef, abc|def}, so T refines T’.
[Figure: trees T and T’ on leaves a-f; T’ contains a polytomy, and refinement resolves it to give T]
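The definition can be checked directly on the slide's example. The sketch below stores a tree as nested Python tuples (an arbitrary rooting of the unrooted tree, chosen here only for convenience) and collects the non-trivial bipartitions; the function names are illustrative assumptions.

```python
def leaves(tree):
    """Leaf names below a node; a node is a leaf name or a tuple of children."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions induced by the edges of a nested-tuple tree."""
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        above = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) >= 2 and len(above) >= 2:
            found.add(frozenset([below, above]))
        for child in node:
            visit(child)

    visit(tree)
    return found


def refines(t, t_prime):
    """True if t refines t_prime, i.e. B(t) is a superset of B(t_prime)."""
    return bipartitions(t) >= bipartitions(t_prime)


# T has B(T) = {ab|cdef, abc|def, abcd|ef}; T_PRIME lacks abcd|ef.
T = (((("a", "b"), "c"), "d"), ("e", "f"))
T_PRIME = ((("a", "b"), "c"), "d", "e", "f")

print(refines(T, T_PRIME), refines(T_PRIME, T))
```

T refines T_PRIME but not the other way round, because T_PRIME has a polytomy where T has the edge abcd|ef.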

Idea behind SuperFine
• Construct a supertree with low FP using the Strict Consensus Merger (SCM) (Huson et al. 1999)
• Reduce FN by resolving each polytomy using a supertree method (e.g., Quartet Max Cut)

Strict Consensus Merger (SCM)
[Figure: two source trees, one on leaves a-g and one on leaves a, b, c, d, h, i, j, are merged into an SCM tree on leaves a-j]

Property of SCM: Bipartitions in the SCM tree correspond to bipartitions in the source trees
[Figure: the SCM tree on leaves a-j shown alongside the two source trees]
Swenson, Ph.D. Thesis, 2009

Performance of SCM
• Low false positive (FP) rate (estimated supertree has few false edges)
• High false negative (FN) rate (estimated supertree is missing many true edges)
• Runs in polynomial time (in the number of source trees and total number of species)

Idea behind SuperFine
• Construct a supertree with low FP using SCM
• Refine the tree to reduce FN by resolving each polytomy using a supertree method (e.g., MRP or Quartet Max Cut)

Resolving a single polytomy, v
• Step 1: Reduce each source tree to a tree on {1,2,...,d}, where d = degree(v)
• Step 2: Apply MRP to the collection of reduced trees to produce a tree t on leaf set {1,2,...,d}
• Step 3: Replace the star tree at v by tree t
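The labelling behind Step 1 can be sketched as follows. The reading assumed here is that removing the polytomy v splits the SCM tree into d components and every leaf is relabelled by the component it falls in; the slides do not spell out the procedure, so treat this as one plausible interpretation. The tree, node names and helper functions are hypothetical. Collapsing same-labelled leaves within a source tree, and Steps 2 and 3, are not shown.

```python
def component_labels(adjacency, v):
    """Label each leaf by the component it falls in once node v is removed.

    adjacency: dict mapping node -> list of neighbouring nodes (unrooted tree).
    Returns a dict leaf -> label in 1..d, where d = degree(v).
    """
    labels = {}
    for label, start in enumerate(adjacency[v], start=1):
        stack, seen = [start], {v, start}
        while stack:
            node = stack.pop()
            if len(adjacency[node]) == 1:      # degree-1 node, i.e. a leaf
                labels[node] = label
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return labels


def relabel_leaves(source_leaves, labels):
    """Map a source tree's leaf set onto the labels {1, ..., d}."""
    return {leaf: labels[leaf] for leaf in source_leaves}


# Hypothetical SCM tree: polytomy "v" of degree 4, internal nodes "u" and "w".
scm = {
    "v": ["a", "u", "w", "d"],
    "u": ["v", "b", "c"],
    "w": ["v", "e", "f"],
    "a": ["v"], "b": ["u"], "c": ["u"],
    "d": ["v"], "e": ["w"], "f": ["w"],
}

labels = component_labels(scm, "v")
print(labels)
print(relabel_leaves({"a", "b", "e"}, labels))
```

Because several leaves share a label, each reduced tree has at most d leaves, which fits the talk's point that the speed-up comes from re-encoding source trees as smaller trees.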

Back to Our Example
[Figure: the SCM tree on leaves a-j and the two source trees, with leaves relabeled 1-6 according to their position around the polytomy]

Where We Use the Property
[Figure: the source trees on leaves a-g and on leaves a, b, c, d, h, i, j shown against the SCM tree with labels 1-6]

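Step 1: Reduce each source tree to a tree on the set {1,2,...,d}
[Figure: the source tree on leaves a-g reduces to a tree on {1, 4, 5, 6}; the source tree on leaves a, b, c, d, h, i, j reduces to a tree on {1, 2, 3, 4}]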

Step 2: Apply MRP to the collection of reduced trees
[Figure: MRP applied to the reduced trees on {1, 4, 5, 6} and {1, 2, 3, 4} yields a tree on leaves 1-6]
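The "matrix representation" half of MRP can be illustrated with the standard encoding: each internal edge of each source tree becomes one column, with 1 for taxa on one side, 0 for taxa on the other side, and ? for taxa missing from that tree. The sketch below builds only this matrix; the parsimony search over it is left to an external program and is not shown. The function name and the example trees are assumptions for illustration and are not the reduced trees from the slide.

```python
def mrp_matrix(source_bipartitions, taxa):
    """Encode source trees as a 0/1/? character matrix (one row per taxon).

    source_bipartitions: one list per source tree; each bipartition is a
    pair (side_a, side_b) of leaf-name sets covering that tree's leaves.
    """
    rows = {taxon: [] for taxon in taxa}
    for tree in source_bipartitions:
        for side_a, side_b in tree:
            for taxon in taxa:
                if taxon in side_a:
                    rows[taxon].append("1")
                elif taxon in side_b:
                    rows[taxon].append("0")
                else:
                    rows[taxon].append("?")   # taxon absent from this tree
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


# Hypothetical source trees: ((a,b),c,(d,e)) and ((c,d),(e,f)).
tree_one = [({"a", "b"}, {"c", "d", "e"}), ({"a", "b", "c"}, {"d", "e"})]
tree_two = [({"c", "d"}, {"e", "f"})]

matrix = mrp_matrix([tree_one, tree_two], taxa="abcdef")
for taxon, row in matrix.items():
    print(taxon, row)
```

Taxon f is absent from the first tree and taxa a and b from the second, so their rows contain ? in the corresponding columns.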

Replace polytomy using tree from MRP
[Figure: the polytomy in the SCM tree is replaced by the MRP tree on leaves 1-6, giving a refined supertree on leaves a-j]

FN Rate: SuperFine vs. MRP and Concatenation
[Figure: FN Rate (%) against Scaffold Density (%) for MRP, SuperFine and Concatenation]

Running Time: SuperFine vs. MRP (Concatenation is much slower)
[Figure: three panels of running time in minutes against Scaffold Density (%) for MRP and SuperFine; panel annotation: MRP 8-12 sec., SuperFine 2-3 sec.]

SuperFine: Boosting supertree methods
• SuperFine+MRP vs. MRP (Swenson et al. 2011)
  • SuperFine combines the features of the SCM method (polynomial time, low false positive rates) with the lower false negative rate of MRP, to achieve greater accuracy in less time.
  • The speed-up results from the re-encoding of source trees as smaller trees.
• SuperFine+QMC vs. QMC (quartet-based)
  • QMC (Snir 2008): polynomial time, but infeasible for 500+ taxa
  • SuperFine+QMC runs where QMC cannot (Swenson et al. 2010)
• SuperFine+MRL vs. MRL (likelihood) (Nguyen et al. 2012)
  • SuperFine+MRL: faster and more accurate, with similar likelihood scores
DACTAL (Nelesen et al. 2012): boosting concatenation methods; uses SuperFine in its divide-and-conquer strategy
