590 likes | 769 Views
Parallel Bayesian Phylogenetic Inference. Xizhou Feng Directed by Dr. Duncan Buell Department of Computer Science and Engineering University of South Carolina, Columbia 2002-10-11. Topics. Background Bayesian phylogenetic inference and MCMC Serial implementation Parallel implementation
E N D
Parallel Bayesian Phylogenetic Inference Xizhou Feng Directed by Dr. Duncan Buell Department of Computer Science and Engineering University of South Carolina, Columbia 2002-10-11
Topics • Background • Bayesian phylogenetic inference and MCMC • Serial implementation • Parallel implementation • Numerical result • Future research
Darwin: Species are related through a history of common descent, and The history can be organized as a tree structure (phylogeny). Modern species are put on the leaf nodes Ancient species are put on the internal nodes The time of the divergence is described by the length of the branches. A clade is a group of organisms whose members share homologous features derived from a common ancestor. Background
Phylogenetic tree Internal node Ancestral species Clade Leaf node Current species Branch length Branch
Applications • Phylogenetic tree is • fundamental to understand evolution and diversity • Principle to organize biological data • Central to organism comparison • Practical examples • Resolve quarrel over bacteria-to-human gene transfers (Nature 2001) • Tracing route of infectious disease transmission • Identify new pathogens • Phylogenetic distribution of biochemical pathways
Objectives of phylogenetic inference output input Major objectives include: • Estimate the tree topology • Estimate the branch Length, and • Describe the credibility of the result
Phylogenetic inference methods • Algorithmic methods • Defining a sequence of specific steps that lead to the determination of a tree • e.g. UPGMA (unweighted pair group method using arithmetic average) and Neighbor-Joining • Optimality criterion-based methods • 1) Define a criterion; 2)Search the tree with best values • Maximum parsimony (minimize the total tree length) • Maximum likelihood (maximize the likelihood) • Maximum posterior probability (the tree with the highest probability to be the true tree)
Common Used phylogeny methods Algorithm Statistical Supported Search Strategy Bayesian Methods Stochastic search Maximum Likelihood Optimization method Divide & Conquer Maximum Parsimony Greedy search GA, SA MCMC Exact search Fitch-Margolish DCM, HGT, Quartet Algorithmic method Stepwise addition Global arrangement Star decomposition Exhaustive Branch & Bound Neighbor-join UPGMA Distance matrix Data set Character data
Aspects of phylogenetic methods • Robustness • If the model or assumption or the data is not exact correct, how about the result? • Convergence rate • How long a sequence is needed to recover the true tree? • Statistical support • With what probability is the computed tree the true tree? • Accuracy • Is the constructed tree a true tree? • If not, the percentage of the wrong edges? • Complexity • Neighbor-Join O(n3) • Maximum Parsimony (provably NP hard) • Maximum Likelihood (conjectured NP hard) • Scalability • Good for small tree, how about large tree
The computational challenge >1.7 million known species Number of trees increase exponentially as new species was added The complex of evolution Data Collection & Computational system Compute the tree of life Source: http://www.npaci.edu/envision/v16.3/hillis.html
Topics • Background • Bayesian phylogenetic inference and MCMC • Serial implementation • Parallel implementation • Numerical result • Future research
Bayesian Inference-1 • Both observed data and parameters of models are random variables • Setting up the joint distribution Topology Branch length Parameter of models When data D is known, Bayes theory gives: likelihood Prior probability Posterior probability Unconditional Probability of data
Bayesian Inference-2 • P(T|D) can be interpreted as the probability of the tree is correct • We need to do at least two things: • Approximate the posterior probability distribution • Evaluate the integral for P(T|D) • These can be done via Markov Chain Monte Carlo Method Having the posterior probability distribution , we can compute the marginal probability of T as:
Markov chain Monte Carlo (MCMC) • The basic idea of MCMC is: • To construct a Markov chain such that: • Have the parameters as the state space, and • the stationary distribution is the posterior probability distribution of the parameters • Simulate the chain • Treat the realization as a sample from the posterior probability distribution • MCMC = sampling + continue search
Markov chain • A Markov chain is a sequence of random variables {X0, X1, X2, …} whose transition kernel T(Xt, Xt+1) is determined only by the value of Xt (t>0). • Stationary distribution: (x)=Σx’((x’)T(x,x’)) is invariant • Ergodic property: pn(x) converges to (x) as n • A homogeneous Markov chain is ergodic if min(T(x,x’)/ (x’)>0
Metropolis-Hasting algorithm-1 • Cornerstone of all MCMC methods, Metropolis(1953) • Hasting proposed a generalized version in (1970) • The key point is to how to define the accepted probability: • Metropolis: • Hasting: Proposal probability Can be any form Such that
Metropolis-Hasting algorithm-2 • Initialize x0, set t=0 • Repeat : • Sample x’ from T(xt,x’) • Draw U~uniform[0,1] • Update
Problems of MH Algorithm & Improvement • Problems: • Mixing rate is slow when: • Small step->low movement • Larger step->low acceptance • Stopped at local optima • Dimension of state space may vary • Improvement: • Metropolis-coupled MCMC • Multipoint MCMC • Population-based MCMC • Time-reversible jump MCMC
Metropolis-coupled MCMC (Geyer 1991) • Run several MCMC chains with different distribution i(x) (i=1..m) in parallel • 1(x) is used to sampling • i(x) (i=2..m) are used to improve mixing • For example: i(x) = (x)1/(1+(I-1)) • After each iteration, attempt to swap the states of two chains using a Metropolis-Hasting step with acceptance probability of
Illustration of Metropolis-coupled MCMC 1(x)T1=0 2(x) T2=2 3(x) T3=4 4(x) T4=8 Metropolis-coupled MCMC is also called Parallel tempering
Sample from T(y, .) y1 x1* y2 x2* xt y xt+1 y3 x3* y4 x4* Choose y= yi Accept y or keep xt using a M-H step Sample from T(xt, .) Multiple Try Metropolis (Liu et al 2000)
Population-based MCMC • Metropolis-coupled MCMC uses a minimal interaction between multiple chains, why not more active interaction • Evolutionary Monte Carlo (Liang et al 2000) • Combine Genetic Algorithm with MCMC • Used to Simulate protein folding • Conjugate Gradient Monte Carlo (Liu et al 2000) • Use local optimization for adaptation • An improvement of ADS (Adaptive Direction Sampling)
Topics • Background • Bayesian phylogenetic inference and MCMC • Serial implementation • Choose the evolutionary model • Compute the likelihood • Design proposal mechanisms • Parallel implementation • Numerical result • Future research
A G C T DNA substitution rate matrix Purine transition transversion Pyrimidine • Consider inference of un-rooted tree and the computational complications, some simplified models are used (see next slide)
GTR-family of substitution models GTR • GTR: general time-reversible model, corresponding to a symmetric rate matrix. Three substitution type (1 transversion, 2 transition) Equal base frequencies TN93 SYM Two substitution types (transition v.s. tranversion) Three substitution type (1 transversion, 2 transition) HKY85 F84 K3ST Equal base frequencies Two substitution types (transition v.s. tranversion) Single substitution type K2P F81 Equal base frequencies Single substitution type JC69
More complex models Substitute rates vary across sites Invariable sites models + gamma distribution Correlation in the rates at adjacent sites Codon models 61X61 instantaneous rate matrix Secondary structure models
a t b Compute conditional probability of branch • Given substitution rate matrix, how to compute p(b|a,t)-the probability of a is substituted by b after time t Eigenvalue of Q
x5 t4 x4 t3 t2 t1 x2 x1 x3 Likelihood of a phylogeny tree for one site When x4 x5 are known, When x4 x5 are unknown,
Likelihood calculation (Felsenstein 1981) Given a rooted tree with n leaf nodes (species) , and each leaf node is represented by a sequence xi with length N, the likelihood of a rooted tree is represented as:
Likelihood calculation-2 • Felsenstein’s algorithms for likelihood calculation(1981) • Initiation: • Set k=2n-1 • Recursion: Compute for all a as follows • If k is a leaf node: • Set if ; • Set if . • If k is not a leaf node: • compute for all a its children nodes i, j. • And set • Termination: Likelihood at site u is Note: algorithm modified from Durbin et al (1998)
rate 1 rate 2 rate r A 1.0 1.0 1.0 C 0.0 0.0 0.0 G 0.0 0.0 0.0 T 0.0 0.0 0.0 Likelihood calculation-3 Site 1 Site 2 Site m-1 Site m-2 Taxa-1 Taxa-2 Taxa-3 Taxa-n • The likelihood calculation requires filling an N X M X S X R table • N: number of sequences M: number of sites • S: number of state of characters R: number of rate categories
Local update Likelihood • If we just change the topology and branch length of tree locally, we only need refresh the table at those affected nodes. • In the following example, only the nodes with red color need to change their conditional likelihood. Original tree Proposed tree
(2) Change m and y randomly (1)Choose a backbone c c v d m* m y u c u c x x* v d y* a a Proposal mechanism for trees • Stochastic Nearest-neighbor interchange (NNI) • Larget et al (1999) Huelsenbeck (2000)
k+ k- Min k k* Max Proposal mechanisms for parameters Larget et al. (1999) • Independent parameter • e.g. transition/tranversion ratio k • A set of parameters constrained to sum to a constant • e.g. base frequency distribution • Draw a sample from the Dirichlet distribution
Bayesian phylogenetic inference Prior probability DNA Data Likelihood Evolutionary model Phylogenetic tree Posterior prob. Proposal Starting tree inference A sequence of Samples MCMC Approximate the distribution
Topics • Background • Bayesian phylogenetic inference and MCMC • Serial implementation • Parallel implementation • Challenges of serial computation • Difficulty: • MCMC is a serial algorithm • Multiple chains need to be synchronized • Choose appropriate grid topology • Synchronize using random number • Numerical result • Future research
Computational challenge • Computing global likelihood needs O(NMRS2) multiplications • Local updating topology & branch length needs O(MRS2log(N)) • Updating model parameter needs O(NMRS2) • local update needs all required data in memory • Given N=1000 species, each sequence has M=5000 sites, rate category R=5, and DNA nucleotide model S=4 • Run 5 chains each with length of 100 million generations • Needs ~400 days (assume 1% global updates, 99% local update) • And O(NMRSLX2X2X8)~32Gigabyte memory • =>So until more advanced algorithms are developed, parallel computation is the direct solution. • Use 32 processor with 1 gigabyte memory, we can compute the problem in ~2 weeks
Characteristic of good parallel algorithms • Balancing workload • Concurrency identify, manage, and granularity • Reducing communication • Communication-to-computation ratio • Frequency, volume, balance • Reducing extra work • Computing assignment • Redundant work
Generate initial state S0, S(t)= S(0)=, t=0 Propose new state S’ Evaluate S’ Compute R and U No U <R? Yes S(t+1)=S’ S(t+1)=S(t) t=t+1 No Yes t>max generation End Single-chain MCMC algorithm
Multiple-chain MCMC algorithm Chain #1 Chain #2 Chain #3 Chain #4 Generate S1(0) t=0 Generate S2(0) t=0 Generate S3(0) t=0 Generate S4(0) t=0 Propose & Update S1(t) Propose & Update S2(t) Propose & Update S3(t) Propose & Update S4(t) choose two chains to swap Compute R and U U<R Swap the two selected chains t=t+1 t=t+1 t=t+1 t=t+1
chain rate site Task Concurrency • In one likelihood evaluation, the task can be divided into n_chain x n_site x n_rate concurrent tasks
y z x Task decomposition and assignment Concurrent tasks Processors virtual topology We organize the processor pool into an nx X ny X nz 3D grid Each grid runs n_chain/ny chains, each chain has n_site/nx sites, and each site has n_rate/nz category rate. (my current implementation dosen’t consider rate dimension, so the topology is a 2D grid). Choose nx, ny, and nz such that each processor has equal load
Parallel configuration Modes • Chain-level parallel configuration (Mode I) • Each chain runs on one row • Number of chains = number of rows • Site-level parallel configuration (Mode II) • All chains run on the same row • Different subsequences run on different columns • Number of subsequences = number of columns • Hybrid parallel configuration (Mode III) • Combine the first two configurations • Several chains may run on the same row and also • Divide the whole into segments
Orchestration • The data can be distributed at chain level and sub-sequence level, spatial locality is always preserved. At the same time, each process only needs 1/(nxXny) original memory • Each step, each processor needs 1 collective communication along its row and at most 2 (or 1) direct communication if it is one of the selected chain for swapping. • Processors at the same row can easily be synchronized because they use the same random number.
Communication between different rows • Processors at different rows need some trick to synchronize without communication • The trick is to use two random number generators • the first one is used to generate the proposal, processors at different row use different seeds for random number generator. • The second one is used to synchronize processors in different rows, using the same seed. So processors know what’s going on in the whole grid. • To reduce the size of the message, the selected processors swap message in two steps: • Swap current likelihood and temperature • Compute the acceptance ratio r and draw a random number U, if U<r, then swap the tree and parameters else continue to next step.
Symmetric parallel algorithm-1 • Build grid topology • Initialize: • Set seed for random number Rand_1(for proposal) and Rand_2(for synchronize) • For each chain • t = 0 • Building random tree T0 • Choose initial parameter θ0 • Set initial state as x0 • While (t<max_generation) do 3.1 In-chain update • For each chain • Choose proposal type using Rand_2 • Generate the proposal x’ • Evaluate log likelihood locally • Collect the sum of the local log likelihood • For each chain • Compute r using hasting transition kernel • Draw a sample u from uniform distribution using Rand_1 • If u<r set x=x’ (continue)
Symmetric parallel algorithm-2 • 3.2 inter-chain swap • Draw two chain indexes I and J from Rand_2 • If I==J go to 3.3 • Map I, J to the processor coordinate row_I and row_J • If row_I == row_J • Compute acceptance ratio r’ • Draw u’ from uniform distribution using Rand_2 • If u’<r’ Swap state x and the temperature of chain I and chain J • Go to 3.3 • If row_I != row_J • Swap log likelihood and temperature of chain I and chain J between processors • on chain row_I and row_J • Compute acceptance r’ • Draw u’ from from uniform distribution using Rand_2 • If u’<r’ Swap state x of chain I and chain J between between processors • on chain row_I and row_J • 3.3 t=t+1 • 3.4 if t%sample_cycle==0 output state x
The ith chain only needs to re-evaluate one node The jth chain needs to re-evaluate two nodes Assume two chains have the same tree at time t Load Balancing • For simplicity, from now we just consider 2D grid (x-y). • Along x-axis, the processors are synchronized since they evaluate the same topology, same branch length, and same local update step • But along y-axis, different chains will propose different topologies, so according to local likelihood evaluation, imbalance will occur. • In the parameter update step, we need global likelihood evaluation, balancing is not a problem.
Row group 1 mediator Row group 2 Row group 3 Asymmetric parallel algorithm • The symmetric algorithm assumes each processor has the same performance and deals with the imbalance between different rows by waiting and synchronization. • An improvement is to use an asymmetric algorithm. The basic idea is: • Introduce a processor as the mediator • Each row send its state to the mediator after it finished in-chain update • The mediator keeps the current states of different rows, and decides which two chains should swap • The selected chains get new states from the mediator • Asymmetric algorithm can solve the unbalance problem, but assumes that different chains don’t need to synchronize.