Models of Evolution Majid Kazemian
Introduction • Probabilistic Model of Indels • Model of an arbitrary distribution of indel lengths (TKF Model) • MCALIGN • We have seen the above models earlier in the course • Models of Nucleotide Substitution: Jukes Cantor model, Kimura model
Phylogeny Tree • Given a set of sequences x0=(x1,x2,…,xn), the goal is to infer the phylogeny tree • Suppose that • n = number of species • T = topology of the tree • t0 = the edge lengths of the tree • We want to compute pr(x0|T,t0)
A simple example • Suppose we have the following phylogeny tree [Figure: tree with nodes x1 to x5 and edge lengths t1 to t5] • Then pr(x0|T,t0) is built from one factor per edge, so to calculate it we need Pr(x | y, t), the probability that y evolves to x in time t
Substitution • Assume that • Indels do not occur • Each position of the sequence evolves independently • Then Pr(x | y, t) = ∏j pr(xj | yj, t), with the product taken over positions j = 1, …, L • Here pr(xj | yj, t) is the probability of a change from “yj” to “xj” in time t • Ancestor: y1y2…yL • Descendant: x1x2…xL
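The assumption of the model • Multiplicativity requirement • S(t1)S(t2)=S(t1+t2), where S(t) is the matrix of substitution probabilities for time t • This requirement holds if the transition probabilities are stationary and Markovian • Intuitively, this means that the probability of going from yj to xj depends only on the elapsed time, (t2+t1) − t1, and not on the absolute time [Figure: time axis showing residues yj, aj, xj and times t0, t1, t1+t2]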
Jukes Cantor Model (cont.) • Over a small amount of time ε, the probability of a substitution is linear in ε. This means that, to first order, we ignore multiple substitutions within ε: we cannot go from Ai to Aj and back to Ai. • S(ε) ≈ I + Rε
Jukes Cantor Model (cont.) • Is S(t) similar to S(ε)?
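One way to approach this question, sketched here using only the multiplicativity requirement S(t1)S(t2)=S(t1+t2) and the short-time form S(ε) ≈ I + Rε from the previous slides. The derivative notation S'(t) is introduced here for illustration and is not taken from the slides.

```latex
\begin{aligned}
S(t+\varepsilon) &= S(t)\,S(\varepsilon) \approx S(t)\,(I + R\varepsilon) \\
\frac{S(t+\varepsilon) - S(t)}{\varepsilon} &\approx S(t)\,R \\
S'(t) &= S(t)\,R \quad (\varepsilon \to 0), \qquad S(0) = I
\end{aligned}
```

So for large t, S(t) is in general not of the form I + Rt; it is the solution of this linear differential equation.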
Jukes Cantor Model (cont.) • We know that S(t) has the following form (why ?)
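A minimal sketch of the closed-form Jukes Cantor S(t). The slide that defines R is not preserved in this transcript, so the code assumes the usual parameterization (−3α on the diagonal of R, α elsewhere). The names `jc_matrix`, `matmul` and the numeric values are illustrative only. By symmetry S(t) has one value r_t on the diagonal and one value s_t everywhere else, and the check at the end confirms numerically that S(t1)S(t2)=S(t1+t2).

```python
import math

def jc_matrix(alpha, t):
    """Jukes Cantor S(t), assuming R has -3*alpha on the diagonal and alpha elsewhere."""
    e = math.exp(-4.0 * alpha * t)
    r_t = 0.25 * (1.0 + 3.0 * e)  # probability of ending on the same nucleotide
    s_t = 0.25 * (1.0 - e)        # probability of ending on one specific other nucleotide
    return [[r_t if i == j else s_t for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_abs_diff(a, b):
    """Largest entrywise difference between two square matrices."""
    n = len(a)
    return max(abs(a[i][j] - b[i][j]) for i in range(n) for j in range(n))

# Illustrative values, not taken from the slides.
alpha, t1, t2 = 0.1, 0.7, 1.9
product = matmul(jc_matrix(alpha, t1), jc_matrix(alpha, t2))
direct = jc_matrix(alpha, t1 + t2)
multiplicativity_error = max_abs_diff(product, direct)
```

As t grows, both r_t and s_t tend to 1/4, which is the uniform stationary distribution of this model.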
More advanced models • The J-C model makes highly “symmetric” assumptions in its formulation of the rate matrix R • In reality, for example, “transitions” are more common than “transversions” • What are these? Purines are A and G; pyrimidines are C and T. A transition is a substitution within the same category; a transversion is a substitution across categories • Purines are similarly sized, as are pyrimidines, so a nucleotide is more likely to be replaced by one of similar size • The “Kimura” model captures this transition/transversion bias
Kimura Model • The rate matrix R is given by:
Kimura Model (cont.) • We know that S(t) should look like this (why ?)
Kimura Model (cont.) • Again, solving the differential equations (as we did for the JC model) gives the entries of S(t)
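The Kimura matrices themselves are not preserved in this transcript. The sketch below therefore assumes the common two-parameter form: rate α for transitions, rate β for transversions, and −(α+2β) on the diagonal of R. The function names and parameter values are illustrative. Solving S'(t) = S(t)R for this R gives three distinct entries: s_t for a specific transversion, u_t for the transition, and r_t for no change.

```python
import math

ORDER = "ACGT"
PURINES = {"A", "G"}

def is_transition(x, y):
    """True when x and y differ but are both purines or both pyrimidines."""
    return x != y and ((x in PURINES) == (y in PURINES))

def kimura_rate_matrix(alpha, beta):
    """Assumed rate matrix R: alpha for transitions, beta for transversions."""
    return [[-(alpha + 2.0 * beta) if x == y
             else (alpha if is_transition(x, y) else beta)
             for y in ORDER] for x in ORDER]

def kimura_matrix(alpha, beta, t):
    """Closed-form S(t) for the assumed rate matrix above."""
    s_t = 0.25 * (1.0 - math.exp(-4.0 * beta * t))            # one specific transversion
    u_t = 0.25 * (1.0 + math.exp(-4.0 * beta * t)
                  - 2.0 * math.exp(-2.0 * (alpha + beta) * t))  # the transition
    r_t = 1.0 - 2.0 * s_t - u_t                                # no change
    return [[r_t if x == y else (u_t if is_transition(x, y) else s_t)
             for y in ORDER] for x in ORDER]

# Illustrative values: transitions three times as frequent as each transversion.
alpha, beta = 0.3, 0.1
R = kimura_rate_matrix(alpha, beta)
S = kimura_matrix(alpha, beta, 1.5)
```

Setting α = β makes u_t equal to s_t, which recovers the Jukes Cantor form.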
Even More advanced models (cont.) • These reach greater levels of realism • The Kimura model still has a uniform stationary distribution, which is not true of real data • One extension: the purine-to-pyrimidine substitution probability differs from the pyrimidine-to-purine substitution probability • This leads to a non-uniform stationary distribution • The “HKY” model captures this bias
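Inferring Phylogeny for two sequences • Let’s go back to the original problem: we wanted to compute pr(x0|T,t0) • In the case of two sequences without gaps, the expression includes a term for the probability of the residue at the root [Figure: tree with leaves x1 and x2 on edges of length t1 and t2]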
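The slide's own equation is not preserved, so the form below is an assumption. It is the standard way to write the two-sequence likelihood under the independence assumption made earlier. The symbol q_a (probability of residue a at the root) is notation introduced here, and j indexes positions as on the Substitution slide.

```latex
\Pr(x^1, x^2 \mid T, t_1, t_2)
  = \prod_{j=1}^{L} \sum_{a} q_a \,
    \Pr(x^1_j \mid a, t_1)\, \Pr(x^2_j \mid a, t_2)
```

The sum over a is needed because the residue at the root is unobserved.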
A simple example • Suppose that • x1=C C G G C C G C G C G • x2=C G G G C C G G C C G
A simple example (cont.) • Assume JC model • Our goal is to find the tree topology, t1 and t2
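A simple example (cont.) • Let n1 be the number of CC and GG pairs (matching columns) and n2 the number of CG and GC pairs (mismatching columns) • So the likelihood depends on the sequences only through n1 and n2, and on the edge lengths only through t1+t2 • If α is known, we can find t1+t2 by simple Maximum Likelihood • α is estimated from two closely related species for which we assume t1+t2=1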
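A minimal sketch of this estimate for the two example sequences. It assumes the Jukes Cantor parameterization with rate α and a uniform root distribution. Under those assumptions a matching column contributes (1 + 3e^{−4α(t1+t2)})/16 and a mismatching column contributes (1 − e^{−4α(t1+t2)})/16. The value of α and the function names are illustrative, not taken from the slides.

```python
import math

x1 = "CCGGCCGCGCG"
x2 = "CGGGCCGGCCG"

def count_pairs(a, b):
    """Return (n1, n2): numbers of matching and mismatching columns."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length (no gaps)")
    n1 = sum(1 for p, q in zip(a, b) if p == q)
    return n1, len(a) - n1

def log_likelihood(n1, n2, alpha, total_t):
    """Log likelihood as a function of total_t = t1 + t2 (requires total_t > 0)."""
    e = math.exp(-4.0 * alpha * total_t)
    return n1 * math.log((1.0 + 3.0 * e) / 16.0) + n2 * math.log((1.0 - e) / 16.0)

def ml_total_time(n1, n2, alpha):
    """Setting the derivative to zero gives e^{-4*alpha*t} = (3*n1 - n2) / (3*(n1 + n2))."""
    ratio = (3.0 * n1 - n2) / (3.0 * (n1 + n2))
    if ratio <= 0.0:
        raise ValueError("sequences too divergent for a finite estimate")
    return -math.log(ratio) / (4.0 * alpha)

n1, n2 = count_pairs(x1, x2)
alpha = 1.0  # assumed known, as the slide supposes; the value is illustrative
t_hat = ml_total_time(n1, n2, alpha)
```

Only the sum t1+t2 is identifiable from two sequences under this model, which is why the function returns a single number.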
Inferring Phylogeny for n sequences • How do we infer the topology and t0 for n sequences? • The likelihood sums over all possible internal node assignments, with each node i conditioned on the parent of node i • How do we compute this probability efficiently?
Dynamic Programming • The recursion: the probability of all leaves below node k, given that the residue at k is α (with residues b and c at the two children of k) • How do we estimate (T,t0)? By ML estimation?
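The recursion can be sketched as follows. This is an illustrative implementation, not the author's code: it assumes a binary tree, Jukes Cantor substitution probabilities with rate `alpha`, and a uniform distribution at the root. The dictionary-based tree representation and all function names are hypothetical. For each node k and residue a, the conditional probability is the product over the two children of the sum over the child residue of the substitution probability times the child's own conditional probability.

```python
import math

ALPHABET = "ACGT"

def jc_prob(child, parent, t, alpha):
    """Jukes Cantor probability of parent -> child over time t (assumed model)."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if child == parent else 0.25 * (1.0 - e)

def leaf(name):
    return {"name": name}

def node(left, t_left, right, t_right):
    return {"left": left, "t_left": t_left, "right": right, "t_right": t_right}

def conditional(tree, site, alpha):
    """Return {a: P(leaves below this node | residue a at this node)} for one column."""
    if "name" in tree:  # leaf: probability 1 for the observed residue, 0 otherwise
        observed = site[tree["name"]]
        return {a: 1.0 if a == observed else 0.0 for a in ALPHABET}
    left = conditional(tree["left"], site, alpha)
    right = conditional(tree["right"], site, alpha)
    result = {}
    for a in ALPHABET:
        from_left = sum(jc_prob(b, a, tree["t_left"], alpha) * left[b] for b in ALPHABET)
        from_right = sum(jc_prob(c, a, tree["t_right"], alpha) * right[c] for c in ALPHABET)
        result[a] = from_left * from_right
    return result

def site_likelihood(tree, site, alpha):
    """Sum over the root residue, weighted by an assumed uniform root distribution."""
    root = conditional(tree, site, alpha)
    return sum(0.25 * root[a] for a in ALPHABET)

def alignment_log_likelihood(tree, sequences, alpha):
    """Columns are independent, so the log likelihoods of the columns add up."""
    length = len(next(iter(sequences.values())))
    total = 0.0
    for j in range(length):
        site = {name: seq[j] for name, seq in sequences.items()}
        total += math.log(site_likelihood(tree, site, alpha))
    return total

# The two example sequences, with illustrative edge lengths.
sequences = {"x1": "CCGGCCGCGCG", "x2": "CGGGCCGGCCG"}
two_leaf_tree = node(leaf("x1"), 0.05, leaf("x2"), 0.07)
log_lik = alignment_log_likelihood(two_leaf_tree, sequences, alpha=1.0)
```

Each node is visited once per column, so the cost grows linearly with the number of nodes instead of exponentially with the number of internal node assignments.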
How to infer topology? • The naïve way is to enumerate all topologies and, for each topology, solve the ML estimation numerically (e.g. with Newton’s method) • This is not feasible for many species, because the number of topologies grows very rapidly with the number of species • The idea is to infer the topology using a sampling technique
Metropolis Sampling • We need a mechanism for proposing moves and for accepting or rejecting them
Proposal distribution • Accept a proposed move with the following probability
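The slide's proposal distribution and acceptance formula are not preserved in this transcript. The sketch below therefore uses the standard Metropolis rule for a symmetric proposal: accept with probability min(1, P(proposed)/P(current)). To stay small it samples only the total branch length t = t1+t2 for the two-sequence example (n1 = 8 matching and n2 = 3 mismatching columns) instead of whole trees. The uniform prior on (0, 2), the step size and all names are assumptions made for illustration.

```python
import math
import random

T_MAX = 2.0  # assumed upper bound of a uniform prior on t

def log_target(t, n1=8, n2=3, alpha=1.0):
    """Unnormalized log posterior of t under Jukes Cantor and a uniform prior on (0, T_MAX)."""
    if t <= 0.0 or t >= T_MAX:
        return -math.inf
    one_minus_e = -math.expm1(-4.0 * alpha * t)  # 1 - e^{-4*alpha*t}, stable for small t
    e = 1.0 - one_minus_e
    return n1 * math.log(1.0 + 3.0 * e) + n2 * math.log(one_minus_e)

def acceptance_probability(log_p_current, log_p_proposed):
    """Metropolis rule for a symmetric proposal: min(1, P(proposed) / P(current))."""
    if log_p_proposed >= log_p_current:
        return 1.0
    return math.exp(log_p_proposed - log_p_current)

def metropolis(log_p, start, step, n_samples, seed=0):
    """Random-walk Metropolis sampler; returns the list of visited states."""
    rng = random.Random(seed)
    current, current_lp = start, log_p(start)
    samples = []
    for _ in range(n_samples):
        proposed = current + rng.uniform(-step, step)  # symmetric proposal
        proposed_lp = log_p(proposed)
        if rng.random() < acceptance_probability(current_lp, proposed_lp):
            current, current_lp = proposed, proposed_lp
        samples.append(current)  # a rejected move repeats the current state
    return samples

samples = metropolis(log_target, start=0.5, step=0.1, n_samples=5000)
```

For tree inference the state would be a pair (T, t0) and the proposal would modify the topology or the edge lengths, but the accept/reject step keeps the same shape.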
Two comments • We made an independence assumption for the columns of the genome, but some regions evolve faster and some slower • We assumed that there are no gaps • We need to consider gaps (e.g. with a pair HMM)
Reference • Richard Durbin, Sean R. Eddy, Anders Krogh, Graeme Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Sections 8.1 to 8.5