E N D
Phylogenetic HMMs A. Siepel and D. Haussler. Computational identification of evolutionarily conserved exons. Proc. 8th Annual Int'l Conf. on Research in Computational Biology (RECOMB '04).A. Siepel and D. Haussler. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol. Evol. 21:468-488, 2004.A. Siepel and D. Haussler. Combining phylogenetic and hidden Markov models in biosequence analysis. Proc. 7th Annual Int'l Conf. on Research in Computational Biology (RECOMB '03) Sam Gross
What is a phylogenetic HMM? • Combination of HMM and probabilistic phylogenetic model of inheritance • Just an HMM with emission probabilities defined by state-specific phylogenetic models • Can be used for the same applications as regular HMM or phylogenetic model (e.g., alignment, determining evolutionary relationships, gene prediction)
Phylogenetic Models • Define the probability of a column of a multiple alignment 12345678 Human AGTCACCT Dog AGTTATTT Mouse CCTTATTT Rat CCTTATTT Pr (column 1) = Pr (H1=A, D1=A, M1= C, R1= C)
A fully general distribution over n sequences needs 4n parameters • We can reduce the number of parameters by introducing conditional independence assumptions about how the sequences evolved
Phylogenetic Tree Model A1 D A2 H A3 M R Conditional independence assumption: Probability distribution of a node depends only on the value of its parent Pr (A1, A2, A3, H, D, M, R) = Pr (A1) Pr (D|A1) Pr (A2|A1) Pr (H|A2) * Pr (A3|A2) Pr (M|A3) Pr (R|A3) Now we need only 4 + 16*(2n-2) parameters to specify the distribution
Continuous Markov Process Of Base Substitution • We can further reduce the number of parameters by assuming substitution along a branch (u, v) occurs via a continuous Markov process operating over a time βv. The rate matrix Q that defines this process is constant; only the βv’s vary between branches.
Substitution Rate Matrix Parameterizations • Unrestricted model • Time reversible model • HKY model
Q is not used directly; instead, we want P(t) = pi,j, a matrix defining the probability of substituting base j for base i along a branch of length t. Now we’re down to 2n-2 + 12 (unrestricted), 2n-2 + 6 (reversible), or 2n-2 + 1 (HKY) parameters
Algorithms For Phylogenetic Models • Inference: Given a model and its parameters, calculate the probability of any assignment or partial assignment • Parameter Estimation: Given a multiple alignment, estimate the parameters of the model that produced that alignment
Calculating The Probability Of An Assignment: Felsenstein’s Algorithm • Given an assignment Xi, uses dynamic programming to calculate probability in linear time: Let P(Lu|a) denote the probability of all the assigned bases descending from node u given that node u is assigned base a. Let v and w be the children of u, and tvand twbe the connecting branch lengths. With the summations over all bases consistent with Xi
Parameter Estimation • We want the maximum likelihood estimate of Q and β: • Nonlinear optimization problem with missing data—not so easy! • Best strategy so far seems to be using a combination of EM and quasi-Newton optimization
EM/Quasi-Newton Optimization Start with initial guess for Q, β. Iterate until convergence: E-step: Calculate expected number of substitutions along each branch using current estimates of Q and β M-step: Use quasi-Newton optimizer to maximize the probability of the expected data by varying Q and β Algorithm is guaranteed to converge to a local, not global, maximum For N=2, unrestricted Q, Siepel and Haussler report 6.6 days on a 2.4GHz Pentium IV. However, reversible models take closer to 1 day.
Generalizations • Incorporating Gaps In The Alignment: • Ignore columns with gaps • Treat gaps as missing data • Treat gaps as 5th character • Separate HMM states for each gap pattern
Incorporating Context: Simply redefine the alphabet to contain all (N+1)-mers and change the dimensions of Q, π as appropriate. Slight change to inference algorithm: Denominator can be calculated by running Felsenstein’s algorithm and treating any N-mer consistent with the first N-1 bases as consistent with the whole N-mer Time to calculate probability of a column is
Phylogenetic HMMs For Gene Finding • EXONIPHY uses a phylogenetic HMM to predict genes, with somewhat disappointing results • Phylogenetic model is very sophisticated, but HMM is not generalized and state model is basic • Opportunity for phylogenetic gHMM with good state model
Adding Phylogenetic Models To GENSCAN A1 H D A2 A2 H A3 A1 A3 M R D M R Pr (A1, A2, A3, H, D, M, R) = Pr (H) Pr (A2|H) Pr (A1|A2) * Pr (A3|A2) Pr(D|A1) (M|A3) Pr (R|A3) Theorem: If Q is reversible, the choice of root does not affect the probability distribution defined by the tree
This means we can use the target’s GENSCAN model for the root distribution, instead of having to estimate an ancestral distribution • Columns with gaps in target sequence are ignored for now
Handling Gaps • 6-character alphabet: A, C, G, T, gap, unaligned • Substitution rates for tuples containing gaps depend only on the gap pattern—e.g., A-G and A-T both have the pattern X-X • In most the general reversible model, number of parameters in Q goes from 288 to 393. Number of parameter in P goes from 4096 to 4321. • No need to create extra states
Other Considerations • Inference works the same, except recursion starts at target and internal nodes may now have either one or two children • Informant selection • May need CNC state to avoid false positives from large conservation scores • 7th symbol for multiple-of-three gap?