90 likes | 161 Views
Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit. J. Hein, C. Wiuf, B. Knudsen, M.B. Moller and G. Wibling. Main Objective of the paper.
E N D
Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit J. Hein, C. Wiuf, B. Knudsen, M.B. Moller and G. Wibling
Main Objective of the paper • To show how to accelerate the statistical alignment algorithms several orders of magnitude using the model of insertion and deletions by Thorne, Kishino, and Felsenstein in 1991 (TKF91 model). • To propose a new homology test based on the model. • To describe a goodness-of-fit test that allows testing the proposed insertion-deletion process inherent to the model.
Why isn’t statistical alignment popular? • Computationally VERY SLOW • Authors of the paper accelerated the statistical alignment algorithms several orders of magnitude compared with the TKF91 algorithm. • Lack of user-friendly software? • Usually written in Fortran or C, or the compiled program only works in UNIX environment, but most biologists don’t know much about it. • Authors of the paper have provided a web interface to the program
parsimony and similarity alignments • Parsimony strategy: minimizing the distance • For example: • Similarity strategy: maximizing the similarity score • For example: BLAST
TKF91 model of substitutions • continuous time Markov model on the state space of nucleotides or amino acids • Rate matrix Q is specified • Describes the intensity of different substitution events over an infinitesimal time period. • Probability that i has changed to j after time t is • The process is assumed to be time reversible:
TKF91 model of the indel process • Can be view as a Markov model with all sequences as possible states • indel part of the model • links connecting the letters of the sequences • each has a mortal link on the right • left end has an immortal link • For example: A G G • If the type of the nucleotide is ignored, can be represented as
TKF91 model • mortal link can give birth to a new mortal link or die out • immortal link can also give birth but would not die • Therefore, the rates can be written as: A G G I0 S1 I1/D1 S2 I2/D2 S3 I3/D3 where I is the birth rate D is the death rate, D>I S is the substitution rate
TKF91 model To calculate the probability of a particular alignment: s(1): A T - s(2): C T G P(s(1), s(2), alignment) = (p1’’)(AP1 PAC)(T P2 PTT G)
Calculating the probability of two sequences • Without conditioning on the alignment, it is necessary to sum over all alignments weighted with their probabilities according to the TKF91 process. • Confine likelihood calculations to a band close to the similarity based alignment allows an efficient numerical optimization algorithm for finding the maximum likelihood estimate • The recursions originally presented by Thorne, Kishino and Felsenstein can be simplified.