Ab Initio Profile HMM Generation Sam Gross
Profile HMMs (stolen from Batzoglou lecture) [Figure: profile HMM for protein profile H, with states BEGIN, M1…Mm, I0…Im, D1…Dm, END] • Each M state has a position-specific pre-computed substitution table • Each I and D state has position-specific gap penalties • Profile is a generative model: • The sequence X that is aligned to H is thought of as “generated by” H • Therefore, H parametrizes a conditional distribution P(X | H)
Ab Initio Profile Generation • Given N related protein sequences x1…xN • Construct a profile HMM H such that $\prod_{i} P(x_i \mid H)$ is maximized
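Because the sequences are treated as independent given H, the same objective can be written as a sum of log-probabilities, which is usually the more convenient form numerically. The symbol $\hat{H}$ for the maximizer is introduced here and does not appear on the slides:

```latex
\hat{H}
  = \arg\max_{H} \prod_{i=1}^{N} P(x_i \mid H)
  = \arg\max_{H} \sum_{i=1}^{N} \log P(x_i \mid H)
```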
Easier Said Than Done • Profile HMM length is unknown • Use average sequence length • Alignment is unknown • HMM parameters are unknown
Not A New Problem • Instance of the general problem of HMM parameter estimation using unlabelled outputs • Instance of the even more general problem of MLE with partially missing data • We want $\arg\max_{\theta} P(D_{obs} \mid \theta)$ • We know $P(D_{obs}, D_{hid} \mid \theta)$
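The quantity we want and the quantity we know are linked by summing out the hidden data. For the profile HMM, $D_{hid}$ would be the unknown alignment (the state path) of each sequence; that reading is an interpretation, not something stated on the slide:

```latex
P(D_{obs} \mid \theta) = \sum_{D_{hid}} P(D_{obs}, D_{hid} \mid \theta)
\qquad\Longrightarrow\qquad
\hat{\theta} = \arg\max_{\theta} \sum_{D_{hid}} P(D_{obs}, D_{hid} \mid \theta)
```

The sum ranges over every possible completion of the hidden data, which is why direct maximization is hard and an iterative scheme is used instead.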
The Expectation Maximization (EM) Algorithm • Start with initial guess for parameters • Iterate until convergence: • E-step: Calculate expectations for missing data • M-step: Treating expectations as observations, calculate MLE for parameters
Baum-Welch: EM For HMMs • Start with initial guess of HMM parameters • Iterate until convergence: • Forward-backward algorithm • MLE using forward-backward posterior probabilities
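The loop above can be sketched for a small discrete HMM as follows. This is a minimal illustration, not the implementation behind these slides: it handles a single observation sequence, has no silent (delete) states, and the function names `forward_backward` and `baum_welch_step`, the toy sequence `obs`, and the starting parameters are all made up for the example.

```python
import math

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass.
    Returns state posteriors gamma[t][i], expected transition counts
    xi_sum[i][j], and the log-likelihood of obs under (pi, A, B)."""
    T, K = len(obs), len(pi)
    alpha = [[0.0] * K for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(K):
            prev = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(K))
            alpha[t][j] = prev * B[j][obs[t]]
        scale[t] = sum(alpha[t])                      # normalizer, avoids underflow
        alpha[t] = [a / scale[t] for a in alpha[t]]
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(K)) / scale[t + 1]
    gamma = [[alpha[t][i] * beta[t][i] for i in range(K)] for t in range(T)]
    xi_sum = [[0.0] * K for _ in range(K)]
    for t in range(T - 1):
        for i in range(K):
            for j in range(K):
                xi_sum[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                 * beta[t + 1][j] / scale[t + 1])
    return gamma, xi_sum, sum(math.log(s) for s in scale)

def baum_welch_step(obs, pi, A, B):
    """One EM iteration: E-step via forward-backward, then MLE from expected counts."""
    gamma, xi_sum, loglik = forward_backward(obs, pi, A, B)
    K, V = len(pi), len(B[0])
    new_pi = gamma[0][:]
    new_A = [[xi_sum[i][j] / sum(xi_sum[i]) for j in range(K)] for i in range(K)]
    new_B = []
    for i in range(K):
        total = sum(g[i] for g in gamma)
        new_B.append([sum(g[i] for g, o in zip(gamma, obs) if o == v) / total
                      for v in range(V)])
    return new_pi, new_A, new_B, loglik               # loglik is for the OLD parameters

# Toy run: hypothetical 2-state HMM over symbols {0, 1}.
obs = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.6, 0.4], [0.4, 0.6]]
logliks = []
for _ in range(10):
    pi, A, B, ll = baum_welch_step(obs, pi, A, B)
    logliks.append(ll)
```

The recorded `logliks` should never decrease from one iteration to the next, which is the usual sanity check for an EM implementation.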
Incorporating Prior Knowledge • We know in advance certain types of residues tend to align together • Use a Dirichlet mixture prior over outputs for match states • Each distribution in the mixture corresponds to a different “alignment environment”
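As an illustration of how such a prior could enter the M-step, the sketch below computes the posterior-mean emission probabilities for one match column under a Dirichlet mixture. The 4-letter alphabet, the two components, and the function name `dirichlet_mixture_estimate` are hypothetical; real mixtures for proteins use 20 residues and fitted components, which are not given in these slides.

```python
import math

def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def dirichlet_mixture_estimate(counts, weights, alphas):
    """Posterior-mean emission probabilities for one column.
    counts  : observed (possibly fractional) residue counts n_a
    weights : mixture coefficients q_k
    alphas  : one Dirichlet parameter vector per component
    Returns (probs, post), where post[k] = P(component k | counts)."""
    # log P(counts | component k), up to a constant shared by all components
    log_post = [math.log(q)
                + log_beta([n + a for n, a in zip(counts, alpha)])
                - log_beta(alpha)
                for q, alpha in zip(weights, alphas)]
    m = max(log_post)                                  # stabilize the exponentials
    post = [math.exp(lp - m) for lp in log_post]
    z = sum(post)
    post = [p / z for p in post]
    n_total = sum(counts)
    probs = [sum(pk * (counts[a] + alpha[a]) / (n_total + sum(alpha))
                 for pk, alpha in zip(post, alphas))
             for a in range(len(counts))]
    return probs, post

# Hypothetical example: residues 0 and 1 tend to co-occur, as do 2 and 3.
weights = [0.5, 0.5]
alphas = [[2.0, 2.0, 0.1, 0.1],
          [0.1, 0.1, 2.0, 2.0]]
probs, post = dirichlet_mixture_estimate([3, 0, 0, 0], weights, alphas)
```

With only residue 0 observed, the first component dominates the posterior, so the unobserved residue 1 receives more probability than residues 2 and 3. That is the "alignment environment" effect the slide describes.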
Coin Flips Example • Two trick coins are used to generate a sequence of heads and tails • You see only the sequence, and must determine the probability of heads for each coin [Figure: Coin A, Coin B]
10,000 Coin Flips • Real coins • PA(heads) = 0.4 • PB(heads) = 0.8 • Initial guess • PA(heads) = 0.51 • PB(heads) = 0.49 • Learned model • PA(heads) = 0.801 • PB(heads) = 0.413
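A rough sketch of an experiment in this spirit follows. The slides do not say how the coin is chosen for each flip, so the sketch makes its own assumptions: flips come in batches of 10, each batch uses one coin picked with probability 0.5, and that mixing probability is known and held fixed. The function name `em_two_coins` and the random seed are likewise illustrative.

```python
import math
import random

def em_two_coins(batches, p_a, p_b, iters=100):
    """EM for a 50/50 mixture of two coins.
    batches: list of (heads, flips) pairs, each produced by one unknown coin."""
    for _ in range(iters):
        heads_a = flips_a = heads_b = flips_b = 0.0
        for h, n in batches:
            # E-step: posterior probability that this batch came from coin A
            log_a = h * math.log(p_a) + (n - h) * math.log(1.0 - p_a)
            log_b = h * math.log(p_b) + (n - h) * math.log(1.0 - p_b)
            r_a = 1.0 / (1.0 + math.exp(log_b - log_a))
            # accumulate expected heads and flips attributed to each coin
            heads_a += r_a * h
            flips_a += r_a * n
            heads_b += (1.0 - r_a) * h
            flips_b += (1.0 - r_a) * n
        # M-step: MLE treating the expected counts as observations
        p_a, p_b = heads_a / flips_a, heads_b / flips_b
    return p_a, p_b

# Simulate 1,000 batches of 10 flips (10,000 flips) from coins with
# P(heads) = 0.4 and 0.8, then start EM from the slide's initial guess.
rng = random.Random(0)
batches = []
for _ in range(1000):
    p_true = 0.4 if rng.random() < 0.5 else 0.8
    batches.append((sum(rng.random() < p_true for _ in range(10)), 10))
p_a, p_b = em_two_coins(batches, 0.51, 0.49)
```

Because the initial guess for coin A is slightly higher than for coin B, the heads-heavy batches are attributed to A, so `p_a` ends up near 0.8 and `p_b` near 0.4. The names "A" and "B" carry no information the data can recover, which is consistent with the learned model on the slide matching the real coins with the labels swapped.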
Toy Profile Example • Create a profile for the following sequences: • ADACGIH • ADAGIH • ADACGH • AACQH • ADAYGIH • Use the profile to align the sequences
Results • Alignment: ADACGIH / ADA-GIH / ADACG-H / A-ACQ-H / ADAYGIH • Match state emissions: Match1 A 100%; Match2 D 100%; Match3 A 100%; Match4 C 75%, Y 25%; Match5 G 80%, Q 20%; Match6 I 62%, H 38%; Match7 H 100%
Clustering With A Mixture Of Profiles • Given N protein sequences x1…xN • Construct M profile HMMs H1…HM and a mapping F: x → H such that $\prod_{i} P(x_i \mid F(x_i))$ is maximized • F is a natural clustering of the protein sequences into M groups