
Ab Initio Profile HMM Generation


Presentation Transcript


  1. Ab Initio Profile HMM Generation Sam Gross

  2. Profile HMMs (STOLEN FROM BATZOGLOU LECTURE)
  [Figure: protein profile H, drawn as a profile HMM with BEGIN and END states, match states M1 … Mm, insert states I0 … Im, and delete states D1 … Dm]
  • Each M state has a position-specific pre-computed substitution table
  • Each I and D state has position-specific gap penalties
  • A profile is a generative model:
    • The sequence X that is aligned to H is thought of as "generated by" H
    • Therefore, H parametrizes a conditional distribution P(X | H)

  3. Ab Initio Profile Generation
  • Given N related protein sequences x1 … xN
  • Construct a profile HMM H such that ∏_i P(x_i | H) is maximized

  4. Easier Said Than Done
  • Profile HMM length is unknown
    • Use the average sequence length
  • Alignment is unknown
  • HMM parameters are unknown

  5. Not A New Problem
  • An instance of the general problem of HMM parameter estimation from unlabelled outputs
  • An instance of the even more general problem of MLE with partially missing data
  • We want θ* = argmax_θ P(D_obs | θ)
  • We know P(D_obs, D_hid | θ)
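Spelling out the connection between the two quantities on this slide (a LaTeX rendering): the likelihood we want to maximize is the complete-data likelihood we know, marginalized over the hidden data, and that sum over all hidden completions (here, all possible alignments) is what makes direct maximization hard.

```latex
\theta^{*}
  = \arg\max_{\theta} P(D_{\mathrm{obs}} \mid \theta)
  = \arg\max_{\theta} \sum_{D_{\mathrm{hid}}} P(D_{\mathrm{obs}}, D_{\mathrm{hid}} \mid \theta)
```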

  6. The Expectation Maximization (EM) Algorithm
  • Start with an initial guess for the parameters
  • Iterate until convergence:
    • E-step: calculate expectations for the missing data
    • M-step: treating the expectations as observations, calculate the MLE for the parameters

  7. Baum-Welch: EM For HMMs
  • Start with an initial guess of the HMM parameters
  • Iterate until convergence:
    • E-step: the forward-backward algorithm
    • M-step: MLE using the forward-backward posterior probabilities
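A minimal numpy sketch of one such iteration for a generic discrete-output HMM (the function names, scaling scheme, and array shapes are my own illustration, not code from the lecture): `forward_backward` is the E-step, and `baum_welch_step` turns its posteriors into re-estimated parameters.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """E-step: scaled forward-backward posteriors for a discrete HMM.
    obs: observation indices (T,), pi: initial distribution (S,),
    A: transition matrix (S, S), B: emission matrix (S, V)."""
    obs = np.asarray(obs)
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))   # scaled forward probabilities
    beta = np.zeros((T, S))    # scaled backward probabilities
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta       # P(state at time t = s | all observations)
    return gamma, alpha, beta, scale

def baum_welch_step(obs, pi, A, B):
    """One EM iteration: treat the posteriors as soft counts, take MLEs."""
    obs = np.asarray(obs)
    S, V = B.shape
    gamma, alpha, beta, scale = forward_backward(obs, pi, A, B)
    xi = np.zeros((S, S))      # expected transition counts
    for t in range(len(obs) - 1):
        xi += (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
               / scale[t + 1])
    new_pi = gamma[0]
    new_A = xi / xi.sum(axis=1, keepdims=True)
    new_B = np.zeros((S, V))
    for v in range(V):
        new_B[:, v] = gamma[obs == v].sum(axis=0)
    new_B /= new_B.sum(axis=1, keepdims=True)
    return new_pi, new_A, new_B
```

Iterating `baum_welch_step` until `np.log(scale).sum()` (the log-likelihood) stops improving gives the full loop; for a profile HMM the same machinery runs over the match/insert/delete architecture of slide 2.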

  8. Incorporating Prior Knowledge
  • We know in advance that certain types of residues tend to align together
  • Use a Dirichlet mixture prior over the outputs of the match states
  • Each distribution in the mixture corresponds to a different "alignment environment"
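A sketch of how such a prior acts on one match state's emission estimate, following the usual Dirichlet-mixture recipe: weight each component by how well it explains the observed counts, then average the components' posterior means. The two-component prior, the 4-letter alphabet, and all numeric values below are made-up illustrations; a real protein prior would use 20-letter components, one per alignment environment.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alpha):
    """log P(counts | Dirichlet(alpha)): ratio of multivariate Beta
    functions (multinomial coefficient omitted; it cancels across
    components when computing responsibilities)."""
    n, a = counts.sum(), alpha.sum()
    return (gammaln(a) - gammaln(n + a)
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

def posterior_mean_emissions(counts, mix_weights, alphas):
    """Posterior mean emission distribution under a Dirichlet mixture prior.
    counts: observed residue counts at one match state,
    mix_weights: prior mixture weights q_k, alphas: component parameters."""
    # Responsibility of each component for the observed counts
    log_w = np.log(mix_weights) + np.array(
        [log_marginal(counts, a) for a in alphas])
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Each component contributes its posterior mean, weighted by responsibility
    means = np.array([(counts + a) / (counts.sum() + a.sum())
                      for a in alphas])
    return w @ means

# Toy 4-letter alphabet: one peaked "environment" and one flat one
counts = np.array([6.0, 1.0, 0.0, 1.0])
q = np.array([0.5, 0.5])
alphas = [np.array([4.0, 1.0, 0.5, 0.5]), np.array([1.0, 1.0, 1.0, 1.0])]
print(posterior_mean_emissions(counts, q, alphas))
```

The pseudocounts keep rare residues from getting zero probability at sparsely observed positions, while the mixture lets the strength and shape of the smoothing depend on which environment the counts resemble.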

  9. Coin Flips Example
  • Two trick coins are used to generate a sequence of heads and tails
  • You see only the sequence, and must determine the probability of heads for each coin
  [Figure: Coin A and Coin B]

  10. 10,000 Coin Flips
  • Real coins: PA(heads) = 0.4, PB(heads) = 0.8
  • Initial guess: PA(heads) = 0.51, PB(heads) = 0.49
  • Learned model: PA(heads) = 0.801, PB(heads) = 0.413
  • (EM recovers both biases; the coin labels simply come out swapped)
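A runnable sketch of this experiment, implementing exactly the E-step/M-step loop from slide 6. The slide does not say how flips are grouped, so the sketch assumes (as in the classic EM coin-flip setup) that the 10,000 flips come in blocks of 10, each block generated entirely by one randomly chosen coin; without some such structure the two biases would not be identifiable from a single flip sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
BLOCKS, FLIPS = 1000, 10                     # 10,000 flips total
true_p = np.array([0.4, 0.8])                # real PA, PB
coins = rng.integers(0, 2, size=BLOCKS)      # hidden coin choice per block
heads = rng.binomial(FLIPS, true_p[coins])   # observed head counts per block

p = np.array([0.51, 0.49])                   # initial guess
for _ in range(200):
    # E-step: posterior probability that each block came from each coin
    # (binomial log-likelihoods; the binomial coefficient cancels, and
    # equal mixing weights are assumed, so they cancel too)
    log_lik = (heads[:, None] * np.log(p)
               + (FLIPS - heads)[:, None] * np.log(1 - p))
    resp = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: MLE of each coin's bias from fractionally assigned flips
    p_new = (resp * heads[:, None]).sum(0) / (resp.sum(0) * FLIPS)
    if np.allclose(p_new, p, atol=1e-8):
        break
    p = p_new
print(p)   # converges near {0.4, 0.8}, up to which coin gets which label
```

Because the likelihood is symmetric in the two coins, which learned parameter ends up near 0.4 and which near 0.8 depends only on the initial guess, which is why the slide's learned PA matches the real PB.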

  11. Toy Profile Example
  • Create a profile for the following sequences:
    ADACGIH
    ADAGIH
    ADACGH
    AACQH
    ADAYGIH
  • Use the profile to align the sequences

  12. Results
  Alignment:
    ADACGIH
    ADA-GIH
    ADACG-H
    A-ACQ-H
    ADAYGIH
  Match state emissions:
    Match1: A 100%
    Match2: D 100%
    Match3: A 100%
    Match4: C 75%, Y 25%
    Match5: G 80%, Q 20%
    Match6: I 62%, H 38%
    Match7: H 100%

  13. Clustering With A Mixture Of Profiles
  • Given N protein sequences x1 … xN
  • Construct M profile HMMs H1 … HM and a mapping F: x → H such that ∏_i P(x_i | F(x_i)) is maximized
  • F is a natural clustering of the protein sequences into M groups
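A sketch of the clustering loop. Fitting full profile HMMs is heavyweight, so this stand-in "profile" is deliberately simplified to independent per-position emission tables over equal-length sequences (no insert/delete states); `fit_profile`, `log_lik`, and `cluster` are hypothetical names for illustration. The loop is hard EM: assign each sequence to the profile that maximizes P(x | H), refit each profile on its members, and repeat.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(ALPHABET)}

def fit_profile(seqs, pseudocount=1.0):
    """MLE (with pseudocounts) of per-position emission probabilities."""
    L = len(seqs[0])
    counts = np.full((L, len(ALPHABET)), pseudocount)
    for s in seqs:
        for pos, a in enumerate(s):
            counts[pos, IDX[a]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_lik(seq, profile):
    """log P(seq | profile) under the position-independent model."""
    return sum(np.log(profile[pos, IDX[a]]) for pos, a in enumerate(seq))

def cluster(seqs, M, iters=20, seed=0):
    """Hard EM: assign each sequence to its best profile, then refit."""
    rng = np.random.default_rng(seed)
    F = rng.integers(0, M, size=len(seqs))      # random initial mapping
    for _ in range(iters):
        profiles = [fit_profile([s for s, g in zip(seqs, F) if g == m])
                    if (F == m).any() else fit_profile(seqs)
                    for m in range(M)]
        new_F = np.array([np.argmax([log_lik(s, p) for p in profiles])
                          for s in seqs])
        if np.array_equal(new_F, F):            # converged: mapping is stable
            break
        F = new_F
    return F
```

Each step can only increase the slide's objective ∏_i P(x_i | F(x_i)): reassignment improves each factor individually, and refitting improves each cluster's likelihood, so the loop converges to a local optimum.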
