Overview of Hidden Markov Models (HMMs) and profiles

Overview of Hidden Markov Models (HMMs) and profiles

From this lecture: • Profiles • Basics of Hidden Markov models • Estimating HMM parameters • Sequence weighting • Using HMMs for alignment and homolog detection • Subfamily HMMs

Eddy papers in Nature Biotechnology http://selab.janelia.org/publications Recommended reading

UCSC tutorial on HMMs (by Rachel Karchin) http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html (useful, but not required)

HMMs are a kind of profile http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Sample profile Gribskov et al, PNAS 1987 Gribskov et al, PNAS 1987

HMM for 5’ splice site (5’SS) recognition Eddy, Nature Biotechnology 2004 Assumptions (encoded in model): • Exons (E) have a uniform base composition • Introns (I) are A/T rich • 5’SS is almost always G

HMM for splice site recognition Eddy, Nature Biotechnology 2004

HMM parameter estimation using unaligned training sequences Delete/skip Insert Match • HMM parameter estimation: • Compute probabilities of data given model • Align sequences to HMM • Gather statistics of paths taken through HMM (Expectation step) • 2. Modify HMM parameters to Maximize Prob (data | model) (Maximization step) (Maximum Likelihood) • Iterate Steps 1-3 until parameters converge. >Seq1 MIVSP >Seq2 MVVSTGP >Seq3 MVVSSGP >Seq4 MVLSSPP >Seq5 MLSGPP training data

Hidden Markov Model (HMM) Delete/skip Insert Match END START M O R N I N G Originally used in speech recognition (Rabiner, 1986) • Proposed for DNA modeling (Churchill, 1989) • Applied to modeling proteins (Haussler et al, 1992) • Multiple sequence alignment • Identification of related family members (“homologs”)

Aligning sequences to an HMM to construct an MSA Note: how to read a UCSC a2m-formatted MSA (in-class) http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Generating a multiple alignment by aligning sequences to an HMM >Seq6 MIVSTSG >Seq7 MVVTTG >Seq8 SP >Seq9 PP Seq6 M I V S T S G Seq7 M V V - T T G Seq8 - - - - - S P Seq9 - - - - - P P

Estimating HMM parameters http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.htmlhttp://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Viterbi and Baum-Welch algorithms http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Simulated annealing and other methods for handling local optima http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Sequence weighting http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Henikoff weighting http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Henikoff weighting Weight of a character in the MSA = 1/m*k m = #unique amino acids seen k = # times a particular amino acid is seen Weight of a sequence is the average of the weights in all positions, normalized to sum to 1. http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Overfitting and regularization http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Using pseudocounts http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Dirichlet mixture densities http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Including prior information in profile or HMM construction The use of Dirichlet mixture densities

D S I F M K D S V F M K D T I W M K D T I W M K D T V W M K Profile or HMM parameter estimation using small training sets What other amino acids might be seen at this position among homologs? What are their probabilities? .

D S I F M K D S V F M K D T I W M K D T I W L K D T L W L R The context is critical when estimating amino acid distributions This position may be critical for function or structure, and may not allow substitutions .

Dirichlet Mixture Prior “Blocks9” Parameters estimated using Expectation Maximization (EM) algorithm. Training data: 86,000 columns from BLOCKS alignment database.

ˆ pi = the estimated probability of amino acid ‘i’ n = (n1,…,n20) = the count vector summarizing the observed amino acids at a position. αj = (αj,1 ,…, αj,20 ) = the parameters of component j of the Dirichlet mixture Θ. Combining Prior Knowledge with Observations using Dirichlet Mixture Densities Dirichlet Mixtures: A Method for Improved Detection of Weak but Significant Protein Sequence Homology. Sjolander, Karplus, Brown, Hughey, Krogh, Mian and Haussler. CABIOS (1996)

http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.htmlhttp://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Log-odds ratio http://www.cse.ucsc.edu/research/compbio/ismb99.handouts/KK185FP.html

Seq1 M V V S - - P Seq2 M V V S T G P Seq3 M V V S S G P Seq4 M V L S S P P Seq5 M - L S G P P HMM construction using an initial multiple sequence alignment Delete/skip Insert Match

In searching for family members, all features must be assumed to be equally informative.

Without knowing which features are more important, would we recognize this relative?

Gathering family members allows us to identify conserved attributes and create a profile Conserved: stripes, cat. Variable: coat color, size.

Profile generalization allows us to identify sometruly remote relatives

Conflict • For effective remote homolog detection, a profile or HMM needs information from divergent family members • Without this context, we cannot differentiate critical from variable positions • HMMs constructed with such data provide a coarse classification • But, the more variability we introduce in training data, the greater the potential noiseat some positions D S LF MK I D S IF MK V D T IW MK M D T IW MK L D T VW MK F D T FR DK I D T FR DK V

Divergence across the family; conservation within subfamilies Average BLOSUM62 Score Position

Subfamily HMM Construction

3.5.2.2 Dihydropyrimidinase 3.5.4.1 Cytosine deaminase 3.5.2.3 Dihydroorotase 3.5.1.5 Urease Subfamily Assessing classification accuracy 7TM GPCR ABC Transporter Amidohydrolase ATPase Family

Subfamily HMMs (SHMMs) Discovering and Modeling Functional Subtypes

How to build Subfamily HMMs (SHMMs) D S LF MK I D S IF MK V D T IW MK M D T IW MK L D T VW MK F D T FR KK I D T FR KK V Share statistics between subfamilies where there is evidence of a common distribution. 1 2 3 4 5 6 7 Keep statistics separate at positions where there is evidence of divergent structure. 3 4 5 1 2 6 7 Improved specificity, sensitivity, alignment accuracy

Step 1: Form Dirichlet Mixture Posterior At each position, for each subfamily, construct a Dirichlet mixture posterior, by combining the Dirichlet mixture prior with the amino acids aligned at that position by that subfamily. (Weighted) subfamily counts Mixture coefficient Component Parameters (Weighted) subfamily counts of amino acid i

Step 2: Calculate family contribution Other subfamilies contribute, proportional to the probability of the amino acids they aligned at that position, given the revised Dirichlet mixture density. D S LF MK I D S IF MK V D T IW MK M D T IW MK L D T VW MK F D T FR KK I D T FR KK V (Weighted) counts from subfamily s΄ (Formula for computing Prob (n | Θ ) are in Sjolander et al, 1996)

Step 3: Compute pseudocounts Add the family contribution to the observed (weighted) counts, to obtain the pseudocounts ti of amino acid i: (Weighted) subfamily counts for subfamily s family contribution

Step 4: Compute amino acid probabilities Normally, we compute amino acid probabilities by combining a Dirichlet mixture prior with observed counts as follows:

SHMM Remote Homolog Detection • 515 PFAM Full MSAs, each corresponding to a unique SCOP Fold. • Family HMMs constructed using UCSC SAM w0.5 software. • Subfamily HMMs constructed using BETE. • Each sequence in PDB90 assigned a family score and a subfamily score (best-of-SHMMs). • E-values computed by fitting these scores to an extreme value distribution Brown D, Krishnamurthy N, Dale J, Christopher W, and Sjölander K, "Subfamily HMMs in Functional Genomics", Proceedings of the Pacific Symposium on Biocomputing, 2005

Overview of Hidden Markov Models (HMMs) and profiles

Overview of Hidden Markov Models (HMMs) and profiles

Presentation Transcript

Hidden Markov Models

Hidden Markov Models in Bioinformatics

Hidden Markov Models HMMs Rosanne Vetro

Hidden Markov Models

Hidden Markov Models (HMMs)

Hidden Markov Models

Hidden Markov Models

Hidden Markov models

Hidden Markov Models

Hidden Markov Models

Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs)

Hidden Markov Models

Hidden Markov Models

Hidden Markov Models (HMMs)

Applications of Hidden Markov Models

Hidden Markov Models

Hidden Markov Models

Hidden Markov Models

Hidden Markov Models