Profile Hidden Markov Models

Bioinformatics, Fall 2004
Dr Webb Miller and Dr Claude dePamphilis
Dhiraj Joshi
Department of Computer Science and Engineering
The Pennsylvania State University

Outline
• Introduction to HMMs
• Profile HMMs
• Available resources for Profile HMMs
• Some online demonstrations

Introduction to HMMs
• Hidden Markov Models – Formalism
  • statistical techniques for modeling patterns in data
  • first-order Markov property (memorylessness): the next state depends only on the current state
  • a state is generally a hidden entity that emits symbols or features
  • the same symbol can be emitted by several states
  • an HMM is characterized by its transition probabilities and emission distributions
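The formalism above can be made concrete with a small generative sketch. The two states, their names, and every probability below are hypothetical values chosen only for illustration; they do not come from the slides.

```python
import random

# Hypothetical two-state HMM over DNA symbols; all numbers are illustrative.
STATES = ["AT_rich", "GC_rich"]
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample(length, seed=0):
    """Generate (hidden state path, emitted symbol string) of the given length."""
    rng = random.Random(seed)
    path, symbols = [], []
    state = draw(START, rng)
    for _ in range(length):
        path.append(state)
        # The emitted symbol depends only on the current hidden state.
        symbols.append(draw(EMIT[state], rng))
        # First-order Markov property: the next state depends only on the current one.
        state = draw(TRANS[state], rng)
    return path, "".join(symbols)


path, seq = sample(20)
print(seq)   # what an observer sees
print(path)  # what stays hidden
```

Note that every symbol can be emitted by either state, so the symbol string alone does not reveal the path.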

Introduction to HMMs
• Hidden Markov Models – Parameter Estimation
  • parameters: transition probabilities and emission probabilities
  • iterative computational algorithms are used
    • EM algorithm, Viterbi algorithm
  • the algorithms rely on dynamic programming to save computational cost
  • the iterations usually involve variants of the following two steps
    • estimate the state sequence that maximizes the likelihood under the current parameter set
    • update the parameter set based on the estimated state sequence
  • the algorithms may converge to a local optimum rather than the global one
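The first of the two steps, finding the most likely state sequence by dynamic programming, can be sketched as a log-space Viterbi routine. The model below is a hypothetical two-state example with made-up probabilities, and the sketch assumes every probability is nonzero so that the logarithms are defined.

```python
import math

# Hypothetical two-state model; all numbers are illustrative.
STATES = ["AT_rich", "GC_rich"]
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(obs, states, start, trans, emit):
    """Return (best log probability, most probable state path) for obs."""
    # V[t][s]: best log probability of any path that ends in state s at position t.
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor of s, reusing the scores already computed for t - 1.
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans[p][s])) for p in states),
                key=lambda item: item[1],
            )
            V[t][s] = score + math.log(emit[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path


logp, path = viterbi("AATTGCGCGC", STATES, START, TRANS, EMIT)
print(path, logp)
```

The table `V` has one row per sequence position and one entry per state, so the cost grows linearly with sequence length instead of exponentially with the number of possible paths.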

Profile Hidden Markov Models
• Stochastic methods to model multiple sequence alignments – proteins and DNA sequences
• Potential application domains:
  • protein families could be modeled as an HMM or a group of HMMs
    • constructing a profile HMM
  • new protein sequences could be aligned with stored models to detect remote homology
    • aligning a sequence with a stored profile HMM
  • two or more protein family profile HMMs could be aligned to detect homology
    • finding statistical similarities between two profile HMMs

Profile Hidden Markov Models
• Constructing a profile HMM
  • a multiple sequence alignment is assumed
  • each consensus column is modeled by 3 states
    • match, insert and delete states
  • the number of states depends on the length of the alignment
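How the alignment length determines the number of states can be sketched as follows. Two assumptions are made here that the slide does not state: a column counts as a consensus (match) column when at most half of its entries are gaps, and the architecture has one match and one delete state per consensus column plus one more insert state than match states, with begin and end states left out. The toy alignment is also only an illustration.

```python
# Toy alignment for illustration; '-' marks a gap.
ALIGNMENT = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]


def match_columns(alignment, max_gap_fraction=0.5):
    """Return indices of columns treated as consensus (match) columns.

    Assumed heuristic: a column is a match column when the fraction of
    gap characters in it does not exceed max_gap_fraction.
    """
    n_seqs = len(alignment)
    width = len(alignment[0])
    columns = []
    for j in range(width):
        gaps = sum(1 for seq in alignment if seq[j] == "-")
        if gaps / n_seqs <= max_gap_fraction:
            columns.append(j)
    return columns


def state_counts(n_match):
    """State counts for the assumed architecture (begin/end states excluded)."""
    return {"match": n_match, "insert": n_match + 1, "delete": n_match}


cols = match_columns(ALIGNMENT)
print(cols)
print(state_counts(len(cols)))
```

Columns 3 and 4 are mostly gaps, so residues found there would be attributed to an insert state rather than given a match state of their own.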

Profile Hidden Markov Models
• A typical profile HMM architecture
  • squares represent match states
  • diamonds represent insert states
  • circles represent delete states
  • arrows represent transitions

Profile Hidden Markov Models
• A typical profile HMM architecture
  • transition between match states -
  • transition from match state to insert state -
  • transition within insert state -
  • transition from match state to delete state -
  • transition within delete state -
  • emission of symbol at a state -
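The quantities listed above can be written in symbols. The notation below is an assumption, following a common textbook convention rather than anything recoverable from the slide: M_j, I_j and D_j denote the match, insert and delete states at consensus position j, a denotes a transition probability, and e an emission probability.

```latex
\begin{align*}
a_{M_j M_{j+1}} &: \text{transition between match states} \\
a_{M_j I_j}     &: \text{transition from match state to insert state} \\
a_{I_j I_j}     &: \text{transition within insert state (self-loop)} \\
a_{M_j D_{j+1}} &: \text{transition from match state to delete state} \\
a_{D_j D_{j+1}} &: \text{transition within delete states} \\
e_{M_j}(x),\; e_{I_j}(x) &: \text{emission of symbol } x \text{ at a match or insert state}
\end{align*}
```

Delete states are silent in this convention, so they have no emission term.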

Profile Hidden Markov Models
• Estimation of parameters
  • transition probabilities are estimated as the frequency of each transition in a given alignment
  • emission probabilities are estimated as the frequency of each emission in a given alignment
  • pseudo counts are usually introduced to account for transitions / emissions that were not present in the alignment
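Frequency-based estimation with pseudo counts can be sketched for the emissions of a single match column. The DNA alphabet, the uniform pseudo count of 1, and the function name are assumptions made for illustration; transition probabilities can be treated the same way using counts of state-to-state transitions.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet for this illustration


def emission_probs(column, pseudocount=1.0, alphabet=ALPHABET):
    """Estimate emission probabilities for one alignment column.

    Gap characters are ignored. The same pseudocount is added to every
    symbol, so symbols never seen in the column still get probability > 0.
    With pseudocount=0 the column must contain at least one residue.
    """
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


column = "AAAC-"
print(emission_probs(column, pseudocount=0))  # plain frequencies: G and T get 0
print(emission_probs(column))                 # with pseudo counts: no zeros
```

Without pseudo counts, a single G at this position would give any sequence a probability of zero under the model, however well the rest of it matched.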

Profile Hidden Markov Models
• Estimation of parameters
  • with pseudo counts
  • Dirichlet prior distribution used to determine pseudo counts
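One common way to write the estimates with pseudo counts is given below. The symbols are assumptions introduced here: A_{kl} is the number of observed transitions from state k to state l, E_k(x) is the number of times state k emits symbol x, and r_{kl}, r_k(x) are the corresponding pseudo counts.

```latex
\begin{align*}
a_{kl} &= \frac{A_{kl} + r_{kl}}{\sum_{l'} \left( A_{kl'} + r_{kl'} \right)}, &
e_k(x) &= \frac{E_k(x) + r_k(x)}{\sum_{x'} \left( E_k(x') + r_k(x') \right)}
\end{align*}
```

If the emission distribution of state k is given a Dirichlet prior with parameters \alpha_x, setting r_k(x) = \alpha_x makes e_k(x) the posterior mean estimate, which is the sense in which a Dirichlet prior determines the pseudo counts.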

Profile Hidden Markov Models
• Scoring a sequence against a profile HMM
  • Viterbi algorithm used to find the best state path
  • simulated annealing based methods are also used
  • maximization criterion: log likelihood or log odds
  • the log likelihood score generally depends on the length of the sequence and is hence not preferred
  • if an alignment is not given initially, it can be learnt iteratively using Viterbi
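The difference between the two criteria can be sketched in a deliberately simplified setting: an ungapped path that visits only match states, so that transitions, inserts and deletes are ignored. The three-position model and the uniform background distribution are hypothetical.

```python
import math

# Hypothetical background (null model) and match-state emissions.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
MATCH_EMISSIONS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def log_likelihood(seq, match_emissions):
    """log2 P(seq | model) along the ungapped match-state path."""
    if len(seq) != len(match_emissions):
        raise ValueError("sequence length must equal the number of match states")
    return sum(math.log2(match_emissions[j][x]) for j, x in enumerate(seq))


def log_odds(seq, match_emissions, background):
    """log2 of P(seq | model) / P(seq | background) along the same path."""
    if len(seq) != len(match_emissions):
        raise ValueError("sequence length must equal the number of match states")
    return sum(
        math.log2(match_emissions[j][x] / background[x]) for j, x in enumerate(seq)
    )


for seq in ("AGT", "CCC"):
    print(seq, log_likelihood(seq, MATCH_EMISSIONS),
          log_odds(seq, MATCH_EMISSIONS, BACKGROUND))
```

Every extra position adds a negative term to the log likelihood, so longer sequences score lower regardless of fit. In the log-odds score a position that is no more informative than the background, like the third one here, contributes zero.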

Profile Hidden Markov Models
• Comparing two profile HMMs
  • Profile-profile comparison tool based on information theory
  • based on Kullback-Leibler divergence criterion for comparing 2 statistical distributions
  • dynamic programming used to compare entire profiles
  • detect weak similarities between models
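A minimal sketch of comparing two emission distributions, for example one column from each of two profiles, with the Kullback-Leibler divergence. The symmetric Jensen-Shannon variant is included as an assumption about what is convenient for column scoring, since plain KL is not symmetric; the distributions are made up, and both functions expect the two dictionaries to share the same keys.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits; requires q[a] > 0 wherever p[a] > 0."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence in bits (between 0 and 1)."""
    m = {a: 0.5 * (p[a] + q[a]) for a in p}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical emission distributions of two profile columns.
col_a = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
col_b = {"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1}
col_c = {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}

print(js_divergence(col_a, col_b))  # similar columns: small divergence
print(js_divergence(col_a, col_c))  # dissimilar columns: larger divergence
```

A column-to-column score of this kind is what a dynamic programming alignment of two entire profiles would sum over aligned positions.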

Available resources for Profile HMMs
• HMMER and SAM were among the first available programs for profile HMMs
  • HMMER: S. Eddy at Washington University
  • SAM (Sequence Alignment and Modeling System): R. Hughey at University of California, Santa Cruz
  • available free for research
• SAM has online servers to perform sequence comparisons
  http://www.cse.ucsc.edu/research/compbio/sam.html

Available resources for Profile HMMs
• The InterPro consortium in Europe has many resources for protein data
  • database of protein families and domains
  • brings together several different databases under one umbrella
  • Pfam and SUPERFAMILY are profile HMM libraries associated with InterPro
  • Pfam is based on HMMER search; SUPERFAMILY is based on SAM search and modeling

Available resources for Profile HMMs
• SAM’s iterative approach for building an HMM
  • find a set of close homologs using BLASTP
  • learn the alignment and build a model using the close homologs
  • use BLASTP to get more remote homologs using the first set of sequences (relax the E value)
  • iteratively refine the HMM
• SAM uses Dirichlet priors as pseudo counts for parameters
• Hand-tuned seed alignments are not required, as the alignments are learnt by the algorithm – unlike HMMER

Available resources for Profile HMMs
• SUPERFAMILY database incorporates:
  • library of profile HMMs representing all proteins of known structure
  • assignments to predicted proteins from all completely sequenced genomes
  • search and alignment services
  • models and domain assignments are freely available
• Based on SCOP classification of protein domains
• SAM HMM iterative procedure used for model building and sequence alignment

Available resources for Profile HMMs
• In SUPERFAMILY:
  • each SCOP superfamily is represented as an HMM
  • model built using the SAM procedure, based on 4 variants
    • accurate structure-based alignments
    • hand-labeled alignments
    • automatic alignments using ClustalW
    • sequence members used separately as seeds
  • Assignment of superfamilies
    • for a given sequence, every model is scored across the whole sequence using Viterbi scoring
    • the model that scores highest has its superfamily assigned to the region

Online Demonstrations http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/temp/624288710157514.html

References
• Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., "Biological Sequence Analysis", Cambridge University Press, 2002
• Baldi, P. and Brunak, S., "Bioinformatics, the Machine Learning Approach", The MIT Press, Cambridge, 1998
• Eddy, S., "Profile Hidden Markov Models", Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998
• Karplus, K., Barrett, C., and Hughey, R., "Hidden Markov models for detecting remote homologies", Bioinformatics, vol. 14, no. 10, pp. 846-856, 1998
• Madera, M. and Gough, J., "A comparison of profile hidden Markov model procedures for remote homology detection", Nucleic Acids Research, vol. 30, no. 19, pp. 4321-4328, 2002
• Gough, J., Karplus, K., Hughey, R., and Chothia, C., "Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that represent all Proteins of known structure", J. Mol. Biol., vol. 313, pp. 903-919, 2001

References
• Yona, G. and Levitt, M., "Within the Twilight Zone: A sensitive Profile-Profile comparison tool based on Information Theory", J. Mol. Biol., vol. 315, pp. 1257-1275, 2002
• Madera, M., Vogel, C., Kummerfeld, K., Chothia, C., and Gough, J., "The SUPERFAMILY database in 2004: additions and improvements", Nucleic Acids Research, vol. 32, Database Issue, pp. D235-D239, 2004
• Bateman, A., Birney, E., Durbin, R., Eddy, S., Finn, R., and Sonnhammer, E., "Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins", Nucleic Acids Research, vol. 27, no. 1, 1999
• Andreeva, A., et al., "SCOP database in 2004: refinements integrate structure and sequence family data", Nucleic Acids Research, vol. 32, Database Issue, pp. D226-D229, 2004
• Many other online resources and tutorials
