
Introduction to Bioinformatics: Lecture XIII Profile and Other Hidden Markov Models



Presentation Transcript


  1. Introduction to Bioinformatics: Lecture XIII: Profile and Other Hidden Markov Models
  Jarek Meller, Division of Biomedical Informatics, Children's Hospital Research Foundation & Department of Biomedical Engineering, UC

  2. Outline of the lecture
  • Multiple alignments, family profiles and probabilistic models of biological sequences
  • From simple Markov models to Hidden Markov Models (HMMs)
  • Profile HMMs: topology and parameter optimization
  • Finding optimal alignments: the Viterbi algorithm
  • Other applications of HMMs

  3. Web watch: personalized predictive medicine
  Targeting a crucial signal transduction pathway in lung cancer: an inhibitor of Epidermal Growth Factor Receptor (EGFR) catalytic activity binds EGFRs carrying specific mutations. Genotyping the EGFR gene appears to be sufficient to predict the outcome of the therapy.
  Paez JG et al., Science 304 (2004)

  4. Hidden Markov Models for biological sequences
  • Problems with grammatical structure, such as gene finding, family profiles, protein function prediction, and transmembrane domain prediction
  • In general, one may think of different statistical biases in different fragments of a sequence (due, for example, to their functional roles), or equivalently of different states emitting these fragments according to different probability distributions
  • Durbin et al., Biological Sequence Analysis, Chapters 3 to 6

  5. Example: Markov chain model for CpG islands
  Motivation: CpG dinucleotides (and not the C-G base pairs across the two strands) are frequently methylated at C, and methyl-C mutates at a higher rate into T; however, the methylation process is suppressed around regulatory sequences (e.g. promoters), where CpG islands therefore occur more often.
  [State diagram: four states A, C, G, T, with transitions between every pair of states]
  Transition probabilities: $t_{T,G} = P(a_i = G \mid a_{i-1} = T)$, etc.
  The overall probability of a sequence is defined as the product of transition probabilities.
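A minimal sketch of such a chain, assuming illustrative transition probabilities (the values below are placeholders, not parameters fitted to CpG-island data):

```python
# t[prev][cur] = P(a_i = cur | a_{i-1} = prev); each row sums to 1.
# All numbers are illustrative placeholders.
t = {
    "A": {"A": 0.30, "C": 0.20, "G": 0.30, "T": 0.20},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "T": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
}
p0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # initial distribution

def chain_probability(seq):
    """P(a1...an) = P(a1) * product over i of P(a_i | a_{i-1})."""
    p = p0[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= t[prev][cur]
    return p

print(chain_probability("ACGCGT"))
```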

  6. Example: Hidden Markov model for CpG islands
  [State diagram: eight states A, C, G, T and A*, C*, G*, T*, with transitions within each set of four and unlikely transitions between the two sets]
  Adding four more states (A*, C*, G*, T*) to represent the "island" model alongside the non-island model, with unlikely transitions between the two models, one obtains a "hidden" MM for CpG islands. There is no longer a one-to-one correspondence between states and symbols: knowing the sequence, we cannot tell which state the model was in when generating subsequent letters of the sequence.

  7. Probabilistic models of biological sequences
  For any probabilistic model, the total probability of observing a sequence $a_1 a_2 \ldots a_n$ may be written as:
  $P(a_1 a_2 \ldots a_n) = P(a_n \mid a_{n-1}, \ldots, a_1)\, P(a_{n-1} \mid a_{n-2}, \ldots, a_1) \cdots P(a_1)$
  In Markov chain models we simply have:
  $P(a_1 a_2 \ldots a_n) = P(a_n \mid a_{n-1})\, P(a_{n-1} \mid a_{n-2}) \cdots P(a_1)$
  HMMs are a generalization of Markov chain models, with "hidden" states that "emit" sequence symbols according to state-specific probability distributions, and (Markov) transitions between pairs of hidden states.
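As a small illustration, the sketch below evaluates the joint probability of a hidden state path and an emitted sequence for a two-state toy model; the state names and all numbers are illustrative placeholders, not fitted parameters.

```python
# trans[k][l] = P(state l at step i | state k at step i-1)
trans = {
    "island":     {"island": 0.8, "background": 0.2},
    "background": {"island": 0.1, "background": 0.9},
}
# emit[k][b] = P(symbol b | state k)
emit = {
    "island":     {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
start = {"island": 0.5, "background": 0.5}

def joint_probability(path, seq):
    """P(path, seq) = P(pi_1) e(x_1|pi_1) * prod_i a(pi_{i-1},pi_i) e(x_i|pi_i)."""
    p = start[path[0]] * emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        p *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return p

print(joint_probability(["island", "island", "background"], "CGT"))
```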

  8. HMMs as probabilistic linguistic models
  HMMs may in fact be regarded as probabilistic finite automata that generate certain "languages": sets of words (sentences, etc.) with a specific "grammatical" structure. For example, promoter, start, exon, splice junction, intron and stop "states" will appear in a linguistic model of a gene, whereas column (sequence position), insert and deletion states will be employed in a linguistic model of a (protein) family profile.

  9. HMMs for gene prediction: an exon model

  10. HMMs and the supervised learning approach
  • Given a training set of aligned sequences, find the transition and emission probabilities that maximize the probability of observing the training sequences: the Baum-Welch (Expectation Maximization) or Viterbi training algorithms
  • In the recognition phase, with the optimized probabilities in hand, we ask what the likelihood is that a new sequence belongs to the family, i.e. whether it is generated by the HMM with sufficiently high probability. The Viterbi algorithm, which is in fact dynamic programming in a suitable formulation, is used to find an optimal path through the states, which defines the optimal alignment

  11. Ungapped profiles and the corresponding HMMs
  [State diagram: Beg -> M1 -> ... -> Mj -> ... -> End]
  Each blue square represents a match state Mj that "emits" each letter a with a certain probability $e_j(a)$, defined by the frequency of a at position j.
  Example
  AGAAACT
  AGGAATT
  TGAATCT
  P(AGAAACT) = 16/81    P(TGGATTT) = 1/81
  Typically, pseudo-counts are added in HMMs to avoid zero probabilities.
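The sketch below reproduces these numbers from the column frequencies of the three training sequences; the optional pseudo-count argument is the add-one variant mentioned on the slide.

```python
from collections import Counter
from fractions import Fraction

training = ["AGAAACT", "AGGAATT", "TGAATCT"]
columns = list(zip(*training))  # one tuple of letters per alignment column

def emission(j, a, pseudocount=0):
    """e_j(a): frequency of letter a in column j, with optional pseudo-counts."""
    counts = Counter(columns[j])
    total = len(training) + 4 * pseudocount  # 4-letter DNA alphabet
    return Fraction(counts[a] + pseudocount, total)

def profile_probability(seq, pseudocount=0):
    p = Fraction(1)
    for j, a in enumerate(seq):
        p *= emission(j, a, pseudocount)
    return p

print(profile_probability("AGAAACT"))                 # 16/81
print(profile_probability("TGGATTT"))                 # 1/81
print(profile_probability("AGAAACG"))                 # 0 without pseudo-counts
print(profile_probability("AGAAACG", pseudocount=1))  # non-zero with pseudo-counts
```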

  12. HMMs and likelihood optimization

  13. Likelihood optimization …

  14. Insertions and deletions in profile HMMs
  [State diagram: Beg, match states Mj with insert states Ij above them, End]
  Insert states emit symbols just like the match states; however, their emission probabilities are typically assumed to follow the background distribution $q_a$, and thus do not contribute to log-odds scores, since $\log(e_{I_j}(a)/q_a) = \log(q_a/q_a) = 0$. Transitions Ij -> Ij are allowed and account for an arbitrary number of inserted residues, which are effectively unaligned (the order of residues within an inserted region is arbitrary).

  15. Insertions and deletions in profile HMMs
  [State diagram: Beg, match states Mj with silent delete states Dj above them, End]
  Deletions are represented by silent states that do not emit any letters. A chain of deletions (with D -> D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string). The total cost of a deletion is the sum of the costs of the individual transitions (M -> D, D -> D, D -> M) that define it. As in the case of insertions, both linear and affine gap penalties can easily be incorporated in this scheme.

  16. Gap penalties: evolutionary and computational considerations
  • Linear gap penalties: $g(k) = -kd$ for a gap of length k and a constant d
  • Affine gap penalties: $g(k) = -[d + (k-1)e]$, where d is the gap-opening penalty and e the gap-extension penalty
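A tiny sketch contrasting the two schemes; the values of d and e below are placeholders, not recommended parameters. In the profile HMM, these costs correspond to the M -> D (opening) and D -> D (extension) transition log-probabilities.

```python
def linear_gap(k, d=8):
    """g(k) = -k*d: every gapped position costs the same."""
    return -k * d

def affine_gap(k, d=8, e=1):
    """g(k) = -(d + (k-1)*e): opening costs d, each extension only e."""
    return -(d + (k - 1) * e)

for k in (1, 2, 5):
    print(k, linear_gap(k), affine_gap(k))
```

The affine form penalizes a single long gap much less than several short ones of the same total length, which matches the evolutionary intuition that one insertion/deletion event may cover several residues.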

  17. Profile HMMs as a model for multiple alignments
  [State diagram: Beg, match states Mj, insert states Ij, delete states Dj, End]
  Example
  AG---C
  A-AG-C
  AG-AA-
  --AAAC
  AG---C
  **   *
  The columns marked * (1, 2 and 6) are treated as match states; the unmarked columns form an insert region.

  18. Observed emission and transition counts
  For the alignment above, with columns 1, 2 and 6 as match states M1, M2, M3:
  Match emissions:   M1: A=4    M2: G=3    M3: C=4
  Insert emissions:  I2: A=6, G=1
  Transition counts:
  Beg->M1: 4    Beg->D1: 1
  M1->M2: 3     M1->D2: 1
  M2->M3: 2     M2->I2: 1
  I2->I2: 4     I2->M3: 2    I2->D3: 1
  D1->D2: 1     D2->I2: 2    D3->End: 1
  M3->End: 4
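These counts can be checked mechanically; a sketch, assuming only the alignment above and the choice of match columns:

```python
from collections import Counter

alignment = ["AG---C", "A-AG-C", "AG-AA-", "--AAAC", "AG---C"]
match_cols = {0, 1, 5}  # the columns marked * on the previous slide (0-based)

def state_path(row):
    """Map one alignment row to its path through Beg, Mj, Ij, Dj, End."""
    path, j = ["Beg"], 0  # j tracks the index of the last match/delete state
    for col, a in enumerate(row):
        if col in match_cols:
            j += 1
            path.append(("M" if a != "-" else "D", j, a))
        elif a != "-":
            # residues in insert columns are emitted by Ij; gaps there emit nothing
            path.append(("I", j, a))
    path.append("End")
    return path

def name(s):
    return s if isinstance(s, str) else f"{s[0]}{s[1]}"

transitions, emissions = Counter(), Counter()
for row in alignment:
    path = state_path(row)
    for prev, cur in zip(path, path[1:]):
        transitions[(name(prev), name(cur))] += 1
    for s in path[1:-1]:
        if s[0] != "D":  # silent delete states emit nothing
            emissions[(name(s), s[2])] += 1

print(transitions[("M1", "M2")], transitions[("I2", "I2")])  # 3 4
print(emissions[("M1", "A")], emissions[("I2", "A")])        # 4 6
```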

  19. Computing emission and transition probabilities
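The slide's formulas are not captured in the transcript; a sketch of the standard maximum-likelihood estimates with Laplace (add-one) pseudo-counts, continuing from the `transitions` and `emissions` counters built in the previous sketch, and assuming three allowed successors per state as in the interior of the profile:

```python
def transition_prob(counts, frm, to, n_successors=3):
    """a(frm,to) = (c(frm,to) + 1) / (sum over l of c(frm,l) + n_successors)."""
    total = sum(c for (f, _), c in counts.items() if f == frm)
    return (counts[(frm, to)] + 1) / (total + n_successors)

def emission_prob(counts, state, a, alphabet_size=4):
    """e_state(a) with one pseudo-count per letter of the DNA alphabet."""
    total = sum(c for (s, _), c in counts.items() if s == state)
    return (counts[(state, a)] + 1) / (total + alphabet_size)

print(transition_prob(transitions, "M1", "M2"))  # (3+1)/(4+3) = 4/7
print(emission_prob(emissions, "M1", "A"))       # (4+1)/(4+4) = 5/8
```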

  20. Optimal alignment corresponds to a path with the highest probability (or log-odds score)
  [State diagram: Beg, match states Mj, insert states Ij, delete states Dj, End]
  Problem: Given the above model, with the emission and transition probabilities obtained previously, find the optimal path (alignment) for the query sequence AGAC.
  Problem: Find the emission and transition counts assuming that the 4th column in the multiple alignment example of slide 17 corresponds to another match state (and not an insert state).

  21. Outline of the Viterbi algorithm
  [State diagram: Beg, match states Mj, insert states Ij, delete states Dj, End]
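The slide's recursion is not reproduced in the transcript. As a substitute, here is a minimal log-space Viterbi for a generic HMM, reusing the toy island/background parameters from earlier (all values illustrative). The profile-HMM version uses the same recursion additionally indexed by the profile position j, with silent delete states omitting the emission term.

```python
import math

STATES = ["island", "background"]
start = {"island": 0.5, "background": 0.5}
trans = {"island": {"island": 0.8, "background": 0.2},
         "background": {"island": 0.1, "background": 0.9}}
emit = {"island": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(seq):
    # v[i][k] = log-prob of the best path emitting seq[:i+1] and ending in state k
    v = [{k: math.log(start[k]) + math.log(emit[k][seq[0]]) for k in STATES}]
    back = [{}]
    for i in range(1, len(seq)):
        v.append({})
        back.append({})
        for k in STATES:
            prev, score = max(
                ((p, v[i - 1][p] + math.log(trans[p][k])) for p in STATES),
                key=lambda pair: pair[1],
            )
            v[i][k] = score + math.log(emit[k][seq[i]])
            back[i][k] = prev
    # traceback: recover the best path from the stored pointers
    last = max(STATES, key=lambda k: v[-1][k])
    best = v[-1][last]
    path = [last]
    for i in range(len(seq) - 1, 0, -1):
        last = back[i][last]
        path.append(last)
    return path[::-1], best

print(viterbi("CGCGCGTATATAT"))
```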

  22. Profile HMMs for local alignments
  [State diagram: flanking insert states Q before Beg and after End, around the Mj/Ij/Dj core]
  The trick consists of adding extra insert states Q that model the flanking, unaligned parts of the sequence using background frequencies $q_a$ and a large self-transition probability $t_{Q,Q}$.

  23. Summary
  • In general, when the states generating the training sequences (alignments) are not known, an iterative training procedure such as Baum-Welch or Viterbi training must be used
  • Problems remain with local minima and with the choice of topology (the length of the profile)
  • Excellent results in family assignment (SAM, PFAM), gene prediction, transmembrane domain recognition, etc.

  24. Outline of the lecture
