770 likes | 897 Views
Chapter 5. Profile HMMs for Sequence Families. What have we done?. So far, we have concentrated on the intrinsic properties of single sequences (CpG islands), or on pairwise alignment of sequences Functional biological sequences typically come in families
E N D
Chapter 5 Profile HMMs for Sequence Families
What have we done? • So far, we have concentrated on the intrinsic properties of single sequences (CpG islands), or on pairwise alignment of sequences • Functional biological sequences typically come in families • Many of the most powerful sequence analysis methods are based on identifying the relationship of an individual sequence to a sequence family
Sequence Families • Sequences in a family have diverged from each other in their primary sequence during evolution, but normally maintain the same or a related function • Thus, identifying that a sequence belongs to a family, and aligning it to the other members, often tells about its function
Sequence Family • If we have a set of sequences belonging to a family, we can perform a database search for more members using pairwise alignment with one of the known family members as the query sequence • We could even search with the known members one by one • However, pairwise searching with any one of the members may not find sequences distantly related to the ones we have already
Sequence Family • An alternative approach is to use statistical features of the whole set of sequences in the search • Similarly, even when family membership is clear, accurate alignment can be often improved significantly by concentrating on features that are conserved in the whole family
Sequence Family • A pairwise alignment captures much of the relationship between two sequences • A multiple alignment can show how the sequences in a family relate to each other
Globin Family • It is clear that some positions in the globin alignment are more conserved than others • The helices are more conserved than the loop regions between them • Certain residues are particularly strongly conserved (two conserved histidines H) • When identifying a new sequence as a globin, it would be desirable to concentrate on checking that these more conserved features are present
HMM Profile • Consensus modeling of the family using a probabilistic model • Built from a given multiple alignment (assumed to be correct!) • Develop a particular type of hidden Markov model well suited to modelling multiple alignments • We call these profile HMMs
HMM Profile • We will assume that we are given a correct multiple alignment, from which we will build a model that can be used to find and score potential matches to new sequences
Ungapped Score Matrices • A natural probabilistic model for a conserved region would be to specify independent probabilities ei(a) of observing amino acid a in position i • The probability of a new sequence x according to this model is
Log-odds ratio • We are interested in the ratio of the probability to the probability of x under the random model • The values log(ei(a)/qa) behave like elements in a score matrix s(a,b), where the second index is position i, rather than amino acid b Position specific score matrix (PSSM)
Adding Insert and Delete States • A PSSM might capture some conservation information, it is clearly an inadequate representation of all the information in a multiple alignment of a protein family. • We have to find some way to take account of gaps
Components of profile HMMs • Consideration of gaps • Henikoff & Henikoff [1991] • Combining the multiple ungapped block models • Allowing gaps at each position using the same gap scores (g) at each position (where gaps are more or less likely?) • Profile HMMs • Repetitive structure of states • Different probabilities in each position • Full probabilistic model for sequences in the sequence family
Components of profile HMMs • The PSSM can be viewed as a trival HMM with a series of identical states that we will call match states, separated by transitions of probability 1 • Alignment is trivial since there is no choice of transitions. • Match states • Emission probabilities .... .... Begin Mj End
Ungapped profiles and the corresponding HMMs Each blue square represents a match state that “emits” each letter with certain probability ej(a) which is defined by frequency of a at position j: Beg … Mj … End Example AGAAACT AGGAATT TGAATCT P(AGAAACT)=16/81 P(TGGATTT)=1/81 Typically, pseudo-counts are added in HMMs to avoid zero probabilities.
Insertions and deletions in profile HMMs Ij Beg Mj End Insert states emit symbols just like the match states, however, the emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores. Transitions Ij -> Ij are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).
Components of profile HMMs • Insert states • Emission prob. • Usually back ground distribution qa. • Transition prob. • Mi to Ii, Ii to itself, Ii to Mi+1 • Log-odds score of a gap of length k (no logg-odds from emission) Ij Begin Mj End
Insertions and deletions in profile HMMs Dj Beg Mj End Deletions are represented by silent states which do not emit any letters. A sequence of deletions (with D -> D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string). The total cost of a deletion is the sum of the costs of individual transitions (M->D, D->D, D->M) that define this deletion. As in case of insertions, both linear and affine gap penalties can be easily incorporated in this scheme.
Components of profile HMMs • Delete states • No emission prob. • Cost of a deletion • M→D, D→D, D→M • Each D→D might be different, each I→I will have the same cost Dj Begin Mj End
Gap penalties: evolutionary and computational considerations • Linear gap penalties: g(k) = - k d for a gap of length k and constant d • Affine gap penalties: g(k) = - [ d + (k -1) e ] where d is opening gap penalty and e an extension gap penalty.
Components of profile HMMs • Combining all parts Dj Ij Begin Mj End Figure 5.2 The transition structure of a profile HMM.
Profile HMMs as a model for multiple alignments Dj Ij Beg Mj End Example AG---C A-AG-C AG-AA- --AAAC AG---C ** *
Profile HMMs generalize pairwise alignment • Mj→M, Ij→X (insertion), Dj→Y (deletion) • eMi(a)=pyia /qa the conditional probabilities of seeing a given yi in a pairwise alignment • aMiIi= aMiDi+1=δ, aIiIi= aDiDi+1=ε
Deriving Profile HMMs From Multiple Alignments • The key idea behind profile HMMs is that we can use the same structure as shown in Figure 5.2, but set the transition and emission probabilities to capture specific information about each position in the multiple alignment of the whole family • Essentially, we want to build a model representing the consensus sequence for a family, rather than the sequence of any particular member • Non-probabilistic profiles and profile HMMs
Non-probabilistic Profiles • A model similar to the profile HMM was first introduced by Gribskov, McLachlan and Eisenberg 1987 • No underlying probabilistic model, but rather assigned position specific scores for each match state and gap penalty • The score for each consensus position is set to the average of the standard substitution scores from all the residues in the corresponding multiple sequence alignment column
HMMs from multiple alignments • Key idea behind profile HMMs • Model representing the consensus for the family • Not the sequence of any particular member HBA_HUMAN ...VGA--HAGEY... HBB_HUMAN ...V----NVDEV... MYG_PHYCA ...VEA--DVAGH... GLB3_CHITP ...VKG------D... GLB5_PETMA ...VYS--TYETS... LGB2_LUPLU ...FNA--NIPKH... GLB1_GLYDI ...IAGADNGAGV... *** ***** Figure 5.3 Ten columns from the multiple alignment of seven globin protein sequences shown in Figure 5.1 The starred columns are ones that will be treated as ‘matches’ in the profile HMM.
Non-probabilistic Profiles The score for residue ‘a’ in column 1 s(a,b) : standard substitution matrix
HMMs from multiple alignments • Non-probabilistic profiles • Gribskov, Mclachlan & Eisenberg [1987] • Score for residue a in column 1 • Disadvantages • More conserved region might be corrupted. • Intuition about the likelihood can’t be maintained. • The score for gaps do not behave as expected.
Non-probabilistic Profiles • They also set gap penalties for each column using a heuristic equation that decrease the cost of a gap according to the length of the longest gap observed in the multiple alignment spanning the column
Problem With The Approach • If we had an alignment with 100 sequences, all with a cysteine (C), at some position, the probability distribution for that column for an “average” profile would be exactly the same as would be derived from a single sequence • Doesn’t correspond to our expectation that the likelihood of a cysteine should go up as we see more confirming examples
Similar Problem With Gaps Scores for a deletion in columns 2 and 4 would be set to the same value More reasonable to set the probability of a new gap opening to be higher in column 4
Basic Profile HMM Parameterization • HMM profiles have emission and transition probabilities • Assuming that these probabilities are non-zero, a profile HMM can model any possible sequence of residues from the given alphabet • A profile HMM defines a probability distribution over the whole space of sequences • The aim of parameterization is to make this distribution peak around members of the family • Parameters: probabilities and the length of the model (control the shape of the distribution)
Model Length • The choice of length of the model corresponds more precisely to a decision on which multiple alignment columns to assign to match states, and which to assign to insert states • The consensus sequence at Figure 5.3 should only have 8 residues, and that the two non-starred residues in GLB1_GLYDI should be treated as an insertion with respect to the consensus • Should decide which columns should correspond to match states, and which to inserts • A simple rule that works well in practice is that columns that are more than half gap characters should be modeled by inserts
Probability Values k,l : indices over states akl and ek : transition and emission probabilities Akl and Ek : transition and emission frequencies
Problem With The Approach • Transitions and emissions that don’t appear in the training dataset would acquire zero probability (would never be allowed) • Solution: add pseudocounts to the observed frequencies • Simplest pseudocount method is Laplace’s rule: add one to each frequency
Searching With Profile HMMs • One of the main purposes of developing profile HMMs is to use them to detect potential membership in a family by obtaining significant matches of a sequence to the profile HMM • We assume that we are looking for global matches • We can use either the Viterbi algorithm to get the most probable alignment or the forward algorithm to calculate the full probability of the sequence summed over all possible paths
Searching with profile HMMs • Maintaining log-odd ratio compared with random model
Viterbi Equations • VjM(i) is the log-odds score of the best path matching subsequence x1…i to the submodel up to state j, ending with xibeing emitted by state Mj • VjI(i) is the score of the best path ending in xibeing emittedby Ij, and VjD(i) for the best path ending in state Dj
Viterbi Equations • In a typical case, there is no emission score eIj(xi) in the equation for VjI(i) since we assume that the emission distribution from the insert state Ij is the same as the background distribution, so the probabilities cancel in the log-odds form • Also the D→I and I→D trsnsition terms may not be present
Variants for non-global alignments • We can generalize profile HMMs for local, repeat and overlap alignments • The profile HMM for local algorithm is to specify a new model for the complete sequence x ,which incorporates the original profile HMM together with one or more copies of a simple selflooping model that is used to account for the regions of unaligned sequence
Variants for non-global alignments • Notice that as well as specifying the emission probabilities of the new states, which will normally of course be qa , we must specify a number of new transition probabilities • The looping probability on the flanking states should be close to 1, since they must account for long stretches of sequence • Let us set these to (1-η)
Variants for non-global alignments • For the transition probabilities from the left flanking state to different start points in the model, we can give them equal probabilities, η/L • Another option is to assign more probability to starting at the beginning of the model. That is the option used in HMMER package (Eddy 1996)
Variants for non-global alignments • Local alignments (flanking model) • Emission prob. in flanking states use background values qa. • Looping prob. close to 1, e.g. (1- ) for some small . Dj Ij Mj Begin End Q Q
Variants for non-global alignments • Overlap alignments • Only transitions to the first model state are allowed. • When expecting to find either present as a whole or absent • Transition to first delete state allows missing first residue Dj Q Q Ij Begin Mj End