Profile HMMs for sequence families and Viterbi equations
Linda Muselaars and Miranda Stobbe
Example alignment
HBA_HUMAN   -HGSAQVKGHGKKVADALTNAVAHV-
HBB_HUMAN   VMGNPKVKAHGKKVLGAFSDGLAHL-
MYG_PHYCA   MKASEDLKKHGVTVLTALGAILKK--
GLB3_CHITP  IKGTAPFETHANRIVGFFSKIIGEL-
GLB5_PETMA  LKKSADVRWHAERIINAVNDAVASM-
LGB2_LUPLU  PQNNPELQAHAGKVFKLVYEAAIQLQ
GLB1_GLYDI  ---DPGVAALGAKVLAQIGVAVSHL-
Overview chapter 5
• Ungapped score matrices.
• Adding insert and delete states to obtain profile HMMs.
• Deriving profile HMMs from multiple alignments.
• Searching with profile HMMs.
• Profile HMM variants for non-global alignments.
• More on estimation of probabilities.
• Optimal model construction.
• Weighting training sequences.
Key issues
• Identifying the relationship of an individual sequence to a sequence family.
• How to build a profile HMM.
• Use profile HMMs to detect potential membership in a family.
• Use profile HMMs to give an alignment of a sequence to the family.
Key issues (2)
• Lollipops for a valuable contribution to this lecture (what counts as valuable is up to the speakers to decide).
Needed theory
• Emission probabilities.
• Silent states.
• Pair HMMs.
• The Viterbi algorithm.
• The Forward algorithm.
Contents
• Ungapped score matrices.
• Adding insert and delete states to obtain profile HMMs.
• Deriving profile HMMs from multiple alignments.
  • Non-probabilistic profiles.
  • Basic profile HMM parameterisation.
• Searching with profile HMMs.
• Profile HMM variants for non-global alignments.
Example alignment
HBA_HUMAN   -HGSAQVKGHGKKVADALTNAVAHV-
HBB_HUMAN   VMGNPKVKAHGKKVLGAFSDGLAHL-
MYG_PHYCA   MKASEDLKKHGVTVLTALGAILKK--
GLB3_CHITP  IKGTAPFETHANRIVGFFSKIIGEL-
GLB5_PETMA  LKKSADVRWHAERIINAVNDAVASM-
LGB2_LUPLU  PQNNPELQAHAGKVFKLVYEAAIQLQ
GLB1_GLYDI  ---DPGVAALGAKVLAQIGVAVSHL-
               *********************
Ungapped regions
• Gaps tend to line up.
• We can consider models for ungapped regions.
• Specify independent probabilities e_i(a) of observing residue a in position i.
• But of course: score with a log-odds ratio!
• The result is a position-specific score matrix.
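As an illustration of the position-specific score matrix idea, here is a minimal Python sketch that scores an ungapped sequence as the sum of log(e_i(a) / q_a) over positions. The function name `pssm_log_odds`, the two-position DNA profile and the uniform background q_a are assumptions made up for this example; they do not come from the slides.

```python
import math

def pssm_log_odds(seq, emissions, background):
    """Sum of per-position log-odds scores log(e_i(a) / q_a) for an ungapped match."""
    if len(seq) != len(emissions):
        raise ValueError("sequence length must equal the number of profile positions")
    return sum(math.log(e[a] / background[a]) for a, e in zip(seq, emissions))

# Toy two-position DNA profile; all numbers are hypothetical.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
emissions = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},  # position 1 prefers A
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},  # position 2 prefers C
]

print(pssm_log_odds("AC", emissions, background))  # positive: fits the profile
print(pssm_log_odds("GT", emissions, background))  # negative: worse than background
```

A positive score means the sequence is better explained by the profile than by the background model, which is why the log-odds form is preferred over the raw probability.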
Drawbacks
• Multiple alignments do have gaps.
• These need to be accounted for.
• One approach, used for example with the BLOCKS database: combine the scores of several ungapped regions.
• We will develop a single probabilistic model for the whole extent of the alignment.
Short review
• Emission probabilities: the probability that a certain symbol is seen when in a certain state k.
• Silent states: states in an HMM that do not emit symbols.
Building the model (1)
• We need position-sensitive gap scores.
• HMM with a repetitive structure of (match) states M_j.
• Transitions of probability 1.
• Emission probabilities: e_{M_i}(a).
[Figure: Begin → ... → M_j → ... → End]
Building the model (2)
• Deal with insertions: a set of new states I_i.
• The I_i have emission distribution e_{I_i}(a).
• Set to the background distribution q_a.
[Figure: Begin, M_j and End, with insert states I_j added]
Building the model (3)
• Deal with deletions.
• One possibility: forward jumps.
• For arbitrarily long gaps: silent states D_j.
[Figure: Begin, M_j and End, with delete states D_j added]
Costs for additional states
• States for insertions: the sum of the costs of the transitions and emissions (one M→I, a number of I→I, one I→M).
• States for deletions: the sum of the costs of an M→D transition, a number of D→D transitions and a D→M transition.
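The two sums above can be sketched in Python as log-probabilities of the transitions involved. This is a minimal sketch, assuming that insert states emit the background distribution (so their emissions add nothing to a log-odds score) and that the I→I and D→D probabilities are the same along the whole gap; the function names and the numeric values are hypothetical.

```python
import math

def insertion_log_score(k, a_mi, a_ii, a_im):
    """Transition log-score of an insertion of k >= 1 residues:
    one M->I, (k - 1) I->I and one I->M transition."""
    if k < 1:
        raise ValueError("an insertion has at least one residue")
    return math.log(a_mi) + (k - 1) * math.log(a_ii) + math.log(a_im)

def deletion_log_score(k, a_md, a_dd, a_dm):
    """Transition log-score of skipping k >= 1 match states:
    one M->D, (k - 1) D->D and one D->M transition."""
    if k < 1:
        raise ValueError("a deletion skips at least one match state")
    return math.log(a_md) + (k - 1) * math.log(a_dd) + math.log(a_dm)

# Hypothetical transition probabilities, for illustration only.
for k in (1, 2, 3):
    print(k, insertion_log_score(k, a_mi=0.1, a_ii=0.5, a_im=0.4))
```

Under these assumptions each extra inserted residue adds the same amount, log a_II, so the score behaves like an affine gap penalty; in a real profile HMM the probabilities differ per position.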
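Full model
[Figure: full profile HMM with Begin, match states M_j, insert states I_j, delete states D_j and End]
Comparison with pair HMM
[Figure: pair HMM with Begin, match state M (emission p_{x_i y_j}), insert states X (emission q_{x_i}) and Y (emission q_{y_j}), and End]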
Non-probabilistic profiles
• Profiles with the structure of a profile HMM, but without an underlying probabilistic model.
• Set scores to averages of standard substitution scores.
• Anomalies:
  • Conservation of columns is not taken into account.
  • Scores for gaps do not behave properly.
Example
HBA_HUMAN   ...VGA--HAGEY...
HBB_HUMAN   ...V----NVDEV...
MYG_PHYCA   ...VEA--DVAGH...
GLB3_CHITP  ...VKG------D...
GLB5_PETMA  ...VYS--TYETS...
LGB2_LUPLU  ...FNA--NIPKH...
GLB1_GLYDI  ...IAGADNGAGV...
               ***  *****
• Column 1 contains five V, one F and one I, so the score for residue a in column 1 would be set to: $\frac{5}{7}s(V,a) + \frac{1}{7}s(F,a) + \frac{1}{7}s(I,a)$, where $s$ is the standard substitution score.
Basic profile HMM parameterisation
• Objective: make the probability distribution peak around members of the family.
• Available parameters:
  • Length of the model.
  • Transition and emission probabilities.
Length of the model
• Which multiple alignment columns do we assign to match states?
• And which to insert states?
• Heuristic rule: columns in which more than 50% of the characters are gaps should be modelled by insert states.
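The heuristic rule can be sketched in a few lines of Python. The example below applies it to the seven-sequence fragment from the non-probabilistic profile example; the function name `match_columns` and its parameters are illustrative choices, not part of the lecture.

```python
def match_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Return the 0-based indices of columns assigned to match states.

    A column in which more than max_gap_fraction of the characters are gaps
    is left to the insert states; every other column becomes a match state.
    """
    n_rows = len(alignment)
    return [
        j
        for j, column in enumerate(zip(*alignment))
        if column.count(gap) / n_rows <= max_gap_fraction
    ]

fragment = [
    "VGA--HAGEY",  # HBA_HUMAN
    "V----NVDEV",  # HBB_HUMAN
    "VEA--DVAGH",  # MYG_PHYCA
    "VKG------D",  # GLB3_CHITP
    "VYS--TYETS",  # GLB5_PETMA
    "FNA--NIPKH",  # LGB2_LUPLU
    "IAGADNGAGV",  # GLB1_GLYDI
]

print(match_columns(fragment))  # [0, 1, 2, 5, 6, 7, 8, 9]
```

Columns 3 and 4 (0-based) contain gaps in six of the seven sequences, so they are handled by an insert state, giving a model of length 8 for this fragment.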
Probability parameters
• Transition probability: $a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$, the number of transitions from state k to state l divided by the number of transitions from state k to any state.
• Emission probability: $e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$, with $E_k(a)$ the number of times symbol a is emitted in state k.
• In the limit of many training sequences, this is an accurate and consistent estimate.
• Pseudocount method: Laplace's rule (add one to each count).
Example
Example continued
[Figure: profile HMM with Begin, match states M1 to M4, insert states I0 to I4, delete states D1 to D4 and End]
• Emission distributions shown for the match states:
  • A 5/8, C 1/8, G 1/8, T 1/8
  • A 3/7, C 1/7, G 2/7, T 1/7
  • A 1/8, C 5/8, G 1/8, T 1/8
  • A 1/7, C 1/7, G 4/7, T 1/7
• Transition probabilities out of M1: a_{M1M2} = 4/7, a_{M1D2} = 2/7, a_{M1I1} = 1/7.
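The transition probabilities out of M1 in this example are consistent with Laplace's rule applied to small counts. The sketch below shows the arithmetic with exact fractions; the observed counts (3, 1 and 0 transitions, and 4 emissions of A) are assumptions chosen to reproduce the values on the slide, since the underlying alignment is not part of this transcript.

```python
from fractions import Fraction

def laplace_estimate(counts):
    """Add one pseudocount to every outcome, then normalise (Laplace's rule)."""
    total = sum(counts.values()) + len(counts)
    return {outcome: Fraction(c + 1, total) for outcome, c in counts.items()}

# Assumed observed transition counts out of M1 (hypothetical, see lead-in).
transitions_from_m1 = laplace_estimate({"M2": 3, "D2": 1, "I1": 0})
print(transitions_from_m1)  # M2: 4/7, D2: 2/7, I1: 1/7

# Assumed observed emission counts in one match state (hypothetical).
emissions = laplace_estimate({"A": 4, "C": 0, "G": 0, "T": 0})
print(emissions)  # A: 5/8, C: 1/8, G: 1/8, T: 1/8
```

The pseudocounts keep unseen events, such as the M1→I1 transition here, from getting probability zero.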
Searching with profile HMMs
• Obtaining significant matches of a sequence to the profile HMM:
  • Viterbi algorithm: P(x, π* | M).
  • Forward algorithm: P(x | M).
• Give an alignment of a sequence to the family.
  • Highest scoring, or Viterbi, alignment.
Viterbi equations
• Log-odds score of the best path matching subsequence x_{1…i} to the submodel up to state j, ending with x_i being emitted by state M_j: V^M_j(i).
• Log-odds score of the best path ending in x_i being emitted by I_j: V^I_j(i).
• The best path ending in state D_j: V^D_j(i).
• Pair HMM:
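For reference, the three scores described above are usually written as the following recurrences. This is the standard textbook form rather than a copy of the slide: $V^M_j(i)$, $V^I_j(i)$ and $V^D_j(i)$ are the log-odds scores of the best paths ending in $M_j$, $I_j$ and $D_j$, $a$ denotes transition probabilities and $q$ the background distribution. The emission term for $I_j$ vanishes when the insert states emit the background distribution.

```latex
\begin{align*}
V^{M}_{j}(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max
\begin{cases}
V^{M}_{j-1}(i-1) + \log a_{M_{j-1}M_j} \\
V^{I}_{j-1}(i-1) + \log a_{I_{j-1}M_j} \\
V^{D}_{j-1}(i-1) + \log a_{D_{j-1}M_j}
\end{cases} \\
V^{I}_{j}(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max
\begin{cases}
V^{M}_{j}(i-1) + \log a_{M_j I_j} \\
V^{I}_{j}(i-1) + \log a_{I_j I_j} \\
V^{D}_{j}(i-1) + \log a_{D_j I_j}
\end{cases} \\
V^{D}_{j}(i) &= \max
\begin{cases}
V^{M}_{j-1}(i) + \log a_{M_{j-1}D_j} \\
V^{I}_{j-1}(i) + \log a_{I_{j-1}D_j} \\
V^{D}_{j-1}(i) + \log a_{D_{j-1}D_j}
\end{cases}
\end{align*}
```

Match and insert states consume one residue ($i-1 \to i$), while the silent delete state only advances the model position ($j-1 \to j$).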
Viterbi equations
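A minimal Python sketch of the Viterbi recurrences for a profile HMM, in log-odds form. It makes two simplifying assumptions that are not in the lecture: insert states emit the background distribution (so their emission log-odds are zero), and the transition probabilities depend only on the state types (M, I, D), not on the position j. The function names and the toy parameters are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_t(trans, s, t):
    """Log transition probability between state types, or -inf if not allowed."""
    p = trans.get((s, t), 0.0)
    return math.log(p) if p > 0 else NEG_INF

def viterbi_log_odds(x, match_emit, background, trans):
    """Log-odds score of the best path of sequence x through the profile HMM.

    match_emit[j - 1][a] is e_{M_j}(a). Assumption: trans holds one probability
    per pair of state types, shared by all positions.
    """
    L, n = len(match_emit), len(x)
    VM = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VD = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    VM[0][0] = 0.0  # initialisation: Begin is treated as M_0

    for j in range(L + 1):
        for i in range(n + 1):
            if j > 0 and i > 0:  # x_i emitted by M_j
                a = x[i - 1]
                VM[j][i] = math.log(match_emit[j - 1][a] / background[a]) + max(
                    VM[j - 1][i - 1] + log_t(trans, "M", "M"),
                    VI[j - 1][i - 1] + log_t(trans, "I", "M"),
                    VD[j - 1][i - 1] + log_t(trans, "D", "M"),
                )
            if i > 0:  # x_i emitted by I_j (emission log-odds is zero)
                VI[j][i] = max(
                    VM[j][i - 1] + log_t(trans, "M", "I"),
                    VI[j][i - 1] + log_t(trans, "I", "I"),
                    VD[j][i - 1] + log_t(trans, "D", "I"),
                )
            if j > 0:  # silent state D_j, no residue consumed
                VD[j][i] = max(
                    VM[j - 1][i] + log_t(trans, "M", "D"),
                    VI[j - 1][i] + log_t(trans, "I", "D"),
                    VD[j - 1][i] + log_t(trans, "D", "D"),
                )
    # termination: End is treated as M_{L+1}
    return max(
        VM[L][n] + log_t(trans, "M", "M"),
        VI[L][n] + log_t(trans, "I", "M"),
        VD[L][n] + log_t(trans, "D", "M"),
    )

# Toy model with two match states; all numbers are hypothetical.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
match_emit = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
trans = {
    ("M", "M"): 0.8, ("M", "I"): 0.1, ("M", "D"): 0.1,
    ("I", "M"): 0.4, ("I", "I"): 0.5, ("I", "D"): 0.1,
    ("D", "M"): 0.8, ("D", "I"): 0.1, ("D", "D"): 0.1,
}
print(viterbi_log_odds("AC", match_emit, background, trans))
print(viterbi_log_odds("GT", match_emit, background, trans))
```

Keeping three separate tables mirrors the three recurrences; a full implementation would also store back-pointers so that the Viterbi alignment itself can be recovered.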
Forward algorithm
Initialisation and termination
• Viterbi algorithm:
  • Initialisation:
  • Termination:
• Forward algorithm:
  • Initialisation:
  • Termination:
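The forward algorithm uses the same initialisation and termination as Viterbi but sums over paths instead of taking the maximum. The Python sketch below follows the usual convention of treating Begin as M_0 (score 0 at position 0) and End as M_{L+1}. As assumptions not taken from the lecture, insert states emit the background distribution and transition probabilities depend only on the state types; all names and numbers are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_t(trans, s, t):
    """Log transition probability between state types, or -inf if not allowed."""
    p = trans.get((s, t), 0.0)
    return math.log(p) if p > 0 else NEG_INF

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_odds(x, match_emit, background, trans):
    """Log-odds score log(P(x | M) / P(x | background)), summed over all paths."""
    L, n = len(match_emit), len(x)
    FM = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    FI = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    FD = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    FM[0][0] = 0.0  # initialisation: Begin is treated as M_0

    def combine(m, i_, d, target):
        # log of the summed contributions of the three predecessor states
        return logsumexp([
            m + log_t(trans, "M", target),
            i_ + log_t(trans, "I", target),
            d + log_t(trans, "D", target),
        ])

    for j in range(L + 1):
        for i in range(n + 1):
            if j > 0 and i > 0:
                a = x[i - 1]
                FM[j][i] = math.log(match_emit[j - 1][a] / background[a]) + combine(
                    FM[j - 1][i - 1], FI[j - 1][i - 1], FD[j - 1][i - 1], "M")
            if i > 0:
                FI[j][i] = combine(FM[j][i - 1], FI[j][i - 1], FD[j][i - 1], "I")
            if j > 0:
                FD[j][i] = combine(FM[j - 1][i], FI[j - 1][i], FD[j - 1][i], "D")
    # termination: End is treated as M_{L+1}
    return combine(FM[L][n], FI[L][n], FD[L][n], "M")

# Toy model with two match states; all numbers are hypothetical.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
match_emit = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
trans = {
    ("M", "M"): 0.8, ("M", "I"): 0.1, ("M", "D"): 0.1,
    ("I", "M"): 0.4, ("I", "I"): 0.5, ("I", "D"): 0.1,
    ("D", "M"): 0.8, ("D", "I"): 0.1, ("D", "D"): 0.1,
}
print(forward_log_odds("AC", match_emit, background, trans))
```

Because the forward score adds up every path, it is never lower than the score of any single path, including the Viterbi path.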
Alternative to log-odds scoring
• Log likelihood score (LL score).
• Strongly length dependent.
• Solutions:
  • Divide by sequence length.
  • Z-score.
• Which method is preferred?
Demo
Part of the profile HMM
Scoring
Part of the multiple alignment
Relative frequencies
Flanking model states
• Used to model the flanking sequences to the actual profile match itself.
• Extra probabilities needed:
  • Emission probability: q_a.
  • ‘Looping’ transition probability: (1 - η).
  • Transition probability from left flanking state: depends on application.
Model for local alignment (Smith-Waterman style)
[Figure: profile HMM (Begin, M_j, I_j, D_j, End) with two flanking states Q between an outer Begin and End]
Model for overlap matches
[Figure: profile HMM (Begin, M_j, I_j, D_j, End) with two flanking states Q]
Model for repeat matches
[Figure: profile HMM (Begin, M_j, I_j, D_j, End) with one flanking state Q between an outer Begin and End]
Summary
• Construction of a profile HMM for different kinds of alignments.
• Use profile HMMs to detect potential membership in a family.
• Use profile HMMs to give an alignment of a sequence to the family.
Discussion subject
• BLAST versus profile HMM.