Profile Hidden Markov Models • Mark Stamp
Hidden Markov Models • Here, we assume you know about HMMs • If not, see “A revealing introduction to hidden Markov models” • Executive summary of HMMs • HMM is a machine learning technique • Also, a discrete hill climb technique • Train model based on observation sequence • Score given sequence to see how closely it matches the model • Efficient algorithms, many useful applications
HMM Notation • Recall, an HMM is denoted λ = (A, B, π) • A is the state transition probability matrix, B the observation probability matrix, and π the initial state distribution • Observation sequence is O
Hidden Markov Models • Among the many uses for HMMs… • Speech analysis • Music search engines • Malware detection • Intrusion detection systems (IDS) • Many more, and more all the time
Limitations of HMMs • Positional information not considered • HMM has no “memory” • Higher-order models have some memory • But no explicit use of positional information • Does not handle insertions or deletions • These limitations are serious problems in some applications • In bioinformatics string comparison, sequence alignment is critical • Also, insertions and deletions occur
Profile HMM • Profile HMM (PHMM) designed to overcome limitations on previous slide • In some ways, PHMM easier than HMM • In some ways, PHMM more complex • The basic idea of PHMM • Define multiple B matrices • Almost like having an HMM for each position in sequence
PHMM • In bioinformatics, begin by aligning multiple related sequences • Multiple sequence alignment (MSA) • This is like the training phase for HMM • Generate PHMM based on given MSA • Easy, once MSA is known • Hard part is generating MSA • Then can score sequences using PHMM • Use forward algorithm, like HMM
Training: PHMM vs HMM • Training PHMM • Determining the MSA is nontrivial • Determining the PHMM matrices is trivial • Training HMM • Appending training sequences is trivial • Determining the HMM matrices is nontrivial • These are opposites… • In some sense
Generic View of PHMM • Have delete, insert, and match states • Match states correspond to HMM states • Arrows are possible transitions • Each transition has a probability • Transition probabilities are the A matrix • Emission probabilities are the B matrices • In PHMM, observations are emissions • Match and insert states have emissions
Generic View of PHMM • Circles are delete states, diamonds are insert states, squares are match states • Also, begin and end states • (State diagram omitted)
PHMM Notation • Notation used in what follows: M_i, I_i, and D_i are the match, insert, and delete states at position i • a denotes state transition probabilities, e denotes emission probabilities
PHMM • Match state probabilities easily determined from MSA • a_{M_i,M_{i+1}}: transition probabilities between match states • e_{M_i}(k): emission probability of symbol k at match state M_i • Many other transition probabilities • For example, a_{M_i,I_i} and a_{M_i,D_{i+1}} • Emissions at all match and insert states • Remember, emission == observation
Multiple Sequence Alignment • First we show MSA construction • This is the difficult part • Lots of ways to do this • “Best” way depends on specific problem • Then construct PHMM from MSA • This is the easy part • Standard algorithm for this • How to score a sequence? • Forward algorithm, similar to HMM
MSA • How to construct MSA? • Construct pairwise alignments • Combine pairwise alignments for MSA • Allow gaps to be inserted • To make better matches • Gaps tend to weaken PHMM scoring • A tradeoff between gaps and scoring
Global vs Local Alignment • In these pairwise alignment examples • “-” is a gap • “|” means elements are aligned • “*” marks omitted beginning/ending symbols • (Example alignments omitted)
Global vs Local Alignment • Global alignment is lossless • But gaps tend to proliferate • And gaps increase when we do MSA • More gaps, more random sequences match… • …and result is less useful for scoring • We usually only consider local alignment • That is, omit ends for better alignment • For simplicity, assume global alignment in examples presented here
Pairwise Alignment • Allow gaps when aligning • How to score an alignment? • Based on an n×n substitution matrix S • Where n is the number of symbols • What algorithm(s) to align sequences? • Usually, dynamic programming • Sometimes, HMM is used • Other? • Local alignment creates more issues
Pairwise Alignment • (Example alignment omitted) • Tradeoff between gaps and misaligned elements • Depends on matrix S and gap penalty
Substitution Matrix • Masquerade detection • Detect imposter using an account • Consider 4 different operations • E == send email • G == play games • C == C programming • J == Java programming • How similar are these to each other?
Substitution Matrix • Consider 4 different operations: • E, G, C, J • Possible substitution matrix (sketched below): • Diagonal matches get high positive scores • Which others are most similar? • J and C, so substituting C for J is a high score • Game playing vs programming, very different • So substituting G for C is a negative score
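To make the idea concrete, here is one plausible substitution matrix for these four operations as a small Python sketch. The specific scores are illustrative assumptions (the slide's actual matrix values are not reproduced here); only their signs and relative sizes follow the discussion above.

```python
# Hypothetical substitution scores for the operations E, G, C, J.
# Diagonal entries (exact matches) get high positive scores; C and J
# (both programming) are similar, so substituting one for the other
# scores positive; G (game playing) vs programming scores negative.
S = {
    ('E', 'E'): 9, ('G', 'G'): 9, ('C', 'C'): 9, ('J', 'J'): 9,
    ('C', 'J'): 5, ('E', 'G'): -2, ('E', 'C'): -3, ('E', 'J'): -3,
    ('G', 'C'): -4, ('G', 'J'): -4,
}

def score(a, b):
    """Symmetric lookup: substituting a for b scores the same as b for a."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]

print(score('J', 'C'))  # 5: substituting C for J is a high score
print(score('G', 'C'))  # -4: game playing vs programming, very different
```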
Substitution Matrix • Depending on the problem, it might be easy or very difficult to find a useful S matrix • Consider masquerade detection based on UNIX commands • Sometimes difficult to say how “close” two commands are • Suppose we are aligning DNA sequences • Then there is a biological rationale for closeness of symbols
Gap Penalty • Generally must allow gaps to be inserted • But gaps make alignment more generic • Less useful for scoring, so we penalize gaps • How to penalize gaps? • Linear gap penalty function: g(x) = ax (the same penalty a for every gap symbol) • Affine gap penalty function: g(x) = a + b(x - 1) • Gap-opening penalty a, and a constant penalty b for each extension of an existing gap
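A minimal sketch of these two penalty functions, with made-up values for a and b just to show the difference:

```python
def linear_gap(x, a=3):
    """Linear penalty g(x) = ax: every gap symbol costs the same amount a."""
    return a * x

def affine_gap(x, a=3, b=1):
    """Affine penalty g(x) = a + b(x - 1): opening a gap costs a,
    each extension of an existing gap costs only b."""
    return a + b * (x - 1)

# A run of 4 consecutive gap symbols:
print(linear_gap(4))  # 12
print(affine_gap(4))  # 6 -- extending an open gap is cheaper than opening one
```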
Pairwise Alignment Algorithm • We use dynamic programming • Based on the S matrix and gap penalty function • Notation: x and y are the sequences being aligned, and F(i, j) is the score of the best alignment of x_1,…,x_i with y_1,…,y_j
Pairwise Alignment DP • Initialization: F(0, 0) = 0, F(i, 0) = F(i-1, 0) - g, F(0, j) = F(0, j-1) - g • Recursion: F(i, j) = max{ F(i-1, j-1) + S(x_i, y_j), F(i-1, j) - g, F(i, j-1) - g } • where S is the substitution matrix and g is the per-symbol (linear) gap penalty
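Below is a minimal Python sketch of this dynamic program, assuming the linear-gap form of the recursion above; the toy substitution score (+3 match, -1 mismatch) and gap penalty g = 2 are illustrative assumptions, not values from the slides.

```python
def needleman_wunsch(x, y, sub, g=2):
    """Global pairwise alignment score by dynamic programming.
    sub(a, b) is the substitution score S; g is the per-symbol gap
    penalty; F[i][j] is the best score aligning x[:i] with y[:j]."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):       # initialization: x[:i] vs all gaps
        F[i][0] = F[i - 1][0] - g
    for j in range(1, n + 1):       # initialization: y[:j] vs all gaps
        F[0][j] = F[0][j - 1] - g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),  # align x_i, y_j
                F[i - 1][j] - g,                            # gap in y
                F[i][j - 1] - g,                            # gap in x
            )
    return F[m][n]

# Toy scores: +3 for a match, -1 for a mismatch, gap penalty 2.
print(needleman_wunsch("CJGEC", "CJEC", lambda a, b: 3 if a == b else -1))
# 10: four matched symbols (+12), one gap (-2)
```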
MSA from Pairwise Alignments • Given pairwise alignments… • How to construct MSA? • Generally use “progressive alignment” • Select one pairwise alignment • Select another and combine with first • Continue to add more until all are combined • Relatively easy (good) • Gaps proliferate, and it’s unstable (bad)
MSA from Pairwise Alignments • Lots of ways to improve on generic progressive alignment • Here, we mention one such approach • Not necessarily “best” or most popular • Feng-Doolittle progressive alignment • Compute scores for all pairs of n sequences • Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores • Then generate a minimum spanning tree • For MSA, add sequences in the order that they appear in the spanning tree
MSA Construction • Create pairwise alignments • Generate substitution matrix • Dynamic programming for pairwise alignments • Use pairwise alignments to make MSA • Use pairwise alignments to construct spanning tree (e.g., by Prim's algorithm; see the sketch after the example below) • Add sequences to MSA in spanning tree order (from highest score, inserting gaps as needed) • Note: gap penalty is used
MSA Example • Suppose 10 sequences, with the following pairwise alignment scores • (Score table omitted)
MSA Example: Spanning Tree • Spanning tree based on scores • So, process pairs in the following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
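A sketch of how such an order could be derived with Prim's algorithm, as mentioned on the construction slide. The score dictionary below is made up (the slide's 10-sequence score table did not survive transcription), and negating the scores turns Prim's usual minimum spanning tree into the maximum-score tree wanted here.

```python
import heapq

def prim_order(scores, start):
    """Order sequence pairs by a maximum spanning tree over pairwise scores.
    scores maps frozenset({i, j}) -> alignment score; negating scores
    makes Prim's (minimum-tree) algorithm pick high-scoring edges first."""
    nodes = {n for pair in scores for n in pair}
    visited = {start}
    heap = [(-s, tuple(sorted(p))) for p, s in scores.items() if start in p]
    heapq.heapify(heap)
    order = []
    while heap and len(visited) < len(nodes):
        neg_s, (i, j) = heapq.heappop(heap)
        new = j if i in visited else i
        if new in visited:          # stale edge: both ends already in tree
            continue
        visited.add(new)
        order.append((i, j))
        for p, s in scores.items():  # add edges leaving the grown tree
            if new in p and not p <= visited:
                heapq.heappush(heap, (-s, tuple(sorted(p))))
    return order

# Hypothetical scores for 4 sequences (the slide's table had 10):
scores = {frozenset(p): s for p, s in
          {(1, 2): 7, (1, 3): 2, (2, 3): 8, (2, 4): 5, (3, 4): 1}.items()}
print(prim_order(scores, start=2))  # [(2, 3), (1, 2), (2, 4)]
```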
MSA Snapshot • Intermediate step and final result (alignments omitted) • Use “+” for a neutral symbol • Then “-” for gaps in the MSA • Note the increase in gaps
PHMM from MSA • In PHMM, determine match and insert states & probabilities from MSA • “Conservative” columns become match states • Half or fewer of the symbols are gaps • Other columns become insert states • That is, a majority of the symbols are gaps • Delete states are a separate issue
PHMM States from MSA • Consider a simpler MSA (figure omitted) • Columns 1, 2, 6 are match states 1, 2, 3, respectively • Since less than half of their symbols are gaps • Columns 3, 4, 5 are combined to form insert state 2 • Since more than half of their symbols are gaps • This insert state lies between match states 2 and 3 • A sketch of the column classification appears below
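The sketch below classifies MSA columns this way, assuming the MSA is given as a list of equal-length strings with “-” for gaps; the toy MSA is constructed to reproduce the pattern described above.

```python
def classify_columns(msa):
    """Label each MSA column 'M' (match state) if half or fewer of its
    symbols are gaps, and 'I' (insert state) otherwise."""
    n_rows = len(msa)
    labels = []
    for col in zip(*msa):            # iterate over columns
        gaps = col.count('-')
        labels.append('M' if gaps <= n_rows / 2 else 'I')
    return labels

# Toy MSA: columns 1, 2, 6 become match states 1, 2, 3, and the
# gap-heavy columns 3, 4, 5 together form insert state 2.
msa = [
    "EG---C",
    "EGJ--C",
    "E----C",
    "EG---C",
    "-GJE-C",
]
print(classify_columns(msa))  # ['M', 'M', 'I', 'I', 'I', 'M']
```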
Probabilities from MSA • Emission probabilities • Based on symbol distribution in match and insert states • State transition probabilities • Based on transitions in the MSA
Probabilities from MSA • Emission probabilities (computation omitted) • But 0 probabilities are bad • Model “overfits” the data • So, use the “add one” rule • Add one to each numerator, and add the total number of symbols to each denominator
Probabilities from MSA • More emission probabilities (computation omitted) • But 0 probabilities are still bad • Model “overfits” the data • Again, use the “add one” rule • Add one to each numerator, and add the total number of symbols to each denominator • A worked sketch appears below
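A minimal sketch of the add-one rule for one match state, assuming the four symbols E, G, C, J from the earlier example; the column of observations is made up.

```python
SYMBOLS = "EGCJ"

def emission_probs(column):
    """Add-one smoothed emission probabilities for one match state:
    add 1 to each symbol's count, and add the number of distinct
    symbols to the denominator, so no probability is exactly 0."""
    counts = {s: column.count(s) for s in SYMBOLS}
    total = sum(counts.values())    # gaps in the column are not emitted
    return {s: (counts[s] + 1) / (total + len(SYMBOLS)) for s in SYMBOLS}

# Hypothetical match-state column (gaps already removed):
print(emission_probs("EEGE"))
# {'E': 0.5, 'G': 0.25, 'C': 0.125, 'J': 0.125}
```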
Probabilities from MSA • Transition probabilities (computation omitted) • We look at some examples • Note that “-” is the delete state • First, consider the begin state • Again, use the add-one rule
Probabilities from MSA • Transition probabilities • When there is no information in the MSA, set the probabilities to uniform • For example, I_1 does not appear in the MSA, so a_{I_1,M_2} = a_{I_1,I_1} = a_{I_1,D_2} = 1/3
Probabilities from MSA • Transition probabilities, another example • What about transitions from state D_1? • Can only go to M_2, so a_{D_1,M_2} = 1 • Again, use the add-one rule: a_{D_1,M_2} = 2/4, a_{D_1,I_1} = 1/4, a_{D_1,D_2} = 1/4
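The same smoothing applies to transitions. A minimal sketch using the D_1 example above: from any state there are three possible successors (next match state, current insert state, next delete state), and the raw counts come from the MSA.

```python
def transition_probs(counts):
    """Add-one smoothed transition probabilities out of one state.
    counts maps each of the three possible successor states to the
    number of times that transition is observed in the MSA."""
    total = sum(counts.values())
    k = len(counts)                 # three successors
    return {s: (c + 1) / (total + k) for s, c in counts.items()}

# From D1 the MSA shows exactly one transition, to M2:
print(transition_probs({'M2': 1, 'I1': 0, 'D2': 0}))
# {'M2': 0.5, 'I1': 0.25, 'D2': 0.25} -- the add-one values above
```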
PHMM Emission Probabilities • Emission probabilities for the given MSA • Using the add-one rule • (Table omitted)
PHMM Transition Probabilities • Transition probabilities for the given MSA • Using the add-one rule • (Table omitted)
PHMM Summary • Construct pairwise alignments • Usually, use dynamic programming • Use these to construct MSA • Lots of ways to do this • Using MSA, determine probabilities • Emission probabilities • State transition probabilities • Then we have trained a PHMM • Now what???
PHMM Scoring • Want to score sequences to see how closely they match the PHMM • How did we score using HMM? • Forward algorithm • How to score sequences with PHMM? • Forward algorithm (surprised?) • But, algorithm is a little more complex • Due to complex state transitions
Forward Algorithm • Notation • Index j refers to states (columns in the MSA); index i indexes the observation sequence • x_i is the ith observation symbol • q_{x_i} is the probability of x_i in the “random model” • Base case is F^M_0(0) = 0 • F^M_j(i) is the score of x_1,…,x_i up to state j (note that in PHMM, i and j may not agree) • Some states are undefined • Undefined states are ignored in the calculation
Forward Algorithm • Compute P(X | λ) recursively • For match states: F^M_j(i) = log(e_{M_j}(x_i) / q_{x_i}) + log[ a_{M_{j-1},M_j} e^{F^M_{j-1}(i-1)} + a_{I_{j-1},M_j} e^{F^I_{j-1}(i-1)} + a_{D_{j-1},M_j} e^{F^D_{j-1}(i-1)} ] • with analogous recursions for the insert scores F^I_j(i) and delete scores F^D_j(i) • Note that F^M_j(i) depends on F^M_{j-1}(i-1), F^I_{j-1}(i-1), and F^D_{j-1}(i-1) • And the corresponding state transition probabilities
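A compact Python sketch of these forward recursions in log space. The state naming, the dictionary layout of the model, and the simplified end-state handling are my own conventions for illustration; the recursions themselves follow the match/insert/delete form above.

```python
import math

LOG0 = float("-inf")  # log of probability 0; marks undefined states

def log(p):
    return math.log(p) if p > 0 else LOG0

def logsum(*terms):
    """log(sum of e^t): undefined (-inf) terms are simply ignored."""
    terms = [t for t in terms if t > LOG0]
    if not terms:
        return LOG0
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def phmm_forward(x, T, a, eM, eI, q):
    """Forward-algorithm score of sequence x against a PHMM with match
    states 1..T. a[(s, t)]: transition probs for states named 'M0'
    (begin), 'M1'..'MT', 'I0'..'IT', 'D1'..'DT'; eM[(j, k)], eI[(j, k)]:
    emission probs; q[k]: the "random model". Missing entries are 0."""
    n = len(x)
    M = [[LOG0] * (n + 1) for _ in range(T + 1)]
    I = [[LOG0] * (n + 1) for _ in range(T + 1)]
    D = [[LOG0] * (n + 1) for _ in range(T + 1)]
    M[0][0] = 0.0  # base case: F^M_0(0) = 0

    for j in range(T + 1):
        for i in range(n + 1):
            if i > 0 and j > 0:  # match state M_j emits x_i
                M[j][i] = log(eM.get((j, x[i-1]), 0) / q[x[i-1]]) + logsum(
                    log(a.get((f"M{j-1}", f"M{j}"), 0)) + M[j-1][i-1],
                    log(a.get((f"I{j-1}", f"M{j}"), 0)) + I[j-1][i-1],
                    log(a.get((f"D{j-1}", f"M{j}"), 0)) + D[j-1][i-1])
            if i > 0:            # insert state I_j emits x_i
                I[j][i] = log(eI.get((j, x[i-1]), 0) / q[x[i-1]]) + logsum(
                    log(a.get((f"M{j}", f"I{j}"), 0)) + M[j][i-1],
                    log(a.get((f"I{j}", f"I{j}"), 0)) + I[j][i-1],
                    log(a.get((f"D{j}", f"I{j}"), 0)) + D[j][i-1])
            if j > 0:            # delete state D_j emits nothing
                D[j][i] = logsum(
                    log(a.get((f"M{j-1}", f"D{j}"), 0)) + M[j-1][i],
                    log(a.get((f"I{j-1}", f"D{j}"), 0)) + I[j-1][i],
                    log(a.get((f"D{j-1}", f"D{j}"), 0)) + D[j-1][i])
    # simplified ending: sum over states at the last column that have
    # consumed all of x (explicit end-state transitions omitted)
    return logsum(M[T][n], I[T][n], D[T][n])
```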
PHMM • We will see examples of PHMM later • In particular, • Malware detection based on opcodes • Masquerade detection based on UNIX commands
References • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998 • L. Huang and M. Stamp, Masquerade detection using profile hidden Markov models, Computers & Security, 30(8):732-747, 2011 • S. Attaluri, S. McGhee, and M. Stamp, Profile hidden Markov models for metamorphic virus detection, Journal in Computer Virology, 5(2):151-169, 2009