Hidden Markov Models -- Introduction
Introduction • All prior models of speech are nonparametric and non-statistical. • Hence estimates of variables are uninformed by the relative deviations of the models. • Hidden Markov Models • An attempt to reproduce the statistical fluctuation in speech across a small utterance • An attempt which has a well-motivated training theory
This Lecture • What is a Hidden Markov Model • What are the various types of estimation procedures used • How does one optimize the performance of a Hidden Markov Model • How can the model be extended to more general cases of models
Agenda • Markov Chains -- How to estimate probabilities • Hidden Markov Models • definition • how to identify • how to choose parameters • how to optimize parameters to produce the best models • Types of Hidden Markov Models
Agenda II • Next Lecture • Different types of Hidden Markov Models • Distinct implementation details
Overview • Techniques for choosing Hidden Markov Models and estimating their parameters • Related to the dynamic programming already covered • Quantities are defined recursively • Key difference • Can estimate true probabilities and, in effect, variances and weights • Estimation is surprisingly fast
Vocabulary • Hidden Markov Model • Much more below, but in brief a doubly stochastic model: the underlying states form a Markov chain, and the outputs are produced by a random process that depends on the state. • Alpha Terminal, Beta Terminal • Alpha terminal: the probability of the initial portion of an observation sequence together with ending in a particular state. • Beta terminal: the probability of the terminal portion of an observation sequence given that it starts in state s.
Vocabulary II • Maximum Likelihood Estimation • Choosing the parameters of the model so that the probability of the observation sequence is maximized • The classical principle of statistical inference; other methods are benchmarked against MLE • Sufficient Statistics • Functions of the input data which bear on the parametric form of the distribution • If you know the sufficient statistics, you know everything that the data can provide about the unknown parameters
Vocabulary III • Jensen’s Inequality • For convex functions f and any probability distribution • E(f(X)) >= f(E(X)), e.g. E(X*X) >= E(X)*E(X) • For concave functions the inequality reverses • E(log(X)) <= log(E(X))
Hidden Markov Models • Introduction to the basic properties of discrete Markov Chains, their relationship to Hidden Markov Models • Definition of a Hidden Markov Model • Their use in discrete word recognition • Techniques to Evaluate and Train Discrete Hidden Markov Models
Stationary Markov Chains -- The Weather Model • [Figure: state diagram with states Rainy, Sunny, Snowy, Cloudy and transition probabilities a11, a12, a21, a22, a23, a32, a33, a34, a43, a44] • Where ajk is the probability of changing from weather state j to weather state k.
Facts About the Weather Model • As drawn the model is recurrent, i.e. any state can connect to any other; this structure is an assumption of the model. • Transition probabilities are “directly observable” in the sense that one can average the number of transitions of an observed type from a given observed state. • For example, one can calculate the average number of times that it rains in the next epoch given that it is cloudy now.
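The counting estimate described above can be sketched in a few lines of Python. The weather sequence below is invented for illustration and the function name `estimate_transitions` is hypothetical; the only idea taken from the slide is that a transition probability is the fraction of observed departures from one state that land in another.

```python
from collections import Counter

def estimate_transitions(states):
    """Estimate a[j][k] = P(next = k | current = j) by counting observed transitions."""
    pair_counts = Counter(zip(states, states[1:]))  # (from, to) -> count
    from_counts = Counter(states[:-1])              # from -> number of departures
    labels = sorted(set(states))
    return {
        j: {k: pair_counts[(j, k)] / from_counts[j] for k in labels}
        for j in labels
        if from_counts[j] > 0  # a state seen only at the very end has no departures
    }

# Hypothetical daily observations, invented for illustration.
weather = ["Cloudy", "Rainy", "Rainy", "Sunny", "Cloudy",
           "Rainy", "Sunny", "Sunny", "Cloudy", "Cloudy"]
a = estimate_transitions(weather)
print(a["Cloudy"]["Rainy"])  # fraction of cloudy days followed by a rainy day
```

With longer records the same counts converge to the true transition probabilities, which is what makes a plain Markov chain easy to train compared with the hidden case.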
Rigorous Definition • Markov Chain • Consists of a set of states v1…vn. At regular fixed intervals of time the system transfers from its state at time t, qt, to its state at time t+1, qt+1. • Furthermore, $P(q_{t+1} = v_j \mid q_t = v_i, q_{t-1}, \dots, q_1) = P(q_{t+1} = v_j \mid q_t = v_i) = a_{ij}$ • Only one step of memory is used for the transition probabilities
Hidden Markov Model vs. Markov Chain • Markov chains have entirely observable states. However, a “Hidden Markov Model” is a model of a Markov source which emits an element at each time slot depending upon the state. The states are not directly observed. • For instance...
Markov Chain and Urn Model • [Figure: urns URN 1, …, URN N-1, URN N, corresponding to states q1, …, qn-1, qn, each with its own color probabilities P(R), P(G), P(B)] • Suppose states are hidden • Consider urn model • Colored balls in each urn • Observer sees only the balls selected out of each slot
Operation of the Model • I. Step I • One is in a state corresponding to an urn qi • II. Step II • Select a colored ball at random out of this urn. The observer sees the ball, which is then replaced. • III. Step III • Flip a biased die, or choose a special ball out of another urn corresponding to the one selected, to decide the next urn. Then replace the ball. • Note • The observer only sees a sequence of colors
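Formal Definition • A Hidden Markov Model is a triple (A,B,P) where • $A = \{a_{ij}\}$, $a_{ij} = P(q_{t+1} = j \mid q_t = i)$, is the matrix of state transition probabilities • $B = \{b_j(k)\}$, $b_j(k) = P(O_t = v_k \mid q_t = j)$, gives the output probabilities in each state • $P = \{p_i\}$, $p_i = P(q_1 = i)$, is the starting distribution • Outputs are generated in the following manner
Output Generation • 1. Choose an initial state q1 in accord with the starting distribution P • 2. Set t=1 • 3. Choose Ot in accord with B, i.e. the output distribution $b_{q_t}(\cdot)$ of the current state • 4. Choose qt+1 in accord with A, i.e. the transition distribution $a_{q_t j}$ out of the current state • 5. Set t=t+1 and return to 3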
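The five generation steps can be sketched as a small sampler. This is a minimal illustration, assuming states and output symbols are coded as integers and that A, B and P are given as nested lists; the function name `sample_hmm` and the numbers in the toy model are hypothetical.

```python
import random

def sample_hmm(A, B, P, T, rng):
    """Generate T outputs from the HMM (A, B, P); returns (states, outputs)."""
    n_states, n_symbols = len(A), len(B[0])
    # Step 1: initial state drawn from the starting distribution P.
    q = rng.choices(range(n_states), weights=P)[0]
    states, outputs = [], []
    for _ in range(T):  # steps 2 and 5: advance t one slot at a time
        # Step 3: output symbol drawn from row q of B.
        o = rng.choices(range(n_symbols), weights=B[q])[0]
        states.append(q)
        outputs.append(o)
        # Step 4: next state drawn from row q of A.
        q = rng.choices(range(n_states), weights=A[q])[0]
    return states, outputs

# Hypothetical two-urn, two-color model, invented for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
P = [0.6, 0.4]
hidden, seen = sample_hmm(A, B, P, 10, random.Random(0))
print(seen)  # the observer sees only this list, never `hidden`
```

Returning the hidden states alongside the outputs is only a convenience for checking training code; in the urn picture the observer never has access to them.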
Problems Using Hidden Markov Models • It’s hard a priori to say what is the best structure for an HMM for a given problem. • Empirically, many models of a given complexity often produce a similar fit, hence it’s hard to identify models. • It’s possible now, due to Amari, to say whether or not two models are stochastically equivalent, i.e. generate the same probabilities. • Metric on HMM’s. • (Usually probability 0).
Criticism Leveled Against HMM’s: Somewhat Bogus • For a hidden Markov model • The past history is reflected only in the last state that the sequence is in. Therefore prior history cannot influence the result. Speech, because of coarticulation, is dependent upon prior history, e.g. /pinz/ vs. /pits/ • There can be no backward effects. • There can be no effects of “future” utterances on the present, i.e. backwards assimilation, • e.g. grey chips vs. grey ship vs. great chip
Answers to Criticism • First Objection • An elementary Markov model by itself cannot handle this. However, distortion coefficients and delta coefficients effectively convey information about locally prior frames of the utterance. • Second Objection • Shows that speech has to be locally buffered and that a conclusion about a phoneme cannot be made without a limited lookahead, as people do. One can easily construct a Markov model to do this.
No ideal method to determine • Best model for phone, word, sentence. • However, • In fact, they are the only existing statistical models for speech recognition. • Can be used to self-validate as well as recognize, and to validate significance
Summary • Cannot directly identify HMM structure; however, one can still use the model and assume the speech source obeys the given structure. • BUT • If one cannot choose suitable parameters for the model, it turns out to be useless. • This problem has been solved
History • Technique originated by Leonard Baum. • Baum (1966) • Author wrote 3 or 4 papers in math journals. • Probably the most important innovation in mathematical statistics at the time. • Took about 10 years for Fred Jelinek and Baker to pick it up for speech. • Now used all over the place; popularized by A.P. Dempster and Rubin at Harvard.
Preconditions • For the speech recognition application, suppose that frames are vector-quantized codewords representing the speech signal. • See later: Hidden Markov models can do their own quantization. However, this case is treated first for simplicity.
Three Basic Prerequisites for Hidden Markov Model Use • Problem I • Given an observation sequence, O1,…OT and L=(A,B,P) how does one compute the probability P(O| L) • Problem II • Given the observation sequence O1,…OT how can one find a state sequence which is optimal in some sense
Problem III • Given a training sequence O=O1…OT, how do we train the model to maximize P(O|L)? • Hidden Markov models are a form of maximum likelihood estimation. In principle one can use them to do statistical tests of hypotheses, in particular tests of the values of certain parameters … • Maximum likelihood estimation is a method which is known to be asymptotically optimal for estimating the parameters, implicitly minimizing the probability of error sequences.
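Solutions to the Three Hidden Markov Problems • Problem I. • Given an observation sequence, how do we compute its likelihood? • Solution • Brute force • 1. Enumerate a state sequence I = q1,…,qT • 2. Calculate the output probabilities $P(O \mid I, L) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T)$ • 3. Calculate the transition probabilities $P(I \mid L) = p_{q_1}\, a_{q_1 q_2} \cdots a_{q_{T-1} q_T}$
Problem I, Brute Force Continued • Sum over all state sequences of length T: $P(O \mid L) = \sum_I P(O \mid I, L)\, P(I \mid L)$ • Method is exponential in complexity, requires approximately $2T \cdot N^T$ computations, totally intractable • But the same quantity can be computed with complexity linear in T, on the order of $N^2 T$ operations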
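For very short sequences the brute-force sum can still be written out directly, which is useful mainly as a reference against which faster methods can be checked. The sketch below assumes integer-coded states and symbols and nested-list parameters; `brute_force_likelihood` and the toy numbers are hypothetical.

```python
from itertools import product

def brute_force_likelihood(A, B, P, obs):
    """P(O | L) by summing over every state sequence of length T (N**T terms)."""
    n_states, total = len(A), 0.0
    for path in product(range(n_states), repeat=len(obs)):
        # Transition part: starting probability times the chain of a's.
        prob = P[path[0]]
        for prev, cur in zip(path, path[1:]):
            prob *= A[prev][cur]
        # Output part: one b term per time slot.
        for q, o in zip(path, obs):
            prob *= B[q][o]
        total += prob
    return total

# Hypothetical two-state, two-symbol model, invented for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
P = [0.6, 0.4]
print(brute_force_likelihood(A, B, P, [0, 1]))
```

The loop visits N**T paths, so even N = 5 and T = 100 is far out of reach; the point of the sketch is only to make the definition concrete.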
How to Solve Problem • Define $\alpha_t(i) = P(O_1 \dots O_t,\, q_t = i \mid L)$ • This function, called the alpha terminal, is the probability of observing the initial portion of the sequence and ending up in state i at time t. There are TN of these alpha terminals and they can be calculated recursively. • Define $\beta_t(i) = P(O_{t+1} \dots O_T \mid q_t = i,\, L)$ • This function, called the beta terminal, is the probability that one has a given terminal sequence given that one is in state i at time t.
Forward Algorithm • Using alpha and beta terminals defined recursively, one can compute the answer to these questions in NT steps. First in the forward direction, i.e. the forward algorithm • Initialization: $\alpha_1(k) = p_k\, b_k(O_1)$ • Recursion: $\alpha_t(k) = \Big[\sum_j \alpha_{t-1}(j)\, a_{jk}\Big]\, b_k(O_t)$ • Termination: $P(O \mid L) = \sum_k \alpha_T(k)$ • [Figure: computation trellis; each state j at time t-1 feeds state k at time t through ajk and bk(Ot)]
Forward Algorithm Explanation • Key Recursion • Sum of products of three terms • To calculate the probability of an initial sequence ending in state k • Need to consider the contribution from each prior state j • Each contribution consists of • the alpha terminal of the prior state • multiplied by the corresponding transition probability • multiplied by the probability of the output in the new state
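The three-term recursion translates almost line for line into code. This is a minimal sketch, assuming integer-coded states and symbols and nested-list parameters (A, B, P); the function name `forward` and the toy model are hypothetical, and no scaling is applied, so long sequences would underflow.

```python
def forward(A, B, P, obs):
    """Return the alpha trellis and P(O | L) for the HMM (A, B, P)."""
    n = len(A)
    # Initialization: alpha_1(k) = p_k * b_k(O_1)
    alpha = [[P[k] * B[k][obs[0]] for k in range(n)]]
    # Recursion: alpha_t(k) = (sum_j alpha_{t-1}(j) * a_jk) * b_k(O_t)
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[j] * A[j][k] for j in range(n)) * B[k][o]
            for k in range(n)
        ])
    # Termination: P(O | L) = sum_k alpha_T(k)
    return alpha, sum(alpha[-1])

# Hypothetical two-state, two-symbol model, invented for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
P = [0.6, 0.4]
trellis, likelihood = forward(A, B, P, [0, 1])
print(likelihood)
```

Each of the T trellis columns costs N*N multiply-adds, which is where the linear dependence on T comes from.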
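Backward Algorithm • Very similar to the forward algorithm • Initialization: $\beta_T(j) = 1$ • Recursion: $\beta_{t-1}(j) = \sum_k a_{jk}\, b_k(O_t)\, \beta_t(k)$ • Termination: $P(O \mid L) = \sum_j p_j\, b_j(O_1)\, \beta_1(j)$ • [Figure: computation trellis; each state k at time t feeds back to state j at time t-1 through ajk and bk(Ot)]
Backward Algorithm Explanation • Backward Algorithm • Sum of products of three terms (as before) • To calculate the probability of the terminal sequence starting from state j • Need to consider the contribution from each future state k • Each contribution consists of • the beta terminal of the future state • multiplied by the corresponding transition probability • multiplied by the probability of the output in the future state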
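A matching sketch for the backward pass, under the same assumptions as before (integer-coded states and symbols, nested-list parameters, no scaling). The name `backward` and the toy model are hypothetical; the list index `t` in the code is zero-based, so `beta[t]` holds the beta terminals for time t+1.

```python
def backward(A, B, P, obs):
    """Return the beta trellis and P(O | L) for the HMM (A, B, P)."""
    n, T = len(A), len(obs)
    # Initialization: the last column is all ones.
    beta = [[1.0] * n for _ in range(T)]
    # Recursion, moving right to left through the trellis:
    # beta at the earlier slot = sum_k a_jk * b_k(later output) * beta_k at the later slot
    for t in range(T - 1, 0, -1):
        for j in range(n):
            beta[t - 1][j] = sum(
                A[j][k] * B[k][obs[t]] * beta[t][k] for k in range(n)
            )
    # Termination: P(O | L) = sum_j p_j * b_j(O_1) * beta_1(j)
    likelihood = sum(P[j] * B[j][obs[0]] * beta[0][j] for j in range(n))
    return beta, likelihood

# Hypothetical two-state, two-symbol model, invented for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
P = [0.6, 0.4]
betas, likelihood = backward(A, B, P, [0, 1])
print(likelihood)
```

The forward and backward passes must agree on P(O | L), which makes a convenient consistency check when implementing either one.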
Problem II • How do we calculate the probability of the optimal state sequence? • Why bother • Often much faster than calculating the probability of the full observation sequence and then choosing the maximum likelihood • One may want to “parse a long string to segment it” • Problem: what is the definition of optimality? • Can choose the most likely state at each time, but • May not even be a valid path: Why? • Commonly chosen definition of optimality: the optimal legal path
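Algorithm: Viterbi Search • Should already be familiar from Dynamic Programming • Viterbi Search: replace the sum in the forward recursion by a maximum, $\delta_t(k) = \max_j \big[\delta_{t-1}(j)\, a_{jk}\big]\, b_k(O_t)$, and keep a back-pointer to the maximizing j so that the best path can be traced back from the final state
Viterbi Search • Principle: same as the dynamic programming principle discussed two lectures ago. • Frequent use • Multitude of paths through the full model.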
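As a concrete instance of the dynamic programming principle, a Viterbi search over a small discrete HMM might look like the following. This is a sketch under stated assumptions: integer-coded states and symbols, nested-list parameters, plain probabilities rather than log probabilities, and no pruning. The name `viterbi` and the toy model are hypothetical.

```python
def viterbi(A, B, P, obs):
    """Return (best legal state path, joint probability of that path and obs)."""
    n = len(A)
    # Best score of any path ending in each state after the first output.
    delta = [P[i] * B[i][obs[0]] for i in range(n)]
    backptr = []
    for o in obs[1:]:
        step, new_delta = [], []
        for k in range(n):
            # Keep only the best predecessor j of state k (max instead of sum).
            best_j = max(range(n), key=lambda j: delta[j] * A[j][k])
            step.append(best_j)
            new_delta.append(delta[best_j] * A[best_j][k] * B[k][o])
        backptr.append(step)
        delta = new_delta
    # Trace the back-pointers from the best final state.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical two-state, two-symbol model, invented for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
P = [0.6, 0.4]
best_path, best_prob = viterbi(A, B, P, [0, 1])
print(best_path, best_prob)
```

Because the path is rebuilt from back-pointers, every step in it uses a transition that was actually scored, which is what makes the result a legal path rather than a sequence of individually most likely states.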
Example • Sequence Model • Word Model • Phone Model • [Figure: sequence model built from the word models one, two, … nine, each word model expanded into phone models]
Frequent Use of Viterbi Search • Calculating the paths through the full model and a full search for a large vocabulary model involve a massive number of transitions through the network. • One can prune the search at each stage by only considering transitions from states such that
Problem III • How do we train the model given multiple observation sequences? • There is no known analytic formula which maximizes the probability of an observation sequence. • There is an iterative procedure, the Baum--Welch update or EM algorithm, which always increases P(O|L) until a maximum is achieved
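Need Certain Additional Quantities • $\xi_t(k,j) = P(q_t = k,\, q_{t+1} = j \mid O, L) = \alpha_t(k)\, a_{kj}\, b_j(O_{t+1})\, \beta_{t+1}(j) \,/\, P(O \mid L)$: probability of transferring from state k to state j at time t, given the model and observation sequence • $\gamma_t(i) = P(q_t = i \mid O, L) = \alpha_t(i)\, \beta_t(i) \,/\, P(O \mid L)$: probability of being in state i at time t given the model and observation sequence
Auxiliary Quantities II • $\sum_{t=1}^{T-1} \gamma_t(i)$ is the expected number of transitions out of state i given the observation sequence and model • $\sum_{t=1}^{T-1} \xi_t(i,j)$ is the expected number of transitions from state i to state j given the observation sequence and the model
Baum Welch reupdate: EM algorithm • Start with estimates for (A,B,P) • Reestimate the parameters by calculating their most likely value. In this case this amounts to replacing the parameters by their expected value. • Given the observations, estimate the sufficient statistics of the model, which are the expected numbers of transitions and of outputs in each state
Update Formula • $\bar{p}_i = \gamma_1(i)$ • $\bar{a}_{ij} = \sum_{t=1}^{T-1} \xi_t(i,j) \,\big/\, \sum_{t=1}^{T-1} \gamma_t(i)$ • $\bar{b}_j(k) = \sum_{t:\, O_t = v_k} \gamma_t(j) \,\big/\, \sum_{t=1}^{T} \gamma_t(j)$ • Continue reupdating parameters until one obtains no significant change.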
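One re-estimation pass can be sketched as follows. This is an illustration under several assumptions not stated on the slides: a single training sequence, integer-coded states and symbols, nested-list parameters that are all strictly positive, and no scaling. The names `forward_backward` and `baum_welch_step`, and the toy model and data, are hypothetical.

```python
def forward_backward(A, B, P, obs):
    """Alpha and beta trellises plus P(O | L), unscaled."""
    n, T = len(A), len(obs)
    alpha = [[P[i] * B[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][j] * A[j][k] for j in range(n)) * B[k][obs[t]]
                      for k in range(n)])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 1, 0, -1):
        for j in range(n):
            beta[t - 1][j] = sum(A[j][k] * B[k][obs[t]] * beta[t][k] for k in range(n))
    return alpha, beta, sum(alpha[-1])

def baum_welch_step(A, B, P, obs):
    """One re-estimation; returns (new_A, new_B, new_P, likelihood of the old model)."""
    n, m, T = len(A), len(B[0]), len(obs)
    alpha, beta, like = forward_backward(A, B, P, obs)
    # Posterior probability of being in state i at each time slot.
    gamma = [[alpha[t][i] * beta[t][i] / like for i in range(n)] for t in range(T)]
    # Posterior probability of moving from state i to state j between slots t and t+1.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / like
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_P = gamma[0][:]
    # Expected i -> j transitions divided by expected departures from i.
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    # Expected time in j while emitting k divided by expected time in j.
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_A, new_B, new_P, like

# Hypothetical starting model and training sequence, invented for illustration.
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
P = [0.6, 0.4]
obs = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
history = []
for _ in range(6):
    A, B, P, like = baum_welch_step(A, B, P, obs)
    history.append(like)
print(history)  # likelihood of the model before each update
```

In practice one would stop when successive likelihoods differ by less than a chosen tolerance, and would use scaled or log-domain trellises so that long sequences do not underflow.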
Properties of the Update Rule • For each revision of the parameters, $P(O \mid \bar{L}) \ge P(O \mid L)$, where $\bar{L}$ is the re-estimated model. • In other words, the likelihood of the observed data never decreases with a re-estimation of the parameters. • Unfortunately, the result is a local, not a global, maximum (the best one can do).
Baum Welch: EM reupdate • Like gradient ascent, but with an improvement at every step. • Class of algorithms called EM algorithms • Uses an auxiliary function • Step I: Calculate its expectation • Step II: Maximize its expectation by choosing a new set of parameters • Step III: Iterate
EM interpretation • The auxiliary function is the log probability of an observation sequence for a set of transitions. • It’s natural to believe that if we maximize the expectation of this log probability by changing the parameters, then the overall log probability, i.e. the likelihood, will increase.
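Proof: Result I • Need Two Results • $\log \dfrac{\sum_i y_i}{\sum_i x_i} \ge \sum_i \dfrac{x_i}{\sum_k x_k} \log \dfrac{y_i}{x_i}$ says that the log of the ratio of two sums is at least the average of the logs of the ratios of the summands, averaged under the probabilities defined by the summands in the denominator • Proof • Direct application of Jensen’s inequality since log is concave • log(E(x)) >= E(log(x))
Result II • If xi is a vector of probabilities and ci is a vector of positive numbers, then • $f(x) = \sum_i c_i \log(x_i)$ has a maximum when • $x_i = c_i \,/\, \sum_k c_k$ • Simple proof • Use the method of Lagrange multipliers: maximize $\sum_i c_i \log(x_i) - \mu \big(\sum_i x_i - 1\big)$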
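The Lagrange multiplier step can be written out in full. The multiplier symbol $\mu$ and the name $F$ are chosen here for illustration; the constraint is that the $x_i$ sum to one.

```latex
\begin{aligned}
F(x,\mu) &= \sum_i c_i \log x_i \;-\; \mu\Big(\sum_i x_i - 1\Big) \\
\frac{\partial F}{\partial x_i} &= \frac{c_i}{x_i} - \mu = 0
  \quad\Longrightarrow\quad x_i = \frac{c_i}{\mu} \\
\sum_i x_i = 1 &\quad\Longrightarrow\quad \mu = \sum_k c_k
  \quad\Longrightarrow\quad x_i = \frac{c_i}{\sum_k c_k}
\end{aligned}
```

Since each $c_i \log x_i$ is concave in $x_i$ for $c_i > 0$, this stationary point is a maximum. It is also why each Baum--Welch update is a ratio of expected counts: every row of A and B is re-estimated by maximizing a sum of exactly this form.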
Likelihood Always Increases Using HMM learning • One does no worse than choosing the current model. • If we maximize Q, then the likelihood of the observations increases.
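The argument can be sketched in one chain of inequalities. The notation is assumed for illustration: $I$ ranges over state sequences, $\bar{L}$ is the re-estimated model, and the auxiliary function is taken to be $Q(L,\bar{L}) = \sum_I P(O, I \mid L)\, \log P(O, I \mid \bar{L})$. The step marked with the inequality is Result I, $\log\frac{\sum_i y_i}{\sum_i x_i} \ge \sum_i \frac{x_i}{\sum_k x_k}\log\frac{y_i}{x_i}$, applied with $x_I = P(O, I \mid L)$ and $y_I = P(O, I \mid \bar{L})$.

```latex
\begin{aligned}
\log \frac{P(O \mid \bar{L})}{P(O \mid L)}
  &= \log \frac{\sum_I P(O, I \mid \bar{L})}{\sum_I P(O, I \mid L)} \\
  &\ge \sum_I \frac{P(O, I \mid L)}{P(O \mid L)}
        \log \frac{P(O, I \mid \bar{L})}{P(O, I \mid L)} \\
  &= \frac{Q(L,\bar{L}) - Q(L,L)}{P(O \mid L)}
\end{aligned}
```

Choosing $\bar{L} = L$ makes the right-hand side zero, so maximizing $Q(L,\bar{L})$ over $\bar{L}$ can only make it non-negative, and hence $P(O \mid \bar{L}) \ge P(O \mid L)$.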