Isolated-Word Speech Recognition Using Hidden Markov Models • 6.962 Week 10 Presentation • Irina Medvedev, Massachusetts Institute of Technology • April 19, 2001
Outline • Markov Processes, Chains, Models • Isolated-Word Speech Recognition ¤ Feature Analysis ¤ Unit Matching ¤ Training ¤ Recognition • Conclusions
Markov Process • The process, x(t), is first-order Markov if, for any set of ordered times, t_1 < t_2 < … < t_n, p( x(t_n) | x(t_{n−1}), …, x(t_1) ) = p( x(t_n) | x(t_{n−1}) ) • The current value of a Markov process carries all of the memory necessary to predict the future • The past does not add any additional information about the future
Markov Process • The transition probability density provides an important statistical description of a Markov process and is defined as p( x(t_2) = X_2 | x(t_1) = X_1 ) • A complete specification of a Markov process consists of • first-order density: p( x(t_1) = X_1 ) • transition density: p( x(t_2) = X_2 | x(t_1) = X_1 )
Markov Chains • A Markov chain can be used to describe a system which, at any time, belongs to one of N distinct states, S_1, S_2, …, S_N • At regularly spaced times, the system may stay in the same state or transition to a different state • The state at time t is denoted by q_t [Figure: fully connected Markov model]
Markov Chains • State transitions are made according to a set of probabilities associated with each state • These probabilities are stored in the state transition matrix A = {a_ij}, 1 ≤ i, j ≤ N, where N is the number of states in the Markov chain • The state transition probabilities are a_ij = P( q_t = S_j | q_{t−1} = S_i ) and have the properties a_ij ≥ 0 and Σ_j a_ij = 1
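As a rough illustration (not part of the original slides), a small NumPy sketch of a state transition matrix and its properties; the probability values are made up:

```python
import numpy as np

# Illustrative 3-state transition matrix A, with a_ij = P(q_t = S_j | q_{t-1} = S_i).
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])

# Properties: every entry is non-negative and every row sums to one.
assert np.all(A >= 0)
assert np.allclose(A.sum(axis=1), 1.0)

# Probability of the state path S1 -> S2 -> S2 -> S3, given we start in S1.
path = [0, 1, 1, 2]
p = np.prod([A[i, j] for i, j in zip(path[:-1], path[1:])])
print(f"P(path | start in S1) = {p:.4f}")   # 0.3 * 0.5 * 0.3 = 0.045
```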
Hidden Markov Models • Hidden Markov Models (HMMs) are used when the states are not directly observable • Instead, the observation is a probabilistic function of the state rather than the state itself • The states are described by a probability model • The HMM is a doubly embedded stochastic process
HMM Example: Coin Toss • How do we build an HMM to explain an observed sequence of heads and tails? • Choose a 2-state model • Several possibilities exist • 1-coin model: observable • 2-coin model: states are hidden
Hidden Markov Models • Hidden Markov Models are characterized by • N, the number of states in the model • A, the state transition matrix • B = {b_j(o)}, the observation probability distribution for each state • π, the initial state distribution • Model parameter set: λ = (A, B, π)
Left-Right HMM • Can only transition to a higher state or stay in the same state • The no-skip constraint allows states to transition only to the next state or remain in the same state • Zeros in the state transition matrix represent illegal state transitions [Figure: 4-state left-right HMM with no skip transitions]
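A minimal sketch (not from the slides) of how such a left-right, no-skip transition matrix might be constructed; the stay_prob value is an arbitrary illustration:

```python
import numpy as np

def left_right_no_skip(N: int, stay_prob: float = 0.6) -> np.ndarray:
    """Transition matrix for an N-state left-right HMM in which each state
    may only remain where it is or advance to the next state (no skips)."""
    A = np.zeros((N, N))
    for i in range(N - 1):
        A[i, i] = stay_prob            # self-loop
        A[i, i + 1] = 1.0 - stay_prob  # advance to the next state
    A[N - 1, N - 1] = 1.0              # final state is absorbing
    return A

print(left_right_no_skip(4))
# The zeros below the diagonal and beyond the first superdiagonal
# mark the illegal transitions.
```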
Isolated-Word Speech Recognition • Recognize one word at a time • Assume incoming signal is of the form: • silence – speech – silence • Feature Analysis • Training • Unit Matching • Recognition
Feature Analysis • We perform feature analysis to extract the observation vectors upon which all processing will be performed • The discrete-time speech signal is s[n], n = 0, 1, …, V−1, with discrete Fourier transform S[k] • To reduce the dimensionality of the V-dim speech vector, we use cepstral coefficients, which serve as the feature observation vector for all further processing
Cepstral Coefficients • Feature vectors are cepstral coefficients obtained from the sampled speech vector, c[n] = IDFT{ log P̂(ω_k) }, where P̂(ω_k) = |S[k]|² / V is the periodogram estimate of the power spectral density of the speech • We eliminate the zeroth component and keep cepstral coefficients 1 through L−1 • Dimensionality reduction: V → L−1
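A sketch of this cepstral computation (not from the original slides), assuming the log-periodogram / inverse-DFT definition above; the frame length and L are illustrative:

```python
import numpy as np

def cepstral_features(frame: np.ndarray, L: int = 13) -> np.ndarray:
    """Cepstral coefficients of one windowed speech frame: the inverse DFT of
    the log periodogram, keeping coefficients 1 through L-1."""
    V = len(frame)
    S = np.fft.fft(frame)
    periodogram = (np.abs(S) ** 2) / V      # PSD estimate |S[k]|^2 / V
    log_psd = np.log(periodogram + 1e-12)   # small floor avoids log(0)
    cepstrum = np.fft.ifft(log_psd).real
    return cepstrum[1:L]                    # dimensionality reduction: V -> L-1

# Example on a synthetic 256-sample "frame" (white noise, Hamming-windowed).
rng = np.random.default_rng(0)
frame = rng.standard_normal(256) * np.hamming(256)
print(cepstral_features(frame).shape)       # (12,)
```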
Properties of Cepstral Coefficients • Serve to undo the convolution between the pitch (excitation) and the vocal tract response • High-order cepstral components carry speaker-dependent pitch information, which is not relevant for speech recognition • Cepstral coefficients are well approximated by a Gaussian probability density function (pdf) • Correlation values of cepstral coefficients are very low
Modeling of Cepstral Coefficients • The HMM assumes that the Markovian states generate the cepstral vectors • Each state S_j represents a Gaussian source with mean vector μ_j and covariance matrix Σ_j • Each feature vector of cepstral coefficients can be modeled as a sample vector of an L-dim Gaussian random vector with mean vector μ_j and diagonal covariance matrix Σ_j
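For illustration only, a diagonal-covariance Gaussian log-pdf of the kind each state would use to score a cepstral vector; the parameter values are made up:

```python
import numpy as np

def log_gaussian_diag(o: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Log of an L-dim Gaussian pdf with diagonal covariance (var holds the
    diagonal), evaluated at one cepstral feature vector o."""
    L = len(o)
    return -0.5 * (L * np.log(2 * np.pi)
                   + np.sum(np.log(var))
                   + np.sum((o - mean) ** 2 / var))

# Made-up state parameters, just to show the call.
mean = np.zeros(12)
var = 0.5 * np.ones(12)
o = np.full(12, 0.1)
print(log_gaussian_diag(o, mean, var))
```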
Unit Matching • Initial goal: obtain an HMM for each speech recognition unit • Large vocabulary (300 words): recognition units are phonemes • Small vocabulary (10 words): recognition units are words • We will consider an isolated-word speech recognition system for a small vocabulary of M words
Notation • The observation is O = (o_1, o_2, …, o_T), where each o_t is a cepstral feature vector and T is the number of feature vectors in an observation • The state sequence is q = (q_1, q_2, …, q_T), where each q_t ∈ {S_1, …, S_N} • State index: j • Word index: v • Time index: t • The term model will be used for both the HMM and the parameter set describing the HMM, λ
Training • We need to obtain an HMM, λ_v, for each of the M words • The process of building the HMMs is called training • Each HMM is characterized by the number of states, N, and the model parameter set, λ_v • Each cepstral feature vector, o_t, in state S_j can be modeled by an L-dim Gaussian pdf b_j(o_t) = N(o_t; μ_j, Σ_j), where μ_j is the mean vector and Σ_j is the covariance matrix in state S_j
Training • A Gaussian pdf is completely characterized by its mean vector and covariance matrix • The model parameter set can therefore be written as λ_v = (A, {μ_j, Σ_j}_{j=1}^{N}, π) • The training procedure is the same for each word. For convenience, we will drop the subscript v from λ_v
Building the HMM • To build the HMM, we need to determine the parameter set that maximizes the likelihood of the observation for that word • Objective: max_λ max_q P(O, q | λ) • The double maximization can be performed by optimizing over the state sequence and the model individually
Uniform Segmentation • Determining the initial state sequence [Figure: 50 feature vectors segmented uniformly into 8 states]
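A minimal sketch (not from the slides) of the uniform segmentation step, assigning the T feature vectors to N states in roughly equal consecutive blocks:

```python
import numpy as np

def uniform_segmentation(T: int, N: int) -> np.ndarray:
    """Initial state sequence: assign T feature vectors to N states in
    (nearly) equal consecutive blocks."""
    return np.floor(np.arange(T) * N / T).astype(int)

q0 = uniform_segmentation(T=50, N=8)
print(q0)                 # 0 0 ... 0 1 1 ... 7
print(np.bincount(q0))    # roughly 50/8 ~ 6 vectors per state
```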
Maximization over the Model • Given the initial state sequence, we maximize over the model • The maximization entails estimating the model parameters from the observation given the state sequence • Estimation is performed using the Baum-Welch re-estimation formulas
Re-estimation Formulas • State transition matrix: â_ij = (number of transitions from S_i to S_j) / (number of transitions out of S_i) • Mean vector per state: μ̂_j = (1/N_j) Σ_{t : q_t = S_j} o_t • Covariance matrix per state: Σ̂_j = (1/N_j) Σ_{t : q_t = S_j} (o_t − μ̂_j)(o_t − μ̂_j)^T • Initial state distribution: π̂_i = 1 if q_1 = S_i, and 0 otherwise • where N_j is the number of feature vectors in state S_j
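A possible NumPy sketch of these re-estimation formulas (an illustration, not the original implementation), assuming a single training observation and diagonal covariances; obs is a T × L matrix of cepstral vectors and q the current state sequence:

```python
import numpy as np

def reestimate(obs: np.ndarray, q: np.ndarray, N: int):
    """Re-estimate the model parameters from an observation matrix obs (T x L)
    and the current state sequence q (length T), with diagonal covariances."""
    T, L = obs.shape
    means = np.zeros((N, L))
    covs = np.zeros((N, L))                 # diagonals of the covariance matrices
    for j in range(N):
        seg = obs[q == j]                   # the N_j vectors assigned to state S_j
        means[j] = seg.mean(axis=0)
        covs[j] = seg.var(axis=0) + 1e-6    # variance floor keeps the pdf proper
    A = np.zeros((N, N))
    for t in range(1, T):
        A[q[t - 1], q[t]] += 1              # count transitions S_i -> S_j
    row_sums = A.sum(axis=1, keepdims=True)
    A = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
    pi = np.eye(N)[q[0]]                    # initial state distribution
    return A, means, covs, pi
```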
Maximization over the state sequence • Given the model, we maximize over the state sequence: q* = argmax_q P(O, q | λ) • The probability expression can be rewritten as P(O, q | λ) = π_{q_1} b_{q_1}(o_1) Π_{t=2}^{T} a_{q_{t−1} q_t} b_{q_t}(o_t)
Maximization over the state sequence • Applying the logarithm transforms the maximization of a product into the maximization of a sum: log P(O, q | λ) = log π_{q_1} + log b_{q_1}(o_1) + Σ_{t=2}^{T} [ log a_{q_{t−1} q_t} + log b_{q_t}(o_t) ] • We are still looking for the state sequence that maximizes this expression • The optimal state sequence can be determined using the Viterbi algorithm
Trellis Structure of HMMs • Redrawing the HMM as a trellis makes it easy to see the state sequence as a path through the trellis • The optimal state sequence is determined by the Viterbi algorithm as the single best path that maximizes P(O, q | λ)
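A log-domain Viterbi sketch (illustrative, not from the slides) for finding the best path through the trellis:

```python
import numpy as np

def viterbi(log_b: np.ndarray, log_A: np.ndarray, log_pi: np.ndarray):
    """Single best state path through the trellis.

    log_b  : T x N matrix of state log-likelihoods log b_j(o_t)
    log_A  : N x N log transition matrix
    log_pi : length-N log initial state distribution
    Returns (best path, its log probability)."""
    T, N = log_b.shape
    delta = np.full((T, N), -np.inf)             # best partial-path scores
    psi = np.zeros((T, N), dtype=int)            # backpointers
    delta[0] = log_pi + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # score of arriving in each state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(delta[-1].argmax())
    for t in range(T - 2, -1, -1):               # trace the backpointers
        path[t] = psi[t + 1, path[t + 1]]
    return path, float(delta[-1].max())
```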
Training Procedure: Cepstral Calculation → Uniform Segmentation → Estimation of λ (Baum-Welch) → State Sequence Segmentation (Viterbi) → Converged? If no, repeat the estimation and segmentation steps; if yes, training is complete
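Putting the pieces together, a rough sketch of the training loop above; the helper functions (uniform_segmentation, reestimate, log_gaussian_diag, viterbi) are the illustrative sketches from the earlier slides, not the original implementation:

```python
import numpy as np

def train_word_model(obs: np.ndarray, N: int = 8, max_iters: int = 20, tol: float = 1e-3):
    """Segmental training loop: uniform segmentation, parameter estimation,
    Viterbi re-segmentation, repeated until the best-path score converges."""
    T = len(obs)
    q = uniform_segmentation(T, N)                    # initial state sequence
    prev_score = -np.inf
    for _ in range(max_iters):
        A, means, covs, pi = reestimate(obs, q, N)    # estimate lambda from (O, q)
        log_b = np.array([[log_gaussian_diag(o, means[j], covs[j])
                           for j in range(N)] for o in obs])
        with np.errstate(divide="ignore"):            # log(0) -> -inf is fine here
            q, score = viterbi(log_b, np.log(A), np.log(pi))
        if score - prev_score < tol:                  # converged?
            break
        prev_score = score
    return A, means, covs, pi
```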
Recognition • We have a set of HMMs, λ_1, …, λ_M, one for each word • Objective: choose the word model that maximizes the probability of the observation given the model (maximum-likelihood detection rule) • The classifier for observation O is v* = argmax_{1 ≤ v ≤ M} P(O | λ_v) • The likelihood can be written as a summation over all state sequences: P(O | λ_v) = Σ_q P(O, q | λ_v)
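For reference, this full likelihood can be evaluated with the forward recursion; a small log-domain sketch (not from the slides), reusing the log_b, log_A, log_pi quantities defined earlier:

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(log_b: np.ndarray, log_A: np.ndarray, log_pi: np.ndarray) -> float:
    """Forward recursion: log P(O | lambda), summing over all state sequences."""
    log_alpha = log_pi + log_b[0]
    for t in range(1, len(log_b)):
        log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_b[t]
    return float(logsumexp(log_alpha))
```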
Recognition • Replace the full likelihood by an approximation that takes into account only the most probable state sequence capable of producing the observation • Treating the most probable state sequence as the best path in the HMM trellis allows us to use the Viterbi algorithm to maximize this probability • The best-path classifier for observation O is v* = argmax_{1 ≤ v ≤ M} max_q P(O, q | λ_v)
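A sketch of the resulting best-path classifier (illustrative, not the original code), reusing the viterbi and log_gaussian_diag helpers from earlier slides; models holds one trained parameter set per vocabulary word:

```python
import numpy as np

def recognize(obs: np.ndarray, models) -> int:
    """Return the index of the word model with the highest best-path score.
    models is a list of (A, means, covs, pi) tuples, one per vocabulary word."""
    scores = []
    for A, means, covs, pi in models:
        N = len(pi)
        log_b = np.array([[log_gaussian_diag(o, means[j], covs[j])
                           for j in range(N)] for o in obs])
        with np.errstate(divide="ignore"):
            _, score = viterbi(log_b, np.log(A), np.log(pi))
        scores.append(score)
    return int(np.argmax(scores))
```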
Recognition: Cepstral Calculation → score each word model → Select Maximum → index of the recognized word
Conclusion • Introduced hidden Markov models • Described the process of isolated-word speech recognition ¤ Feature analysis ¤ Unit matching ¤ Training ¤ Recognition • Other considerations ¤ Artificial Neural Networks (ANNs) for speech recognition ¤ Hybrid HMM/ANN models ¤ Minimum classification error HMM design