
Discriminative Learning for Hidden Markov Models

Discriminative Learning for Hidden Markov Models. Li Deng. Microsoft Research. EE 516; UW Spring 2009. Minimum Classification Error (MCE). The objective function of MCE training is a smoothed recognition error rate.


Presentation Transcript


  1. Discriminative Learning for Hidden Markov Models Li Deng Microsoft Research EE 516; UW Spring 2009

  2. Minimum Classification Error (MCE) • The objective function of MCE training is a smoothed recognition error rate. • Traditionally, the MCE criterion is optimized through stochastic gradient descent, e.g., generalized probabilistic descent (GPD). • This work proposes a growth transformation (GT) based method for MCE model estimation.

  3. Automatic Speech Recognition (ASR) Speech recognition:

  4. Models (feature functions) in ASR • ASR in the log-linear framework. • Λ is the parameter set of the acoustic model (HMM), which is the object of MCE training in this work.

  5. MCE: Misclassification measure Define the misclassification measure d_r(X_r,Λ) (for the case of using the correct token and the top-one incorrect competing token). s_{r,1}: the top-one incorrect competing string (not equal to S_r).
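
The defining equation is not reproduced in the transcript. A commonly used form for this one-competitor case, assumed here rather than taken from the slide, contrasts the joint log-likelihoods of the reference string S_r and the top competitor s_{r,1}:

```latex
d_r(X_r, \Lambda) \;=\; -\log p(X_r, S_r \mid \Lambda) \;+\; \log p(X_r, s_{r,1} \mid \Lambda)
```

Under this assumed form, d_r(X_r,Λ) is positive exactly when the competitor scores higher than the reference, which matches the sign convention used on the next slide.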

  6. MCE: Loss function Classification: Classification error: d_r(X_r,Λ) > 0 → 1 classification error; d_r(X_r,Λ) < 0 → 0 classification errors. Loss function: a smoothed error-count function.
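
A typical choice for the smoothed error count is a sigmoid of the misclassification measure. The form below is an assumption (the slide's equation is not in the transcript), and the slope parameter α is a name introduced here for illustration:

```latex
\ell\bigl(d_r(X_r,\Lambda)\bigr) \;=\; \frac{1}{1 + e^{-\alpha\, d_r(X_r,\Lambda)}}, \qquad \alpha > 0
```

As α grows, this function approaches the 0/1 step described above: it tends to 1 for d_r > 0 and to 0 for d_r < 0, while remaining differentiable in Λ.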

  7. MCE: Objective function MCE objective function: L_MCE(Λ) is the smoothed recognition error rate at the string (token) level. The acoustic model is trained to minimize L_MCE(Λ), i.e., Λ* = argmin_Λ {L_MCE(Λ)}
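
As a rough illustration of how a smoothed error rate relates to the hard error rate, the sketch below evaluates both on toy log-likelihoods. It assumes a sigmoid loss, a log-likelihood-difference misclassification measure, and normalization by the number of tokens; the function names, the slope `alpha`, and the numbers are all hypothetical and not taken from the slides.

```python
import math

def misclassification(logp_ref, logp_comp):
    # Assumed measure: competitor log-likelihood minus reference log-likelihood.
    return logp_comp - logp_ref

def smoothed_loss(d, alpha=1.0):
    # Sigmoid smoothing of the 0/1 error count (alpha is an assumed slope).
    return 1.0 / (1.0 + math.exp(-alpha * d))

def mce_objective(tokens, alpha=1.0):
    # Average smoothed loss over all training tokens (assumed 1/R normalization).
    losses = [smoothed_loss(misclassification(r, c), alpha) for r, c in tokens]
    return sum(losses) / len(losses)

def hard_error_rate(tokens):
    # Fraction of tokens where the competitor outscores the reference.
    return sum(1 for r, c in tokens if c > r) / len(tokens)

# Toy (reference, competitor) joint log-likelihoods for four tokens.
tokens = [(-10.0, -14.0), (-12.0, -11.0), (-9.0, -15.0), (-20.0, -20.5)]

print("hard error rate :", hard_error_rate(tokens))
print("smoothed, a=1   :", mce_objective(tokens, alpha=1.0))
print("smoothed, a=50  :", mce_objective(tokens, alpha=50.0))
```

With a large slope the smoothed value approaches the hard error rate; with a small slope, near-misses such as the fourth token contribute a noticeable fraction of an error.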

  8. MCE: Optimization

  9. MCE: Optimization • Growth Transformation based MCE: If Λ = T(Λ′) ensures P(Λ) > P(Λ′), i.e., P(Λ) grows, then T(∙) is called a growth transformation of Λ for P(Λ). • Chain of reformulations: Minimizing L_MCE(Λ) = ∑ l(d(∙)) → Maximizing P(Λ) = G(Λ)/H(Λ) → Maximizing F(Λ;Λ′) = G − P′×H + D → Maximizing F(Λ;Λ′) = ∑ f(∙) → Maximizing U(Λ;Λ′) = ∑ f′(∙) log f(∙) → GT formula: ∂U(∙)/∂Λ = 0 ⇒ Λ = T(Λ′)

  10. MCE: Optimization Rewrite the MCE loss function; then minimizing L_MCE(Λ) is equivalent to maximizing Q(Λ).
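
One way to see this equivalence is sketched below. It assumes a sigmoid loss with unit slope, the log-likelihood-difference measure d_r = −log p(X_r,S_r|Λ) + log p(X_r,s_{r,1}|Λ), and an objective normalized by the number of training tokens R; none of these specifics are confirmed by the transcript.

```latex
\begin{aligned}
\ell(d_r) &= \frac{1}{1 + e^{-d_r}}
           = \frac{p(X_r, s_{r,1} \mid \Lambda)}
                  {p(X_r, S_r \mid \Lambda) + p(X_r, s_{r,1} \mid \Lambda)}, \\[4pt]
1 - \ell(d_r) &= \frac{p(X_r, S_r \mid \Lambda)}
                      {p(X_r, S_r \mid \Lambda) + p(X_r, s_{r,1} \mid \Lambda)}, \\[4pt]
Q(\Lambda) &= R\,\bigl(1 - L_{\mathrm{MCE}}(\Lambda)\bigr)
            = \sum_{r=1}^{R} \frac{p(X_r, S_r \mid \Lambda)}
                                  {p(X_r, S_r \mid \Lambda) + p(X_r, s_{r,1} \mid \Lambda)} .
\end{aligned}
```

Because Q(Λ) differs from L_MCE(Λ) only by a sign flip and constants, minimizing one maximizes the other.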

  11. MCE: Optimization Q(Λ) is further reformulated as a single fractional function P(Λ) = G(Λ)/H(Λ).
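
The slide's definitions of the numerator and denominator are not in the transcript. The sketch below checks numerically one construction that turns a sum of per-token ratios into a single ratio: put everything over the common denominator formed by summing joint likelihoods over all combinations of per-token hypotheses, and weight each combination in the numerator by its number of correct tokens. This construction, the names `G` and `H` as used here, and the toy numbers are assumptions for illustration.

```python
from itertools import product
from math import prod, isclose

def sum_of_ratios(tokens):
    # tokens: list of (p_ref, p_comp) joint likelihoods, one pair per token.
    return sum(a / (a + b) for a, b in tokens)

def single_fraction(tokens):
    # Enumerate every joint hypothesis: index 0 = reference, 1 = competitor.
    G = 0.0
    H = 0.0
    for choice in product((0, 1), repeat=len(tokens)):
        joint = prod(tok[c] for tok, c in zip(tokens, choice))
        n_correct = sum(1 for c in choice if c == 0)
        G += joint * n_correct   # numerator: joint likelihood times correct count
        H += joint               # denominator: plain sum of joint likelihoods
    return G, H

tokens = [(0.6, 0.2), (0.1, 0.3), (0.5, 0.5)]
G, H = single_fraction(tokens)
print(sum_of_ratios(tokens), G / H)
```

The two printed values agree up to floating-point rounding, which is the property that lets a sum of fractions be handled as one rational function of the model likelihoods.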

  12. MCE: Optimization Increasing P(Λ) can be achieved by maximizing F(Λ;Λ′) = G(Λ) − P(Λ′)×H(Λ) + D, as long as D is a Λ-independent constant (Λ′ is the parameter set obtained from the last iteration). Substitute G(Λ) and H(Λ) into F(Λ;Λ′):
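
The reason an increase in F implies an increase in P can be written out in two lines, using F(Λ;Λ′) = G − P′×H + D and P = G/H. The only added assumption is that H(Λ) is positive, which holds if H is a sum of likelihoods:

```latex
\begin{aligned}
F(\Lambda';\Lambda') &= G(\Lambda') - P(\Lambda')\,H(\Lambda') + D = D, \\[4pt]
F(\Lambda;\Lambda') - F(\Lambda';\Lambda')
  &= G(\Lambda) - P(\Lambda')\,H(\Lambda)
   = H(\Lambda)\,\bigl[\,P(\Lambda) - P(\Lambda')\,\bigr].
\end{aligned}
```

With H(Λ) > 0, any Λ that makes F(Λ;Λ′) exceed F(Λ′;Λ′) therefore also makes P(Λ) exceed P(Λ′). The constant D cancels in the difference, so it can be chosen freely.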

  13. MCE: Optimization Reformulate F(Λ;Λ′) as a sum ∑ f(∙), a form that is ready for EM-style optimization. Note: Γ(Λ′) is a constant, and log p(χ, q | s, Λ) is easy to decompose.

  14. MCE: Optimization Increasing F(Λ;Λ′) can be achieved by maximizing U(Λ;Λ′) = ∑ f′(∙) log f(∙). Use extended Baum-Welch for the E step. log f(χ,q,s,Λ;Λ′) is decomposable w.r.t. Λ, so the M step is easy to compute. The growth transformation of Λ for the CDHMM then follows from ∂U(Λ;Λ′)/∂Λ = 0, which gives Λ = T(Λ′).
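
The step from F to the auxiliary function U is the usual Jensen-inequality argument. The sketch below writes F(Λ;Λ′) = ∑ f(∙) and f′(∙) for f evaluated at Λ = Λ′, and assumes every f′(∙) is strictly positive:

```latex
\begin{aligned}
\log\frac{F(\Lambda;\Lambda')}{F(\Lambda';\Lambda')}
  &= \log \sum \frac{f'(\cdot)}{F(\Lambda';\Lambda')}\,\frac{f(\cdot)}{f'(\cdot)} \\[4pt]
  &\ge \sum \frac{f'(\cdot)}{F(\Lambda';\Lambda')}\,\log\frac{f(\cdot)}{f'(\cdot)}
   = \frac{U(\Lambda;\Lambda') - U(\Lambda';\Lambda')}{F(\Lambda';\Lambda')} .
\end{aligned}
```

The weights f′(∙)/F(Λ′;Λ′) are positive and sum to one, so Jensen's inequality applies to the concave logarithm. Any Λ with U(Λ;Λ′) > U(Λ′;Λ′) then gives F(Λ;Λ′) > F(Λ′;Λ′).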

  15. MCE: Model estimation formulas For a Gaussian-mixture CDHMM, the GT yields re-estimation formulas for the mean and covariance of Gaussian m.
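
The formulas themselves are not in the transcript. The sketch below shows the general shape that extended Baum-Welch style updates usually take, for a one-dimensional Gaussian: signed occupancy weights pull the parameters toward the data, while a constant D interpolates with the previous values. The function name, the argument `delta_gammas` (signed per-frame occupancies), and the exact form are assumptions, not the slide's equations.

```python
def gt_update_scalar_gaussian(xs, delta_gammas, mu_old, var_old, D):
    """Hypothetical EBW-style update of a scalar Gaussian's mean and variance."""
    s0 = sum(delta_gammas)                                # signed occupancy mass
    s1 = sum(g * x for g, x in zip(delta_gammas, xs))     # signed first-order stat
    denom = s0 + D
    if denom <= 0:
        raise ValueError("D is too small: the denominator must be positive")
    # Mean: signed weighted data sum smoothed toward the old mean by D.
    mu_new = (s1 + D * mu_old) / denom
    # Variance: signed scatter around the new mean, smoothed by the old variance
    # plus a correction for how far the mean has moved.
    s2 = sum(g * (x - mu_new) ** 2 for g, x in zip(delta_gammas, xs))
    var_new = (s2 + D * var_old + D * (mu_new - mu_old) ** 2) / denom
    return mu_new, var_new

# With D = 0 and unit weights this reduces to the sample mean and variance.
print(gt_update_scalar_gaussian([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 0.0, 1.0, 0.0))
# With mixed-sign weights, a moderate D keeps the variance positive.
print(gt_update_scalar_gaussian([1.0, 3.0], [0.5, -0.2], 2.0, 1.0, 5.0))
```

A larger D makes each iteration more conservative, which is the trade-off behind the choice of D_m discussed on the next slide.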

  16. MCE: Model estimation formulas Setting of D_m: Theoretically, set D_m so that f(χ,q,s,Λ;Λ′) > 0. Empirically,

  17. MCE: Workflow Training utterances + last-iteration model Λ′ → Recognition → Competing strings; Competing strings + training transcripts → GT-MCE → New model Λ → next iteration
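
The workflow can be read as a simple loop. The skeleton below is a sketch under that reading: `recognize` and `gt_mce_update` are hypothetical callables standing in for the decoder and the GT-MCE re-estimation step, and the toy stand-ins at the bottom exist only to make the loop runnable.

```python
def gt_mce_train(utterances, transcripts, model, recognize, gt_mce_update, n_iterations):
    """Alternate recognition and GT-MCE re-estimation (all callables are hypothetical)."""
    for _ in range(n_iterations):
        # Decode each training utterance with the last-iteration model to get
        # a competing string that differs from its reference transcript.
        competitors = [
            recognize(x, ref, model) for x, ref in zip(utterances, transcripts)
        ]
        # Re-estimate the model from utterances, transcripts, and competitors.
        model = gt_mce_update(model, utterances, transcripts, competitors)
    return model

# Toy stand-ins: the "model" is just an iteration counter.
def toy_recognize(x, ref, model):
    return ref + "_wrong"

def toy_update(model, utterances, transcripts, competitors):
    return model + 1

final_model = gt_mce_train(["u1", "u2"], ["one", "two"], 0, toy_recognize, toy_update, 3)
print(final_model)
```

Passing the decoder and the update step as arguments keeps the loop independent of any particular recognizer or model representation.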

  18. Experiment: TI-DIGITS • Vocabulary: “1” to “9”, plus “oh” and “zero” • Training set: 8623 utterances / 28329 words • Test set: 8700 utterances / 28583 words • 33-dimensional spectrum feature: energy + 10 MFCCs, plus ∆ and ∆∆ features • Model: continuous-density HMMs • Total number of Gaussian components: 3284

  19. Experiment: TI-DIGITS GT-MCE vs. ML (maximum likelihood) baseline • Obtains the lowest error rate on this task • Reduces recognition word error rate (WER) by 23% • Fast and stable convergence

  20. Experiment: Microsoft Tele. ASR • Microsoft Speech Server – ENUTEL • A telephony speech recognition system • Training set: 2000 hours of speech / 2.7 million utterances • 33-dim spectrum features: (E + MFCCs) + ∆ + ∆∆ • Acoustic model: Gaussian mixture HMM • Total number of Gaussian components: 100K • Vocabulary: 120K (delivered vendor lexicon) • CPU cluster: 100 CPUs @ 1.8GHz – 3.4GHz • Training cost: 4~5 hours per iteration

  21. Experiment: Microsoft Tele. ASR • Evaluate on four corpus-independent tests • Collected from sites other than training data providers • Cover major commercial Tele. ASR scenarios

  22. Experiment: Microsoft Tele. ASR • Significant performance improvements across the board • The first time MCE is successfully applied to a 2000 hr. speech database • The Growth Transformation based MCE training is well suited for large-scale modeling tasks
