Discriminative Learning for Hidden Markov Models Li Deng Microsoft Research EE 516; UW Spring 2009
Minimum Classification Error (MCE) • The objective function of MCE training is a smoothed recognition error rate. • Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., GPD, generalized probabilistic descent). • In this work we propose a Growth Transformation (GT) based method for MCE-based model estimation.
Automatic Speech Recognition (ASR) • Speech recognition: find the word string that best explains the observed acoustic signal, using the decision rule sketched below.
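A hedged sketch of the standard MAP decision rule this slide refers to (the notation on the original slide may differ):

Ŝ = argmax_S p(S | X) = argmax_S p(X | S, Λ) · P(S)

where X is the acoustic observation sequence, p(X | S, Λ) is the acoustic model (HMM) score, and P(S) is the language model score.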
Models (feature functions) in ASR • ASR in the log-linear framework (a sketch is given below). • Λ is the parameter set of the acoustic model (HMM), which is the parameter set of interest in MCE training in this work.
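A hedged sketch of the log-linear framing mentioned above, assuming the usual choice of the log acoustic score and the log language-model score as feature functions with weights α and β:

p(S | X) = exp{ α·log p(X | S, Λ) + β·log P(S) } / ∑_{S′} exp{ α·log p(X | S′, Λ) + β·log P(S′) }

Only Λ, the HMM parameter set inside the acoustic feature function, is re-estimated by MCE here.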
MCE: Mis-classification measure • Define the misclassification measure (here for the case of using the correct string and the top-one incorrect competing string), as sketched below. • s_{r,1}: the top-one incorrect competing string (not equal to S_r).
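In this one-best setting the misclassification measure is conventionally written as (a hedged reconstruction; log joint scores including the language model are assumed):

d_r(X_r, Λ) = −log p(X_r, S_r | Λ) + log p(X_r, s_{r,1} | Λ)

so d_r(X_r, Λ) > 0 exactly when the best incorrect string out-scores the correct transcription S_r.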
MCE: Loss function • Classification error indicator: d_r(X_r, Λ) > 0 ⇒ 1 classification error; d_r(X_r, Λ) < 0 ⇒ 0 classification errors. • Loss function: a smoothed error-count function.
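The smoothed error-count function is conventionally taken to be a sigmoid (a hedged sketch; the slope α > 0 and offset β are design parameters):

l(d_r(X_r, Λ)) = 1 / (1 + exp(−α·d_r(X_r, Λ) + β))

which approaches 1 for large positive d_r (an error) and 0 for large negative d_r (a correct decision).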
MCE: Objective function • MCE objective function: LMCE(Λ) is the smoothed recognition error rate on the string (token) level. • The model (acoustic model) is trained to minimize LMCE(Λ), i.e., Λ* = argmin_Λ { LMCE(Λ) }.
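Written out over the R training utterances, the objective is (a hedged sketch, matching the per-utterance loss above):

LMCE(Λ) = ∑_{r=1..R} l(d_r(X_r, Λ))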
MCE: Optimization • Growth Transformation based MCE: if Λ = T(Λ′) ensures P(Λ) > P(Λ′), i.e., P(Λ) grows, then T(∙) is called a growth transformation of Λ for P(Λ). • Chain of reformulations:
Minimizing LMCE(Λ) = ∑ l(d(∙))
⇔ Maximizing P(Λ) = G(Λ)/H(Λ)
⇒ Maximizing F(Λ;Λ′) = G − P′×H + D
⇒ Re-writing F(Λ;Λ′) = ∑ f(∙)
⇒ Maximizing U(Λ;Λ′) = ∑ f′(∙) log f(∙)
⇒ GT formula: ∂U(∙)/∂Λ = 0, giving Λ = T(Λ′)
MCE: Optimization • Re-write the MCE loss function as a ratio of joint likelihoods (a sketch is given below). • Then, minimizing LMCE(Λ) is equivalent to maximizing Q(Λ), defined below.
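A hedged sketch of the rewritten loss (taking α = 1 and β = 0 in the sigmoid) and of the resulting Q(Λ):

l(d_r(X_r, Λ)) = p(X_r, s_{r,1} | Λ) / [ p(X_r, S_r | Λ) + p(X_r, s_{r,1} | Λ) ]

Q(Λ) = R − LMCE(Λ) = ∑_{r=1..R} p(X_r, S_r | Λ) / [ p(X_r, S_r | Λ) + p(X_r, s_{r,1} | Λ) ]

so maximizing Q(Λ) is exactly minimizing LMCE(Λ).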
MCE: Optimization • Q(Λ) is further re-formulated into a single rational (fractional) function P(Λ) = G(Λ)/H(Λ), as sketched below.
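One way to combine the per-utterance ratios in Q(Λ) over a common denominator, consistent with P(Λ) = G(Λ)/H(Λ) above (a hedged sketch; the grouping on the original slide may differ):

G(Λ) = ∑_{s_1,…,s_R} C(s_1,…,s_R) · ∏_{r=1..R} p(X_r, s_r | Λ)

H(Λ) = ∑_{s_1,…,s_R} ∏_{r=1..R} p(X_r, s_r | Λ)

where each s_r ranges over {S_r, s_{r,1}} and C(s_1,…,s_R) counts how many s_r equal the correct transcription S_r. With this choice, P(Λ) equals Q(Λ), now expressed as a single ratio of two Λ-dependent quantities.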
MCE: Optimization • Increasing P(Λ) can be achieved by maximizing F(Λ;Λ′) = G(Λ) − P(Λ′)·H(Λ) + D, as long as D is a Λ-independent constant (Λ′ is the parameter set obtained from the last iteration). • Substituting G(∙) and H(∙) into F(∙) gives the form used on the next slide.
MCE: Optimization • Reformulate F(Λ;Λ′) as a sum over the hidden variables, F(Λ;Λ′) = ∑ f(χ, q, s, Λ; Λ′), so that F(Λ;Λ′) is ready for EM-style optimization. • Note: Γ(Λ′) is a constant, and log p(χ, q | s, Λ) is easy to decompose.
MCE: Optimization • Increasing F(Λ;Λ′) can be achieved by maximizing the auxiliary function U(Λ;Λ′) = ∑ f(χ, q, s, Λ′; Λ′) log f(χ, q, s, Λ; Λ′). • Use the extended Baum–Welch procedure for the E-step. • log f(χ, q, s, Λ; Λ′) is decomposable w.r.t. Λ, so the M-step is easy to compute. • The growth transformation of Λ for the CDHMM follows by setting ∂U(∙)/∂Λ = 0; the resulting formulas are given on the next slide.
MCE: Model estimation formulas • For a Gaussian mixture CDHMM, the GT re-estimates of the mean and covariance of Gaussian m take the EBW-style form sketched below.
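A hedged reconstruction of the re-estimation formulas (the exact symbols on the original slide may differ). Let Δγ_m(r,t) denote the difference between the occupancy of Gaussian m at frame t of utterance r computed on the correct string S_r and on the competing string s_{r,1} (weighted by the loss-related terms from F(∙)), and let x_{r,t} be the corresponding feature vector. Then

μ_m = [ ∑_r ∑_t Δγ_m(r,t)·x_{r,t} + D_m·μ′_m ] / [ ∑_r ∑_t Δγ_m(r,t) + D_m ]

Σ_m = [ ∑_r ∑_t Δγ_m(r,t)·(x_{r,t} − μ_m)(x_{r,t} − μ_m)ᵀ + D_m·Σ′_m + D_m·(μ_m − μ′_m)(μ_m − μ′_m)ᵀ ] / [ ∑_r ∑_t Δγ_m(r,t) + D_m ]

where μ′_m and Σ′_m are the previous-iteration parameters and D_m is the Gaussian-specific constant discussed on the next slide.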
MCE: Model estimation formulas • Setting of D_m: • Theoretically, set D_m so that f(χ, q, s, Λ; Λ′) > 0. • Empirically, a larger, heuristically chosen D_m is used to ensure stable convergence.
MCE: Workflow • Training utterances + last-iteration model Λ′ → Recognition → competing strings. • Competing strings + training transcripts + Λ′ → GT-MCE update → new model Λ. • The new model Λ becomes Λ′ for the next iteration.
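To make the update step of this workflow concrete, below is a minimal Python/numpy sketch that applies the EBW-style GT update from the previous slides to pre-accumulated statistics for one diagonal-covariance Gaussian. All function and variable names, the smoothing heuristic for D_m, and the toy numbers are illustrative assumptions, not the actual implementation used in this work.

```python
import numpy as np

def gt_mce_update(delta_gamma, delta_x, delta_xx, mu_old, var_old, E=2.0):
    """EBW-style GT-MCE update for one diagonal-covariance Gaussian.

    delta_gamma : float, sum over frames of (numerator - denominator) occupancy
    delta_x     : (d,) occupancy-weighted sum of feature vectors
    delta_xx    : (d,) occupancy-weighted sum of squared feature values
    mu_old, var_old : previous-iteration mean and diagonal variance
    E           : heuristic smoothing factor controlling D_m (assumption)
    """
    # D_m keeps the denominator positive even when delta_gamma is negative.
    D = E * max(abs(delta_gamma), 1.0)
    denom = delta_gamma + D
    mu_new = (delta_x + D * mu_old) / denom
    # Same covariance formula as above, rewritten with raw second-order stats.
    var_new = (delta_xx + D * (var_old + mu_old ** 2)) / denom - mu_new ** 2
    var_new = np.maximum(var_new, 1e-6)  # variance floor for numerical safety
    return mu_new, var_new

# Toy usage with synthetic statistics for a 3-dimensional Gaussian.
mu0 = np.array([0.0, 1.0, -1.0])
var0 = np.ones(3)
mu1, var1 = gt_mce_update(delta_gamma=0.8,
                          delta_x=np.array([0.5, 1.2, -0.4]),
                          delta_xx=np.array([1.0, 2.0, 1.5]),
                          mu_old=mu0, var_old=var0)
print(mu1, var1)
```

In a full system these statistics would come from forward–backward passes over the correct and competing strings, and the update would be applied to every Gaussian in the model each iteration.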
Experiment: TI-DIGITS • Vocabulary: “1” to “9”, plus “oh” and “zero” • Training set: 8623 utterances / 28329 words • Test set: 8700 utterances / 28583 words • 33-dimensional spectrum feature: energy + 10 MFCCs, plus ∆ and ∆∆ features • Model: Continuous Density HMMs • Total number of Gaussian components: 3284
Experiment: TI-DIGITS • GT-MCE vs. the ML (maximum likelihood) baseline: • Obtains the lowest error rate on this task • Reduces recognition Word Error Rate (WER) by 23% • Fast and stable convergence
Experiment: Microsoft Tele. ASR • Microsoft Speech Server – ENUTEL • A telephony speech recognition system • Training set: 2000 hours of speech / 2.7 million utterances • 33-dim spectrum features: (E+MFCCs) +∆ +∆∆ • Acoustic Model: Gaussian mixture HMM • Total number of Gaussian components: 100K • Vocabulary: 120K (delivered vendor lexicon) • CPU Cluster: 100 CPUs @ 1.8GHz – 3.4GHz • Training Cost: 4~5 hours per iteration
Experiment: Microsoft Tele. ASR • Evaluated on four corpus-independent test sets • Collected from sites other than the training data providers • Covering major commercial telephony ASR scenarios
Experiment: Microsoft Tele. ASR • Significant performance improvements across the board • The first time MCE has been successfully applied to a 2000-hour speech database • The Growth Transformation based MCE training is well suited for large-scale modeling tasks