Discriminative Learning for Hidden Markov Models Li Deng Microsoft Research EE 516; UW Spring 2009
Minimum Classification Error (MCE) • The objective function of MCE training is a smoothed recognition error rate. • Traditionally, the MCE criterion is optimized through stochastic gradient descent (e.g., GPD, generalized probabilistic descent). • In this work we propose a Growth Transformation (GT) based method for MCE-based model estimation.
Automatic Speech Recognition (ASR) • Speech recognition: find the word string that best explains the observed acoustic signal, using the decision rule sketched below.
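A hedged sketch of the standard MAP decision rule this slide refers to (the notation on the original slide may differ):

Ŝ = argmax_S p(S | X) = argmax_S p(X | S, Λ) · P(S)

where X is the acoustic observation sequence, p(X | S, Λ) is the acoustic model (HMM) score, and P(S) is the language model score.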
Models (feature functions) in ASR • ASR in the log-linear framework (a sketch is given below). • Λ is the parameter set of the acoustic model (HMM), which is the parameter set of interest in MCE training in this work.
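A hedged sketch of the log-linear framing mentioned above, assuming the usual choice of the log acoustic score and the log language-model score as feature functions with weights α and β:

p(S | X) = exp{ α·log p(X | S, Λ) + β·log P(S) } / ∑_{S′} exp{ α·log p(X | S′, Λ) + β·log P(S′) }

Only Λ, the HMM parameter set inside the acoustic feature function, is re-estimated by MCE here.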
MCE: Mis-classification measure • Define the misclassification measure (here for the case of using the correct string and the top-one incorrect competing string), as sketched below. • s_{r,1}: the top-one incorrect competing string (not equal to S_r).
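In this one-best setting the misclassification measure is conventionally written as (a hedged reconstruction; log joint scores including the language model are assumed):

d_r(X_r, Λ) = −log p(X_r, S_r | Λ) + log p(X_r, s_{r,1} | Λ)

so d_r(X_r, Λ) > 0 exactly when the best incorrect string out-scores the correct transcription S_r.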
MCE: Loss function • Classification error indicator: d_r(X_r, Λ) > 0 ⇒ 1 classification error; d_r(X_r, Λ) < 0 ⇒ 0 classification errors. • Loss function: a smoothed error-count function.
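The smoothed error-count function is conventionally taken to be a sigmoid (a hedged sketch; the slope α > 0 and offset β are design parameters):

l(d_r(X_r, Λ)) = 1 / (1 + exp(−α·d_r(X_r, Λ) + β))

which approaches 1 for large positive d_r (an error) and 0 for large negative d_r (a correct decision).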
MCE: Objective function • MCE objective function: LMCE(Λ) is the smoothed recognition error rate on the string (token) level. • The model (acoustic model) is trained to minimize LMCE(Λ), i.e., Λ* = argmin_Λ { LMCE(Λ) }.
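Written out over the R training utterances, the objective is (a hedged sketch, matching the per-utterance loss above):

LMCE(Λ) = ∑_{r=1..R} l(d_r(X_r, Λ))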
MCE: Optimization • Growth Transformation based MCE: if Λ = T(Λ′) ensures P(Λ) > P(Λ′), i.e., P(Λ) grows, then T(∙) is called a growth transformation of Λ for P(Λ). • Chain of reformulations:
Minimizing LMCE(Λ) = ∑ l(d(∙))
⇔ Maximizing P(Λ) = G(Λ)/H(Λ)
⇒ Maximizing F(Λ;Λ′) = G − P′×H + D
⇒ Re-writing F(Λ;Λ′) = ∑ f(∙)
⇒ Maximizing U(Λ;Λ′) = ∑ f′(∙) log f(∙)
⇒ GT formula: ∂U(∙)/∂Λ = 0, giving Λ = T(Λ′)
MCE: Optimization • Re-write the MCE loss function as a ratio of joint likelihoods (a sketch is given below). • Then, minimizing LMCE(Λ) is equivalent to maximizing Q(Λ), defined below.
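A hedged sketch of the rewritten loss (taking α = 1 and β = 0 in the sigmoid) and of the resulting Q(Λ):

l(d_r(X_r, Λ)) = p(X_r, s_{r,1} | Λ) / [ p(X_r, S_r | Λ) + p(X_r, s_{r,1} | Λ) ]

Q(Λ) = R − LMCE(Λ) = ∑_{r=1..R} p(X_r, S_r | Λ) / [ p(X_r, S_r | Λ) + p(X_r, s_{r,1} | Λ) ]

so maximizing Q(Λ) is exactly minimizing LMCE(Λ).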
MCE: Optimization • Q(Λ) is further re-formulated into a single rational (fractional) function P(Λ) = G(Λ)/H(Λ), as sketched below.
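One way to combine the per-utterance ratios in Q(Λ) over a common denominator, consistent with P(Λ) = G(Λ)/H(Λ) above (a hedged sketch; the grouping on the original slide may differ):

G(Λ) = ∑_{s_1,…,s_R} C(s_1,…,s_R) · ∏_{r=1..R} p(X_r, s_r | Λ)

H(Λ) = ∑_{s_1,…,s_R} ∏_{r=1..R} p(X_r, s_r | Λ)

where each s_r ranges over {S_r, s_{r,1}} and C(s_1,…,s_R) counts how many s_r equal the correct transcription S_r. With this choice, P(Λ) equals Q(Λ), now expressed as a single ratio of two Λ-dependent quantities.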
MCE: Optimization • Increasing P(Λ) can be achieved by maximizing F(Λ;Λ′) = G(Λ) − P(Λ′)·H(Λ) + D, as long as D is a Λ-independent constant (Λ′ is the parameter set obtained from the last iteration). • Substituting G(∙) and H(∙) into F(∙) gives the form used on the next slide.
MCE: Optimization • Reformulate F(Λ;Λ′) as a sum over the hidden variables, F(Λ;Λ′) = ∑ f(χ, q, s, Λ; Λ′), so that F(Λ;Λ′) is ready for EM-style optimization. • Note: Γ(Λ′) is a constant, and log p(χ, q | s, Λ) is easy to decompose.
MCE: Optimization • Increasing F(Λ;Λ′) can be achieved by maximizing the auxiliary function U(Λ;Λ′) = ∑ f(χ, q, s, Λ′; Λ′) log f(χ, q, s, Λ; Λ′). • Use the extended Baum–Welch procedure for the E-step. • log f(χ, q, s, Λ; Λ′) is decomposable w.r.t. Λ, so the M-step is easy to compute. • The growth transformation of Λ for the CDHMM follows by setting ∂U(∙)/∂Λ = 0; the resulting formulas are given on the next slide.
MCE: Model estimation formulas • For a Gaussian mixture CDHMM, the GT re-estimates of the mean and covariance of Gaussian m take the EBW-style form sketched below.
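A hedged reconstruction of the re-estimation formulas (the exact symbols on the original slide may differ). Let Δγ_m(r,t) denote the difference between the occupancy of Gaussian m at frame t of utterance r computed on the correct string S_r and on the competing string s_{r,1} (weighted by the loss-related terms from F(∙)), and let x_{r,t} be the corresponding feature vector. Then

μ_m = [ ∑_r ∑_t Δγ_m(r,t)·x_{r,t} + D_m·μ′_m ] / [ ∑_r ∑_t Δγ_m(r,t) + D_m ]

Σ_m = [ ∑_r ∑_t Δγ_m(r,t)·(x_{r,t} − μ_m)(x_{r,t} − μ_m)ᵀ + D_m·Σ′_m + D_m·(μ_m − μ′_m)(μ_m − μ′_m)ᵀ ] / [ ∑_r ∑_t Δγ_m(r,t) + D_m ]

where μ′_m and Σ′_m are the previous-iteration parameters and D_m is the Gaussian-specific constant discussed on the next slide.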
MCE: Model estimation formulas • Setting of D_m: • Theoretically, set D_m so that f(χ, q, s, Λ; Λ′) > 0. • Empirically, a larger, heuristically chosen D_m is used to ensure stable convergence.
MCE: Workflow • Training utterances + last-iteration model Λ′ → Recognition → competing strings. • Competing strings + training transcripts + Λ′ → GT-MCE update → new model Λ. • The new model Λ becomes Λ′ for the next iteration.
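To make the update step of this workflow concrete, below is a minimal Python/numpy sketch that applies the EBW-style GT update from the previous slides to pre-accumulated statistics for one diagonal-covariance Gaussian. All function and variable names, the smoothing heuristic for D_m, and the toy numbers are illustrative assumptions, not the actual implementation used in this work.

```python
import numpy as np

def gt_mce_update(delta_gamma, delta_x, delta_xx, mu_old, var_old, E=2.0):
    """EBW-style GT-MCE update for one diagonal-covariance Gaussian.

    delta_gamma : float, sum over frames of (numerator - denominator) occupancy
    delta_x     : (d,) occupancy-weighted sum of feature vectors
    delta_xx    : (d,) occupancy-weighted sum of squared feature values
    mu_old, var_old : previous-iteration mean and diagonal variance
    E           : heuristic smoothing factor controlling D_m (assumption)
    """
    # D_m keeps the denominator positive even when delta_gamma is negative.
    D = E * max(abs(delta_gamma), 1.0)
    denom = delta_gamma + D
    mu_new = (delta_x + D * mu_old) / denom
    # Same covariance formula as above, rewritten with raw second-order stats.
    var_new = (delta_xx + D * (var_old + mu_old ** 2)) / denom - mu_new ** 2
    var_new = np.maximum(var_new, 1e-6)  # variance floor for numerical safety
    return mu_new, var_new

# Toy usage with synthetic statistics for a 3-dimensional Gaussian.
mu0 = np.array([0.0, 1.0, -1.0])
var0 = np.ones(3)
mu1, var1 = gt_mce_update(delta_gamma=0.8,
                          delta_x=np.array([0.5, 1.2, -0.4]),
                          delta_xx=np.array([1.0, 2.0, 1.5]),
                          mu_old=mu0, var_old=var0)
print(mu1, var1)
```

In a full system these statistics would come from forward–backward passes over the correct and competing strings, and the update would be applied to every Gaussian in the model each iteration.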
Experiment: TI-DIGITS • Vocabulary: “1” to “9”, plus “oh” and “zero” • Training set: 8623 utterances / 28329 words • Test set: 8700 utterances / 28583 words • 33-dimensional spectrum feature: energy + 10 MFCCs, plus ∆ and ∆∆ features • Model: Continuous Density HMMs • Total number of Gaussian components: 3284
Experiment: TI-DIGITS • GT-MCE vs. the ML (maximum likelihood) baseline: • Obtains the lowest error rate on this task • Reduces recognition Word Error Rate (WER) by 23% • Fast and stable convergence
Experiment: Microsoft Tele. ASR • Microsoft Speech Server – ENUTEL • A telephony speech recognition system • Training set: 2000 hours of speech / 2.7 million utterances • 33-dim spectrum features: (E+MFCCs) +∆ +∆∆ • Acoustic Model: Gaussian mixture HMM • Total number of Gaussian components: 100K • Vocabulary: 120K (delivered vendor lexicon) • CPU Cluster: 100 CPUs @ 1.8GHz – 3.4GHz • Training Cost: 4~5 hours per iteration
Experiment: Microsoft Tele. ASR • Evaluated on four corpus-independent test sets • Collected from sites other than the training data providers • Covering major commercial telephony ASR scenarios
Experiment: Microsoft Tele. ASR • Significant performance improvements across the board • The first time MCE has been successfully applied to a 2000-hour speech database • The Growth Transformation based MCE training is well suited for large-scale modeling tasks