EARS Progress Update: Improved MPE, Inline Lattice Rescoring, Fast Decoding, Gaussianization & Fisher Experiments
Dan Povey, George Saon, Lidia Mangu, Brian Kingsbury & Geoffrey Zweig
Previous discriminative training setup – Implicit Lattice MMI
• Used a unigram decoding graph and fast decoding to generate state-level "posteriors" (actually relative likelihoods: the delta between the best path using the state and the best path overall)
• Posteriors used directly (without forward-backward) to accumulate "denominator" statistics
• Numerator statistics accumulated as for ML training, with full forward-backward
• Fairly effective, but not the "standard" MMI/MPE recipe
Current discriminative training setup (for standard MMI)
• Create lattices with unigram scores on the links
• Run forward-backward on the lattices (using a fixed state sequence) to get occupation probabilities; reuse the same lattices over multiple iterations
• Accumulate numerator and denominator statistics in a consistent way
• Use a slower training speed (E=2, not 1) and more iterations
• Also implemented MPE
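For reference, the lattice forward-backward that produces the occupation probabilities can be sketched as follows; a minimal sketch assuming a topologically sorted arc list with start/end node ids and a combined (already scaled) acoustic + unigram LM log-likelihood per arc. All names are illustrative, not the actual training code.

```python
# Minimal sketch of lattice forward-backward for per-arc occupation probabilities.
import math
from collections import defaultdict

def logadd(a, b):
    # log(exp(a) + exp(b)) computed stably
    if a == float("-inf"): return b
    if b == float("-inf"): return a
    m = max(a, b)
    return m + math.log1p(math.exp(-abs(a - b)))

def arc_posteriors(arcs, start_node, end_node):
    """arcs: topologically sorted, each with .start, .end, .loglike."""
    alpha = defaultdict(lambda: float("-inf")); alpha[start_node] = 0.0
    beta = defaultdict(lambda: float("-inf")); beta[end_node] = 0.0
    for a in arcs:                               # forward pass
        alpha[a.end] = logadd(alpha[a.end], alpha[a.start] + a.loglike)
    for a in reversed(arcs):                     # backward pass
        beta[a.start] = logadd(beta[a.start], a.loglike + beta[a.end])
    total = alpha[end_node]                      # total lattice log-likelihood
    # occupation probability of each arc = exp(alpha + loglike + beta - total)
    return [math.exp(alpha[a.start] + a.loglike + beta[a.end] - total) for a in arcs]
```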
Experimental conditions
• Same as for the RT'03 evaluation
• 274 hours of Switchboard training data
• Training and test data adapted using an FMLLR transform [from the ML system]
• 60-dim PLP features, VTLN, no MLLR
Basic MMI results (eval'00)
• With word-internal phone context, 142K Gaussians
• 1.4% more improvement (2.7% total) with this setup
MPE results (eval'00)
• Standard MPE is not as good as MMI with this setup
• "MPE+MMI", which is MPE with I-smoothing to the MMI update (not ML), gives 0.5% absolute over MMI
* Conditions differ, treat with caution.
MPE+MMI continued
• "MPE+MMI" involves storing 4 sets of statistics rather than 3: num, den, ml and now also mmi-den. 33% more storage, no extra computation.
• Do a standard MMI update using the ml and mmi-den stats, and use the resulting mean & variance in place of the ML mean & variance in I-smoothing (see the sketch below).
• (Note: I-smoothing is a kind of gradual backoff to a more robust estimate of the mean & variance.)

Probability scaling in MPE
• MPE training leads to an excess of deletions.
• Based on previous experience, this can be due to a probability scale that is too extreme.
• Changing the probability scale from 1/18 to 1/10 gave a ~0.3% win.
• 1/10 is used as the scale in all MPE experiments with left context (see later).
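The update can be sketched roughly as follows. This is a simplified illustration of the standard Extended Baum-Welch update with I-smoothing for diagonal-covariance Gaussians, not the exact IBM recipe: the per-Gaussian constant D is set simply to E·γ_den here (the real update also enforces positive variances when choosing D), and all variable names are illustrative.

```python
import numpy as np

def ebw_update(gamma_num, x_num, s_num,      # "numerator" stats: count, sum, sum of squares
               gamma_den, x_den, s_den,      # "denominator" stats
               mu_old, var_old,              # current model parameters
               mu_prior, var_prior,          # I-smoothing target (ML, or MMI for "MPE+MMI")
               tau=50.0, E=2.0):
    # I-smoothing: add tau "virtual" counts drawn from the prior estimate
    gamma_num = gamma_num + tau
    x_num = x_num + tau * mu_prior
    s_num = s_num + tau * (var_prior + mu_prior ** 2)

    # Per-Gaussian smoothing constant (simplified: E times the denominator count)
    D = E * gamma_den

    denom = gamma_num - gamma_den + D
    mu_new = (x_num - x_den + D * mu_old) / denom
    var_new = (s_num - s_den + D * (var_old + mu_old ** 2)) / denom - mu_new ** 2
    return mu_new, np.maximum(var_new, 1e-4)   # crude variance floor for the sketch

# "MPE+MMI": first do a standard MMI update from the ml and mmi-den stats
# (e.g. tau=0, prior unused), then feed the resulting mean/variance in as the
# I-smoothing prior for the MPE update, instead of the ML estimate.
```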
Fast MMI
• Work presented by Bill Byrne at Eurospeech '03 showed improved results from MMI when the correctly recognized data was excluded (*)
• We achieve a similar effect without hard decisions, by canceling numerator & denominator stats
• I.e., if a state has nonzero occupation probabilities for both the numerator and the denominator at time t, cancel the shared part so that only one remains positive (see the sketch below)
• Gives results as good as or better than the baseline, with half the iterations
• Use E=2 as before
* "Lattice segmentation and Minimum Bayes Risk Discriminative Training", Vlasios Doumpiotis et al., Eurospeech 2003
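A minimal sketch of the cancellation, assuming per-frame, per-state occupation probabilities have already been computed; the function name and calling convention are illustrative.

```python
def cancel_stats(gamma_num, gamma_den):
    """For one (state, time) pair, subtract the shared part of the numerator and
    denominator occupation probabilities so that at most one remains nonzero."""
    shared = min(gamma_num, gamma_den)
    return gamma_num - shared, gamma_den - shared

# Example: a state occupied with probability 0.9 in the numerator and 0.6 in the
# denominator at the same frame contributes only (0.3, 0.0) after cancellation,
# so frames that are already recognized correctly largely drop out of training.
print(cancel_stats(0.9, 0.6))   # -> (0.3, 0.0)
```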
MMI+MPE with cross-word (left) phone context
• Similar-size system (about 160K vs. 142K Gaussians), with cross-word context
• Results shown here connect word traces into lattices indiscriminately (ignoring the constraints of context)
• There is an additional win possible from using context constraints (~0.2%)
* I.e. last year, different setup
MMI and MPE with cross-word context… on RT'03
• The new MMI setup (including "fast MMI") is no better than the old MMI
• About 1.8% improvement on RT'03 from MPE+MMI; MPE alone gives 1.4% improvement
• Those numbers are 2.5% and 2.0% on RT'00
• Comparison with MPE results for Cambridge's 28-mix system (~170K Gaussians) from 2002:
• The most comparable number is a 2.2% improvement (30.4% to 28.2%) on dev01sub using FMLLR ("constrained MLLR") and F-SAT training (*)
* "Automatic transcription of conversational telephone speech", T. Hain et al., submitted to IEEE Transactions on Speech & Audio Processing
Language model rescoring – some preliminary work
• Very large LMs help; e.g. moving from a typical LM to a huge (unpruned) one can help by 0.8% (*)
• It is very hard to build static decoding graphs for huge LMs
• So it is useful to be able to efficiently rescore lattices with a different LM
• Also useful for adaptive language modeling
• … adaptive language modeling gives us ~1% on the "superhuman" test set, and 0.2% on RT'03 (+)
* "Large LM", Nikolai Duta & Richard Schwartz (BBN), presentation at the 2003 EARS meeting, IDIAP, Martigny
+ "Experiments on adaptive LM", Lidia Mangu & Geoff Zweig (IBM), ibid.
Lattice rescoring algorithm
• Taking a lattice and applying a 3- or 4-gram LM involves expanding lattice nodes
• This expansion can take a very large amount of time for some lattices
• It can be controlled by heavy pruning, but that is undesirable if the LMs are quite different
• We developed a lattice LM-rescoring algorithm
• It finds the best path through a lattice given a different LM (*)
* (We are working on a modified algorithm that will generate rescored lattices)
Lattice rescoring algorithm (cont'd)
• Each word-instance in the lattice has k tokens (e.g. k=3)
• Each token has a partial word history ending in the current word, and a traceback to the best predecessor token
[Figure: example lattice fragment with word-instances WHY, WHEN, CAP, THE, CAT; each carries tokens consisting of a partial word history and a score, e.g. "WHY THE, -210", "THE CAT, -310", pointing back to their best predecessor tokens]
Lattice rescoring algorithm (cont'd)
• For each word-instance in the lattice, from left to right…
• … for each token in each predecessor word-instance…
• …… add the current word to that token's word history and work out the LM & acoustic costs;
• …… truncate the word history on the left until it exists in the LM as an LM context;
• …… form a new token pointing back to the predecessor token;
• …… and add the token to the current word-instance's list of tokens.
• Always ensure that no two tokens with the same word history exist (delete the less likely one)
• … and always keep only the k most likely tokens.
• Finally, trace back from the most likely token at the end of the utterance.
• All done within the decoder
• Highly efficient (a code sketch follows below)
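In code, the token passing might look roughly like this. A minimal sketch under assumptions of my own: a lattice object that yields word-instances in topological order (with predecessors, word and acoustic_cost fields) and an LM object whose score() returns a log-probability and whose longest_context() returns the longest suffix of a word history that is a valid LM context. None of these names come from the actual decoder.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Token:
    history: Tuple[str, ...]       # truncated word history ending in this word
    cost: float                    # accumulated LM + acoustic cost (lower is better)
    back: Optional["Token"]        # best predecessor token, for the final traceback

def rescore(lattice, lm, k=3):
    tokens = {}                                        # word-instance -> list of Tokens
    for arc in lattice.word_instances_left_to_right():
        cands = {}                                     # history -> best Token (recombination)
        for pred in (arc.predecessors or [None]):      # None marks the lattice start
            for tok in tokens.get(pred, [Token(("<s>",), 0.0, None)]):
                # lm.score is assumed to return a log-probability, hence the subtraction
                cost = tok.cost + arc.acoustic_cost - lm.score(tok.history, arc.word)
                hist = lm.longest_context(tok.history + (arc.word,))  # left truncation
                if hist not in cands or cost < cands[hist].cost:
                    cands[hist] = Token(hist, cost, tok)
        # keep only the k most likely tokens per word-instance
        tokens[arc] = sorted(cands.values(), key=lambda t: t.cost)[:k]
    best = min((t for a in lattice.final_word_instances() for t in tokens[a]),
               key=lambda t: t.cost)
    return best                                        # follow .back for the word sequence
```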
Lattice rescoring algorithm – experiments
• To verify that it works…
• Took the 4-gram LM used for the RT'03 evaluation and pruned it 13-fold
• Built a decoding graph with the pruned LM, and rescored with the original LM
• Tested on RT'03, MPE-trained system with Gaussianization (*)
* See next slide. Note: all experiments actually keep a full n-1 word history in each token, even when not necessary; this should decrease the accuracy of the algorithm for a given k.
Lattice rescoring algorithm – forward vs. backward
Lattice generation algorithm:
• Both alpha and beta likelihoods are available to the algorithm
• Whenever a word-end state likelihood is within delta of the best path…
• … trace back until a word-beginning state (one whose best predecessor is a word-end) is reached…
• … and create a "word trace" (see the sketch after this list)
• Join all these word traces to form a lattice (using graph connectivity constraints)
• Equivalent to Julian Odell's algorithm (with n = infinity)
• BUT we also add "forward" traces, based on tracing forward from word beginning to word end; these are time-symmetric with the backward traces
• There are fewer forward traces (due to the graph topology)
• Adding forward traces is important (0.6% hit from removing them)
• I don't believe there is much effect on the lattice oracle WER…
• … it is the alignments of the word sequences that are affected
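A rough sketch of how the backward word traces could be collected, assuming state-level hypotheses with backpointers; the data structures and names are illustrative, not the decoder's.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class StateHyp:
    state_id: int
    frame: int
    score: float
    back: Optional["StateHyp"]    # best predecessor hypothesis (None at utterance start)
    is_word_end: bool
    word: str                     # word whose HMM network this state belongs to (assumed)

def collect_back_traces(word_end_hyps: List[StateHyp], best_score: float,
                        delta: float) -> List[Tuple[str, int, int, float]]:
    """For every active word-end hypothesis within `delta` of the best path,
    trace back to the word-beginning state and record a (word, start, end, score) trace."""
    traces = []
    for h in word_end_hyps:
        if h.score < best_score - delta:
            continue
        cur = h
        # walk back until the word-beginning state, i.e. the first hypothesis whose
        # best predecessor is a word-end (or the utterance start)
        while cur.back is not None and not cur.back.is_word_end:
            cur = cur.back
        traces.append((h.word, cur.frame, h.frame, h.score))
    return traces

# Forward traces would be built time-symmetrically, walking forward from word
# beginnings to word ends; all traces are then joined into a lattice using the
# graph connectivity constraints.
```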
RT'03 Sub-realtime Architecture
Improvements in Fast Decoding
• Switched from rank pruning to running beam pruning
• Hypotheses are pruned early, based on a running estimate of the frame maximum during successor expansion, then pruned again once the final maximum is known (see the sketch below)
[Figure: state expansion from time t to t+1 – the running max is updated as each state is expanded, hypotheses falling below (current max − beam) are pruned immediately, and the remaining hypotheses are pruned again at the end of the frame]
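A minimal sketch of the running beam pruning idea, assuming higher scores are better; the names and the successor interface are illustrative, not the decoder's actual API.

```python
def expand_frame(active, successors, beam):
    """active: list of (state, score) for the current frame.
    successors(state, score): yields (next_state, next_score) hypotheses for the next frame."""
    running_max = float("-inf")
    survivors = []
    for state, score in active:
        for nxt, nscore in successors(state, score):
            if nscore < running_max - beam:      # early prune against the running max
                continue
            running_max = max(running_max, nscore)
            survivors.append((nxt, nscore))
    # final prune once the true frame maximum is known
    return [(s, sc) for s, sc in survivors if sc >= running_max - beam]
```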
Runtime vs. WER: beam and rank pruning
• Resulted in a 10% decoding speed-up without loss in accuracy
Reducing the memory requirements
• Run-time memory reduced by storing the minimum traceback information needed to recover the Viterbi word sequence
• Previously we stored information for a full state-level alignment
• Now we store only information for a word-level alignment
• Each alpha entry holds an accumulated cost and a pointer to the originating word token
• Two alpha vectors are kept and "flip-flopped" between frames
• Permanent word-level tokens are created only at active word ends
• No penalty in speed, and dynamic memory reduced by two orders of magnitude (see the sketch below)
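A minimal sketch of the word-level traceback storage, with data-structure names of my own choosing (not the decoder's).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class WordToken:                  # created only when a word-end state is active
    word: str
    end_frame: int
    cost: float
    prev: Optional["WordToken"]   # previous word on the best path

NUM_STATES = 1000                 # illustrative size
AlphaEntry = Tuple[float, Optional[WordToken]]   # (accumulated cost, originating word token)

# Two alpha vectors reused alternately ("flip-flop"): no state-level backpointers
# are kept across frames, only a pointer back to the last word boundary.
alpha: List[List[AlphaEntry]] = [
    [(float("inf"), None)] * NUM_STATES,
    [(float("inf"), None)] * NUM_STATES,
]

def frame_vectors(frame: int) -> Tuple[List[AlphaEntry], List[AlphaEntry]]:
    """Return the (current, next) alpha vectors; their roles swap every frame."""
    return alpha[frame % 2], alpha[(frame + 1) % 2]
```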
Feature space Gaussianization [Saon et al. '04]
• Idea: transform each dimension non-linearly so that it becomes Gaussian distributed
• Motivations:
  • Perform speaker adaptation with non-linear transforms
  • A natural form of non-linear speaker adaptive training (SAT)
  • The effort of modeling the output distribution with GMMs is reduced
• The transform is given by the inverse Gaussian CDF applied to the empirical CDF (see the code sketch below)
Feature space Gaussianization, pictorially
[Figure: the inverse Gaussian CDF (mean 0, variance 1) maps old data values on a percentile axis (16, 50, 84) to new data values in absolute terms (−1, 0, +1), so 68% of the transformed data falls within ±1 std dev]
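In code, the per-dimension transform might look roughly like this; a minimal sketch using a rank-based empirical CDF and SciPy's inverse standard-normal CDF, not the IBM implementation.

```python
import numpy as np
from scipy.stats import norm

def gaussianize(feats: np.ndarray) -> np.ndarray:
    """feats: (num_frames, num_dims) array for one speaker/cluster."""
    n, d = feats.shape
    out = np.empty_like(feats, dtype=float)
    for j in range(d):
        # empirical CDF via ranks, kept strictly inside (0, 1)
        ranks = feats[:, j].argsort().argsort()
        ecdf = (ranks + 0.5) / n
        out[:, j] = norm.ppf(ecdf)      # inverse standard-normal CDF
    return out

# After the transform each dimension is (approximately) standard normal, e.g.
# about 68% of the values fall within +/- 1 std dev, as in the picture above.
```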
An actual transform
Feature space Gaussianization: WER
• Results on RT'03 at the SAT level (no MLLR):
Acoustic Training Data
• Training set size is based on aligned frames only
• Total is 829 hours of speech; 486 hours excluding Fisher
• Training vocabulary includes 61K tokens
• These are the first experiments with Fisher parts 1-4; iteration is likely to improve results
Effect of new Fisher data on WER
• Systems are PLP VTLN SAT, with 60-dim LDA+MLLT features
• One-shot decoding with the IBM 2003 RT-03 LM (interpolated 4-gram) for RT-03; a generic interpolated 3-gram for Superhuman
• Fisher data is in the AM only – not the LM
Summary
Discriminative training
• New MPE is 0.7% better than old MMI on RT'03
• Used the MMI estimate rather than the ML estimate for I-smoothing with MPE (consistently gives about 0.4% improvement over standard MPE)
LM rescoring
• 10x reduction in static graph size – 132M → 10M
• Useful for rescoring with adaptive LMs
Fast decoding
• 10% speedup from incremental application of an absolute pruning threshold
Gaussianization
• 0.6% improvement on top of MPE
• Useful on a variety of tasks (e.g. C&C in cars)
Fisher data
• 1.3% improvement over last year without it (AM only)
• Not useful in a broader context