

  1. EARS Progress Update: Improved MPE, Inline Lattice Rescoring, Fast Decoding, Gaussianization & Fisher Experiments
  Dan Povey, George Saon, Lidia Mangu, Brian Kingsbury & Geoffrey Zweig

  2. Part 1: Improved MPE

  3. Previous discriminative training setup – Implicit Lattice MMI
  • Used unigram decoding graph and fast decoding to generate state-level "posteriors" (actually relative likelihoods: delta between best path using the state and best path overall)
  • Posteriors used directly (without forward-backward) to accumulate "denominator" statistics
  • Numerator statistics accumulated as for ML training, with full forward-backward
  • Fairly effective, but not "MMI/MPE standard"

  4. Current discriminative training setup (for standard MMI)
  • Creating lattices with unigram scores on links
  • Forward-backward on lattices (using fixed state sequence) to get occupation probabilities; use same lattices on multiple iterations
  • Creating num + den stats in a consistent way
  • Use slower training speed (E=2, not 1) and more iterations
  • Also implemented MPE
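
For reference, a sketch of the standard MMI objective this setup maximizes, in the form it is usually written (the acoustic scale kappa and the exact notation below are assumptions, not taken from the slides):

\[ \mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_r \log \frac{p_\lambda(O_r \mid \mathcal{M}_{w_r})^{\kappa}\, P(w_r)}{\sum_{w} p_\lambda(O_r \mid \mathcal{M}_{w})^{\kappa}\, P(w)} \]

where O_r is the r-th training utterance, w_r its reference transcript, and the denominator sum runs over the competing word sequences encoded in the unigram lattice.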

  5. Experimental conditions
  • Same as for RT'03 evaluation
  • 274 hours of Switchboard training data
  • Training + test data adapted using FMLLR transform [from ML system]
  • 60-dim PLPs, VTLN, no MLLR

  6. Basic MMI results (eval'00)
  • With word-internal phone context, 142K Gaussians
  • 1.4% more improvement (2.7% total) with this setup

  7. MPE results (eval'00)
  • Standard MPE is not as good as MMI with this setup
  • "MPE+MMI", which is MPE with I-smoothing to the MMI update (not ML), gives 0.5% absolute over MMI
  * Conditions differ, treat with caution.
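
For comparison, the MPE criterion in the form it is usually quoted (the notation is an assumption; A(w, w_r) is the raw phone accuracy of hypothesis w measured against the reference w_r):

\[ \mathcal{F}_{\mathrm{MPE}}(\lambda) = \sum_r \frac{\sum_{w} p_\lambda(O_r \mid \mathcal{M}_{w})^{\kappa}\, P(w)\, A(w, w_r)}{\sum_{w'} p_\lambda(O_r \mid \mathcal{M}_{w'})^{\kappa}\, P(w')} \]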

  8. MPE+MMI continued
  • "MPE+MMI" involves storing 4 sets of statistics rather than 3: num, den, ml and now also mmi-den. 33% more storage, no extra computation
  • Do the standard MMI update using the ml and mmi-den stats; use the resulting mean & variance in place of the ML mean & variance in I-smoothing (see the update sketched below)
  • (Note: I-smoothing is a kind of gradual backoff to a more robust estimate of mean & variance)
  Probability scaling in MPE
  • MPE training leads to an excess of deletions
  • Based on previous experience, this can be due to a probability scale that is too extreme
  • Changing the probability scale from 1/18 to 1/10 gave a ~0.3% win
  • 1/10 used as the scale on all MPE experiments with left context (see later)
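
As a sketch of where the MMI estimate enters: the extended Baum-Welch mean update with I-smoothing, as it is commonly written for MPE (the exact form used in this system is an assumption), is

\[ \hat{\mu}_{jm} = \frac{\theta^{\mathrm{num}}_{jm}(O) - \theta^{\mathrm{den}}_{jm}(O) + D_{jm}\,\mu_{jm} + \tau\,\mu^{\mathrm{prior}}_{jm}}{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm} + \tau} \]

Standard MPE takes the prior mean (and variance, updated analogously) from the ML statistics; "MPE+MMI" instead takes it from a standard MMI update computed with the ml and mmi-den statistics.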

  9. Fast MMI
  • Work presented by Bill Byrne at Eurospeech'03 showed improved results from MMI where the correctly recognized data was excluded*
  • Achieve a similar effect without hard decisions, by canceling num & den stats
  • I.e., if a state has nonzero occupation probabilities for both numerator and denominator at time t, cancel the shared part so only one is positive (a minimal sketch follows this slide)
  • Gives as good or better results than the baseline, with half the iterations
  • Use E=2 as before
  * "Lattice segmentation and Minimum Bayes Risk Discriminative Training", Vlasios Doumpiotis et al., Eurospeech 2003
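
A minimal Python sketch of the cancellation step, assuming per-frame state occupation probabilities are held in dictionaries (the data layout and function name are hypothetical, not the IBM implementation):

    def cancel_num_den(gamma_num, gamma_den):
        """gamma_num, gamma_den: dicts mapping (state, t) -> occupation prob."""
        for key in set(gamma_num) & set(gamma_den):
            shared = min(gamma_num[key], gamma_den[key])
            gamma_num[key] -= shared   # after this, at most one stays positive
            gamma_den[key] -= shared
        return gamma_num, gamma_den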

  10. MMI+MPE with cross-word (left) phone context
  • Similar-size system (about 160K vs 142K Gaussians), with cross-word context
  • Results shown here connect word-traces into lattices indiscriminately (ignoring constraints of context)
  • There is an additional win possible from using context constraints (~0.2%)
  * I.e. last year, different setup

  11. MMI and MPE with cross-word context... on RT'03
  • The new MMI setup (including "fast MMI") is no better than the old MMI
  • About 1.8% improvement on RT'03 from MPE+MMI; MPE alone gives 1.4% improvement
  • Those numbers are 2.5% and 2.0% on RT'00
  • Comparison with MPE results in Cambridge's 28-mix system (~170K Gaussians) from 2002:
  • Most comparable number is 2.2% improvement (30.4% to 28.2%) on dev01sub using FMLLR ("constrained MLLR") and F-SAT training (*)
  * "Automatic transcription of conversational telephone speech", T. Hain et al., submitted to IEEE Transactions on Speech & Audio Processing

  12. Part 2: Inline Lattice Rescoring

  13. Language model rescoring – some preliminary work
  • Very large LMs help, e.g. moving from a typical to a huge (unpruned) LM can help by 0.8% (*)
  • Very hard to build static decoding graphs for huge LMs
  • Good to be able to efficiently rescore lattices with a different LM
  • Also useful for adaptive language modeling
  • ... adaptive language modeling gives us ~1% on the "superhuman" test set, and 0.2% on RT'03 (+)
  * "Large LM", Nikolai Duta & Richard Schwartz (BBN), presentation at the 2003 EARS meeting, IDIAP, Martigny
  + "Experiments on adaptive LM", Lidia Mangu & Geoff Zweig (IBM), ibid.

  14. Lattice rescoring algorithm
  • Taking a lattice and applying a 3- or 4-gram LM involves expanding lattice nodes
  • This expansion can take very large amounts of time for some lattices
  • Can be solved by heavy pruning, but this is undesirable if the LMs are quite different
  • → Developed a lattice LM-rescoring algorithm
  • Finds the best path through a lattice given a different LM (*)
  * (We are working on a modified algorithm that will generate rescored lattices)

  15. Lattice rescoring algorithm (cont'd)
  • Each word-instance in the lattice has k tokens (e.g. k=3)
  • Each token has a partial word history ending in the current word, and a traceback to the best predecessor token
  [Figure: example lattice over the word-instances WHY / WHEN / CAP / THE / CAT, with tokens such as "WHY, -101", "WHY THE, -210" and "THE CAT, -310" showing partial word histories and their scores]

  16. Lattice rescoring algorithm (cont'd)
  • For each word-instance in the lattice, from left to right…
  • … for each token in each predecessor word-instance…
  • …… add the current word to that token's word history and work out LM & acoustic costs;
  • …… delete word left-context until the word history exists in the LM as an LM context
  • …… form a new token pointing back to the predecessor token
  • …… and add the token to the current word-instance's list of tokens
  • Always ensure that no two tokens with the same word history exist (delete the less likely one)
  • … and always keep only the k most likely tokens
  • Finally, trace back from the most likely token at the end of the utterance
  • All done within the decoder
  • Highly efficient (a minimal code sketch follows this slide)
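
A minimal Python sketch of the token-passing procedure above. Costs are negative log-probabilities (smaller is better); the lattice and LM interfaces used here (words_in_topological_order, predecessors, label, acoustic_cost, lm.cost, lm.has_context, final_words) are hypothetical stand-ins, not the IBM decoder's API:

    import heapq

    def rescore_lattice(lattice, lm, k=3):
        tokens = {}     # word-instance -> list of (cost, history, backpointer)
        for w in lattice.words_in_topological_order():
            cands = {}  # distinct word history -> best (cost, backpointer)
            for pred in (w.predecessors or [None]):
                for tok in tokens.get(pred, [(0.0, (), None)]):
                    new_hist = tok[1] + (w.label,)
                    cost = tok[0] + w.acoustic_cost + lm.cost(new_hist)
                    # drop word left-context until the history is an LM context
                    while len(new_hist) > 1 and not lm.has_context(new_hist):
                        new_hist = new_hist[1:]
                    if new_hist not in cands or cost < cands[new_hist][0]:
                        cands[new_hist] = (cost, tok)   # best token per history
            # keep only the k most likely tokens for this word-instance
            tokens[w] = heapq.nsmallest(
                k, [(c, h, bp) for h, (c, bp) in cands.items()],
                key=lambda t: t[0])
        # trace back from the most likely token at the end of the utterance
        best = min((t for wf in lattice.final_words() for t in tokens[wf]),
                   key=lambda t: t[0])
        words, tok = [], best
        while tok is not None and tok[1]:
            words.append(tok[1][-1])
            tok = tok[2]
        return best[0], words[::-1]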

  17. Lattice rescoring algorithm – experiments
  • To verify that it works…
  • Took the 4-gram LM used for the RT'03 evaluation and pruned it 13-fold
  • Built a decoding graph, and rescored with the original LM
  • Testing on RT'03, MPE-trained system with Gaussianization
  * See next slide
  Note: all experiments actually include an n-1 word history in each token, even when not necessary. This should decrease the accuracy of the algorithm for a given k.

  18. Lattice rescoring algorithm – forward vs backward
  Lattice generation algorithm:
  • Both alpha and beta likelihoods are available to the algorithm
  • Whenever a word-end state's likelihood is within delta of the best path…
  • … trace back until a word-beginning state whose best predecessor is a word-end state is reached
  • … and create a "word trace"
  • Join all these word traces to form a lattice (using graph connectivity constraints)
  • Equivalent to Julian Odell's algorithm (with n = infinity)
  • BUT we also add "forwards" traces, based on tracing forward from word beginning to word end; time-symmetric with backtraces (a sketch of the backward pass follows this slide)
  • There are fewer forwards traces (due to graph topology)
  • Adding forwards traces is important (0.6% hit from removing them)
  • I don't believe there is much effect on lattice oracle WER…
  • … it is the alignments of word sequences that are affected
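
A minimal Python sketch of collecting the backward word traces, assuming alpha and beta map (state, frame) to forward/backward log-likelihoods and best_pred maps (state, frame) to its best predecessor; all names and the data layout here are assumptions, not the actual decoder:

    def collect_back_traces(alpha, beta, best_pred, is_word_end, is_word_begin, delta):
        best_total = max(alpha[key] + beta[key] for key in alpha)
        traces = []
        for (s, t) in alpha:
            # word-end state within delta of the best path?
            if is_word_end(s) and alpha[(s, t)] + beta[(s, t)] >= best_total - delta:
                state, frame = s, t
                # trace back until a word-beginning state whose best
                # predecessor is a word-end state is reached
                while True:
                    prev = best_pred.get((state, frame))
                    if prev is None or (is_word_begin(state) and is_word_end(prev[0])):
                        break
                    state, frame = prev
                traces.append(((state, frame), (s, t)))   # one backward "word trace"
        return traces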

  19. Part 3: Progress in Fast Decoding

  20. RT'03 Sub-realtime Architecture

  21. Improvements in Fast Decoding
  • Switched from rank pruning to running beam pruning
  • Hypotheses are pruned early, based on a running max estimate during successor expansion, then pruned again after the frame's final max is known (a minimal sketch follows this slide)
  [Figure: timeline from t to t+1 showing the running max updated as states are expanded, states pruned against the current max minus the beam, and a final prune at the end of the frame]
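
A minimal Python sketch of running beam pruning (scores are log-likelihoods; 'successors' and the '.score' attribute are hypothetical stand-ins for the decoder's own data structures):

    def expand_frame(active_hyps, successors, beam):
        running_max = float("-inf")
        survivors = []
        for hyp in active_hyps:
            for nxt in successors(hyp):
                # early pruning against the running max estimate
                if nxt.score < running_max - beam:
                    continue
                running_max = max(running_max, nxt.score)
                survivors.append(nxt)
        # prune again once the frame's final max is known
        return [h for h in survivors if h.score >= running_max - beam]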

  22. Runtime vs. WER: beam and rank pruning
  • Resulted in a 10% decoding speed-up without loss in accuracy

  23. Reducing the memory requirements
  • Run-time memory reduction by storing minimum traceback information for Viterbi word-sequence recovery
  • Previously we stored information for a full state-level alignment
  • Now we store only information for word-level alignment
  • Each alpha entry has an accumulated cost and a pointer to its originating word token
  • Two alpha vectors used in "flip-flop" fashion (see the sketch after this slide)
  • Permanent word-level tokens created only at active word-ends
  • No penalty in speed, and dynamic memory reduced by two orders of magnitude
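
A minimal Python sketch of the word-level traceback bookkeeping described above (class and variable names are assumptions, not the IBM decoder's):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class WordToken:            # permanent; created only at active word-ends
        word: str
        end_frame: int
        prev: Optional["WordToken"]

    @dataclass
    class AlphaEntry:           # per active state: cost + originating word token
        cost: float
        word_tok: Optional[WordToken]

    # two alpha vectors reused in "flip-flop" fashion, one swap per frame
    alpha_curr = {}             # state index -> AlphaEntry
    alpha_prev = {}

    def advance_frame():
        # swap the two vectors instead of keeping a full state-level history
        global alpha_curr, alpha_prev
        alpha_prev, alpha_curr = alpha_curr, alpha_prev
        alpha_curr.clear()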

  24. Part 4: Feature-Space Gaussianization

  25. Feature space Gaussianization [Saon et al. 04]
  • Idea: transform each dimension non-linearly such that it becomes Gaussian distributed
  • Motivations:
  • Perform speaker adaptation with non-linear transforms
  • Natural form of non-linear speaker adaptive training (SAT)
  • Effort of modeling the output distribution with GMMs is reduced
  • The transform is given by the inverse Gaussian CDF applied to the empirical CDF (a minimal sketch follows this slide)
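
A minimal Python sketch of the per-dimension transform, using a rank-based empirical CDF (the rank-based estimate and the per-speaker grouping are assumptions about details not given on the slide):

    import numpy as np
    from scipy.stats import norm

    def gaussianize(features):
        """features: (num_frames, num_dims) array for one speaker/cluster."""
        n, d = features.shape
        out = np.empty_like(features, dtype=float)
        for j in range(d):
            ranks = np.argsort(np.argsort(features[:, j]))   # 0 .. n-1
            ecdf = (ranks + 0.5) / n                          # keep away from 0 and 1
            out[:, j] = norm.ppf(ecdf)                        # inverse Gaussian CDF
        return out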

  26. Feature Space Gaussianization, Pictorially
  [Figure: inverse Gaussian CDF (mean 0, variance 1) mapping old data values on a percentile scale (16 / 50 / 84) to new data values at -1 / 0 / +1 standard deviations, i.e. the central 68% of the data]

  27. An actual transform

  28. Feature space Gaussianization: WER
  • Results on RT'03 at the SAT level (no MLLR)

  29. Part 5: Experiments with Fisher Data

  30. Acoustic Training Data
  • Training set size based on aligned frames only
  • Total is 829 hours of speech; 486 hours excluding Fisher
  • Training vocabulary includes 61K tokens
  • First experiments with Fisher 1-4; iteration likely to improve results

  31. Effect of new Fisher data on WER
  • Systems are PLP VTLN SAT, 60-dim LDA+MLLT features
  • One-shot decoding with the IBM 2003 RT-03 LM (interpolated 4-gram) for RT-03; generic interpolated 3-gram for Superhuman
  • Fisher data in AM only – not LM

  32. Summary
  Discriminative training
  • New MPE 0.7% better than old MMI on RT'03
  • Used MMI estimate rather than ML estimate for I-smoothing with MPE (consistently gives about 0.4% improvement over standard MPE)
  LM rescoring
  • 10x reduction in static graph size – 132M → 10M
  • Useful for rescoring with adaptive LMs
  Fast Decoding
  • 10% speedup – incremental application of an absolute pruning threshold
  Gaussianization
  • 0.6% improvement on top of MPE
  • Useful on a variety of tasks (e.g. C&C in cars)
  Fisher Data
  • 1.3% improvement over last year without it (AM only)
  • Not useful in a broader context
