
Large Models for Large Corpora: preliminary findings


Presentation Transcript


  1. Large Models for Large Corpora: preliminary findings. Patrick Nguyen, Ambroise Mutel, Jean-Claude Junqua. Panasonic Speech Technology Laboratory (PSTL)

  2. … from the RT-03S workshop • Lots of data helps • Standard training can be done in reasonable time with current resources • 10k hours: it's coming soon • Promises: change the paradigm, use data more efficiently, keep it simple

  3. The Dawn of a New Era? • Merely increasing the model size is insufficient • HMMs will live on • Layering above HMM classifiers will do • Change the topology of training • Large models with no impact on decoding • Data-greedy algorithms: only meaningful with large amounts of data

  4. Two approaches: changing topology • Greedy models • Syllable units • Same model size, but consume more info • Increase data / parameter ratio • Add linguistic info • Factorize training: increase model size • Bubble splitting (generalized SAT) • Almost no penalty in decoding • Split according to acoustics

  5. Syllable units • Supra-segmental info • Pronunciation modeling (subword units) • Literature blames lack of data • TDT+ coverage is limited by construction (all words are in the decoding lexicon) • Better alternative to n-phones

  6. What syllables? • NIST / Bill Fisher tsyl software (ignores ambi-syllabic phenomena) • “Planned speech” mode • Always a schwa in the lexicon (e.g. little) • Phone del/sub/ins + supra-segmental info is good • [Diagram: syllable structure: onset + rhyme, rhyme = peak + coda; alternative split into body and tail]
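For illustration, the sketch below groups an ARPAbet-style phone string into onset / peak (nucleus) / coda triples with a maximal-onset rule. It is not the NIST tsyl tool used in the talk; the vowel set and the legal-onset whitelist are assumptions chosen just to make the example run.

```python
# Simplified syllabifier sketch: NOT the NIST tsyl software, just a
# maximal-onset illustration over ARPAbet-style phones.

VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}

# Hypothetical whitelist of clusters allowed to start a syllable.
LEGAL_ONSETS = {(), ("b",), ("d",), ("k",), ("l",), ("n",), ("sh",),
                ("s", "t"), ("s", "t", "r"), ("t", "r")}

def syllabify(phones):
    """Group a phone sequence into (onset, peak, coda) triples."""
    phones = [p.rstrip("012") for p in phones]           # drop stress digits
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    syllables, start = [], 0
    for j, n in enumerate(nuclei):
        end = nuclei[j + 1] if j + 1 < len(nuclei) else len(phones)
        cluster = phones[n + 1:end]                      # consonants after the peak
        split = len(cluster)                             # last syllable keeps them all
        if j + 1 < len(nuclei):
            # Maximal onset: hand the next syllable the longest legal onset.
            for k in range(len(cluster) + 1):
                if tuple(cluster[k:]) in LEGAL_ONSETS:
                    split = k
                    break
        syllables.append((phones[start:n], phones[n], cluster[:split]))
        start = n + 1 + split
    return syllables

# "abduction" -> ae_b  d_ah_k  sh_ih_n, matching the example on slide 10.
print(syllabify("ae b d ah k sh ih n".split()))
```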

  7. Syllable units: facts • Hybrid (syl + phones) • [Ganapathiraju, Goel, Picone, Corrada, Doddington, Kirchhoff, Ordowski, Wheatley; 1997] • Seeding with CD-phones works • [Sethy, Narayanan; 2003] • State tying works [PSTL] • Position-dependent syllables work [PSTL] • CD-syllables kind of work [PSTL] • ROVER should work • [Wu, Kingsbury, Morgan, Greenberg; 1998] • Full embedded re-estimation does not work [PSTL]

  8. Coverage and large corpus • Warning: biased by construction • In the lexicon: 15k syllables • About 15M syllable tokens in total (10M words) • Total: 1600 h / 950 h filtered • [Figure: per-syllable example counts, e.g. 1, 14 and 127 examples]

  9. Coverage

  10. Hybrid: backing off • Cannot train all syllables => back off to a phone sequence • “Context breaks”: the context-dependency chain breaks at the boundary • Two kinds of back-off (example word: “abduction”) • True sequence: ae_b d_ah_k sh_ih_n • [Sethy+]: ae_b d_ah_k sh+ih sh-ih+n ih-n • [Doddington+]: ae_b d_ah_k k-sh+ih sh-ih+n ih-n • Tricky • We don't care: with state tying we almost never back off
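The two back-off flavours above differ only in whether the first phone of the backed-off syllable keeps the last phone of the previous syllable as its left context. A small sketch under that reading; the triphone naming "left-center+right" is an assumption for illustration, not taken from the PSTL system.

```python
# Expand an untrained syllable into triphone labels, reproducing the two
# back-off styles from the slide.

def backoff_syllable(prev_phone, phones, style="sethy"):
    tris = []
    for i, p in enumerate(phones):
        if i > 0:
            left = phones[i - 1]
        elif style == "doddington":
            left = prev_phone            # keep the cross-syllable left context
        else:
            left = None                  # [Sethy+]: context break on the left
        right = phones[i + 1] if i + 1 < len(phones) else None  # break on the right
        name = p
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        tris.append(name)
    return tris

print(backoff_syllable("k", ["sh", "ih", "n"], style="sethy"))
# ['sh+ih', 'sh-ih+n', 'ih-n']
print(backoff_syllable("k", ["sh", "ih", "n"], style="doddington"))
# ['k-sh+ih', 'sh-ih+n', 'ih-n']
```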

  11. Seeding • Copy CD models instead of flat-starting • Problem at the syllable boundary (context break) • CI < seeded syllables < CD • Imposes constraints on topology • Example: ??-sh+ih sh-ih+n ih-n+??

  12. Seeding (results) • Mono-Gaussian models • Trend continues even with iterative split • CI: 69% WER • CD: 26% WER • Syl-flat: 41% WER • Syl-seed: 31% WER (CD init)

  13. State-tying • Data-driven approach • Backing-off is a problem => train all syllables • too many states/distributions • too little data (skewed distribution) • Same strategy as CD-phone: entropy merge • Can add info (pos, CD) w/o worrying about explosion in # of states

  14. State-tying (2) • Compression w/o performance loss • Phone-internal, state-internal bottom-up merge to limit computations • Count about 10 states per syllable (3.3 phones) • Pos-dep CI syllables • 6000-syllable model (59k Gaussians): 38.7% WER • Merged to 6k states (6k Gaussians): 38.6% WER • Trend continues with iterative split
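A minimal sketch of the bottom-up "entropy merge" idea from slides 13 and 14, assuming single diagonal-Gaussian state distributions with occupancy counts; the real system merges phone- and state-internally over tied-state distributions, so this is only a toy of the merge cost and the greedy loop.

```python
# Toy sketch of bottom-up entropy-based state tying. Assumption: each state
# is summarised by (count, mean, diagonal variance) of a single Gaussian.

import numpy as np

def merge_stats(n1, mu1, var1, n2, mu2, var2):
    """Pooled count, mean and diagonal variance of two states."""
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    var = (n1 * (var1 + (mu1 - mu) ** 2) + n2 * (var2 + (mu2 - mu) ** 2)) / n
    return n, mu, var

def merge_cost(s1, s2):
    """Increase in count-weighted Gaussian entropy caused by tying s1 and s2."""
    n, _, var = merge_stats(*s1, *s2)
    return 0.5 * (n * np.sum(np.log(var))
                  - s1[0] * np.sum(np.log(s1[2]))
                  - s2[0] * np.sum(np.log(s2[2])))

def tie_states(states, target):
    """Greedily merge the cheapest pair until only `target` states remain.

    `states` would hold, e.g., all states sharing the same centre phone and
    state index (phone- and state-internal merging keeps the pair search small).
    """
    states = list(states)
    while len(states) > target:
        pairs = [(merge_cost(states[i], states[j]), i, j)
                 for i in range(len(states)) for j in range(i + 1, len(states))]
        _, i, j = min(pairs)
        merged = merge_stats(*states[i], *states[j])
        states = [s for k, s in enumerate(states) if k not in (i, j)] + [merged]
    return states

# Toy usage: 20 random states in 4 dimensions, tied down to 5.
rng = np.random.default_rng(0)
pool = [(100.0, rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)) for _ in range(20)]
print(len(tie_states(pool, target=5)))   # -> 5
```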

  15. Position-dependent syllables • Word-boundary info (cf. triphones) • Examples: worthiness: _w_er dh_iy n_ih_s; the: _dh_iy_ • Missing from [Sethy+] and [Dod.+] • Results (about 3% absolute at every split): • Pos-indep (6k syl): 39.2% WER (2dps) • Pos-dep: 35.7% WER (2dps)
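The position tags in the examples above appear to be underscores marking word-initial and word-final syllables, analogous to word-boundary markers in word-internal triphone systems; the exact convention is not fully recoverable from the transcript, so the tiny sketch below is an assumed reading.

```python
# Assumed position-dependent naming: prefix "_" for word-initial syllables,
# suffix "_" for word-final ones (the trailing marker is an assumption).

def position_tag(word_syllables):
    """Tag a word's syllable names with word-boundary position markers."""
    tagged = []
    for i, syl in enumerate(word_syllables):
        name = syl
        if i == 0:
            name = "_" + name                      # word-initial
        if i == len(word_syllables) - 1:
            name = name + "_"                      # word-final
        tagged.append(name)
    return tagged

print(position_tag(["dh_iy"]))                     # ['_dh_iy_'], as on the slide
print(position_tag(["w_er", "dh_iy", "n_ih_s"]))   # ['_w_er', 'dh_iy', 'n_ih_s_']
```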

  16. CD-Syllables • Inter-syllable context breaks • Context = next phone • Next syl? Next vowel? (Nucleus/peak) • CD(phone)-syl >= CD triphones • Small gains! • All GD results • CI-syl (6k syl): 19.0% WER • CD-syl (20k syl): 18.5% WER • CD-phones: 18.9% WER

  17. Segmentation • Word and subword units give poor segmentation • Speaker-adapted overgrown CD-phones are always better • Problem for MMI and adaptation • Results (ML): • Word-internal: 21.8% WER • Syl-internal: 19.9% WER

  18. MMI / adaptation didn't work well • MMI: time-constrained to +/- 3 ms within word boundaries • Blame it on the segmentation (word-internal)

  19. ROVER • Two different systems can be combined • Two-pass “transatlantic” ROVER architecture • CD-phones align, phonetic classes • No gain (broken confidence), deletions • MMI + adaptation: 15.5% (CD-phones) and 16.0% (syllables) • Best ROVER: 15.5% WER (4-pass, 2-way) • [Diagram: ROVER combining syllable models with adapted CD-phones]
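As a reminder of what the combination step does, here is a minimal sketch of the ROVER voting stage, assuming the two hypotheses have already been aligned into a word transition network; the dynamic-programming alignment and the actual NIST rover tool are omitted, and all words and confidences below are invented for the example.

```python
# ROVER voting sketch over pre-aligned hypotheses. None marks a deletion for
# that system in a slot. Scoring: alpha * frequency + (1 - alpha) * max confidence.

from collections import defaultdict

def rover_vote(slots, confidences, alpha=0.5):
    """Pick one word (or a deletion) per aligned slot."""
    n_sys = len(slots[0])
    output = []
    for words, confs in zip(slots, confidences):
        count, conf = defaultdict(int), defaultdict(float)
        for w, c in zip(words, confs):
            count[w] += 1
            conf[w] = max(conf[w], c)
        best = max(count, key=lambda w: alpha * count[w] / n_sys
                                        + (1 - alpha) * conf[w])
        if best is not None:                     # None means "output nothing"
            output.append(best)
    return output

# Two-way combination of a CD-phone system and a syllable system (toy data):
cdp = ["the", "dawn", "of", "a", "new", "era"]
syl = ["the", "down", "of", None, "new", "era"]
confs = [(0.9, 0.4), (0.8, 0.5), (0.9, 0.9), (0.6, 0.0), (0.9, 0.9), (0.9, 0.9)]
print(rover_vote(list(zip(cdp, syl)), confs))    # ['the', 'dawn', 'of', 'a', 'new', 'era']
```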

  20. Summary: architecture • [Flowchart: CD-phones; POS-CI syllables; merged (6k); POS-CD syllables; GD / MMI; merged (3k); decode; adapt + decode; ROVER]

  21. Conclusion (Syllable) • Observed effects similar to the literature • Added some observations (state tying, CD, position, adaptation/MMI) • Performance does not beat CD-phones yet • CD-phones: 15.5% WER; syllables: 16.0% WER • Some assumptions might cancel the benefit of syllable modeling

  22. Open questions • Is syllabification (grouping into syllables) better than a random grouping? • Planned vs. spontaneous speech? • Did we oversimplify? • Why do subword units resist auto-segmentation? • Why didn't CD-syllables work better? • Language-dependent effects?

  23. Bubble Splitting • Outgrowth of SAT • Increase model size 15-fold w/o computational penalty in train/decode • Also covers VTLN implementation • Basic idea: split training into locally homogeneous regions (bubbles), and then apply SAT

  24. SAT vs. Bubble Splitting • SAT relies on locally linearly compactable variabilities • Each bubble has local variability • Simple acoustic factorization • [Diagram: bubble vs. SAT adaptation (MLLR)]

  25. TDT and speaker labels • TDT is not speaker-labeled • Hub4 has 2400 nominative (i.e., named) speakers • Use decoding clustering (show-internal clusters) • Males: 33k speakers • Females: 18k speakers • Shows: 2137 (TDT) + 288 (Hub4)

  26. Bubble-Splitting: Overview • [Diagram: input speech; male and female branches, each with NORMALIZE, SPLIT, ADAPT; TDT training data; Compact Bubble Models (CBM); maximum-likelihood multiplex; decoded words]

  27. VTLN implementation • VTLN is used for clustering • VTLN is a linear feature transformation (almost) • Finding the best warp

  28. VTLN: Linear equivalence • According to [Pitz, ICSLP 2000], VTLN is equivalent to a linear transformation in the cepstral domain • The relationship between a cepstral coefficient c_k and a warped one (stretched or compressed) is linear: c̃_k = Σ_n A_kn(α) c_n • The authors didn't take the Mel scale into account; there is no closed-form solution in that case • Energy, filter banks, and cepstral liftering imply non-linear effects
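The linearity can be checked numerically. The sketch below builds, for plain (non-Mel) cepstra obtained by a DCT of log-spectrum samples, a single matrix A(α) that applies a piecewise-linear warp directly to the cepstra. The DCT size, the warp convention and the clipping at the band edge are assumptions, and the Mel filter bank, energy and liftering are ignored, exactly the caveats listed on the slide.

```python
# Numerical sketch of "VTLN is (almost) linear in the cepstral domain":
# warped_cepstra ~= A(alpha) @ cepstra, with A built from a DCT and a
# spectrum-resampling (interpolation) matrix. Sizes are illustrative.

import numpy as np

def dct_matrix(n_cep, n_bins):
    """DCT-II style matrix mapping n_bins log-spectrum samples to n_cep cepstra."""
    k = np.arange(n_cep)[:, None]
    i = np.arange(n_bins)[None, :]
    return np.cos(np.pi * k * (i + 0.5) / n_bins)

def warp_matrix(n_bins, alpha):
    """Linear-interpolation matrix resampling the spectrum at omega / alpha."""
    W = np.zeros((n_bins, n_bins))
    for i in range(n_bins):
        x = min(i / alpha, n_bins - 1)           # clipped piecewise-linear warp
        lo = int(np.floor(x))
        hi = min(lo + 1, n_bins - 1)
        W[i, lo] += 1 - (x - lo)
        W[i, hi] += x - lo
    return W

def vtln_cepstral_transform(n_cep, n_bins, alpha):
    """A(alpha) such that warping is one matrix multiply per frame."""
    C = dct_matrix(n_cep, n_bins)
    return C @ warp_matrix(n_bins, alpha) @ np.linalg.pinv(C)

A = vtln_cepstral_transform(n_cep=13, n_bins=64, alpha=1.02)
cepstra = np.random.randn(13)
warped = A @ cepstra                             # apply the warp in cepstral space
print(A.shape, warped.shape)                     # (13, 13) (13,)
```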

  29. Decode with A_i and λ • VTLN is linear • Decoding algorithm: [diagram: input speech → decoded words] • Experimental results: [table not preserved in the transcript]

  30. Statistical multiplex • GD mode • Faster than a Brent search • 3 times faster than exhaustive search • Based on the prior distribution of alpha • Test 0.98, 1.00, and 1.02 • If 0.98 wins, continue in that direction • [Figure: number of Q evaluations: N(1.00) = 3; N(0.98) = N(1.02) = 4]
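A small sketch of that greedy search, under the assumption that the continuation rule is "keep stepping by 0.02 in the winning direction while the likelihood Q improves"; the scoring function, step size and warp range are placeholders, and the real system drives this from alignment likelihoods.

```python
# Greedy warp-factor search sketch: evaluate Q at {0.98, 1.00, 1.02}, then
# keep moving in the winning direction while Q improves. `score(alpha)` is a
# caller-supplied likelihood function (a fake one is used below).

def find_warp(score, step=0.02, lo=0.80, hi=1.20):
    cache = {}
    def q(alpha):
        a = round(alpha, 2)
        if a not in cache:
            cache[a] = score(a)
        return cache[a]

    best = max((1.00 - step, 1.00, 1.00 + step), key=q)   # 3 initial evaluations
    if best != 1.00:
        direction = step if best > 1.00 else -step
        while True:
            nxt = round(best + direction, 2)
            if not (lo <= nxt <= hi) or q(nxt) <= q(best):
                break
            best = nxt
    return round(best, 2), len(cache)

# Fake unimodal likelihood peaking at alpha = 0.94:
alpha, n_evals = find_warp(lambda a: -(a - 0.94) ** 2)
print(alpha, n_evals)    # 0.94 found with only a handful of Q evaluations
```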

  31. Bubble Splitting: Principle • Separate conditions • VTLN • Train bubble model • Compact using SAT • Feature-space SAT • SAT works on homogeneous conditions • [Diagram: training speakers grouped into bubble B_i around a partial center (SAT model λ_i)]
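To make the pipeline concrete, here is a heavily simplified sketch: data is partitioned into bubbles by gender and quantized warp factor (both assumed given), and each bubble is then represented by an affine feature transform toward a shared canonical model instead of a full model copy. The diagonal scale-plus-offset estimation below only stands in for the real feature-space SAT / CMLLR step, and all names and sizes are illustrative.

```python
# Simplified bubble-splitting sketch: split data into bubbles, then keep one
# canonical model plus a per-bubble feature transform ("compact bubble model").

import numpy as np
from collections import defaultdict

def split_into_bubbles(utterances):
    """utterances: list of (gender, warp, frames) with frames shaped (T, D)."""
    bubbles = defaultdict(list)
    for gender, warp, frames in utterances:
        bubbles[(gender, round(warp, 2))].append(frames)
    return {key: np.vstack(chunks) for key, chunks in bubbles.items()}

def compact_bubble_models(bubbles):
    """Canonical mean/std plus a per-bubble diagonal affine feature transform."""
    all_frames = np.vstack(list(bubbles.values()))
    canon_mu, canon_sd = all_frames.mean(0), all_frames.std(0)
    transforms = {}
    for key, frames in bubbles.items():
        mu, sd = frames.mean(0), frames.std(0)
        scale = canon_sd / sd                    # map bubble statistics onto
        offset = canon_mu - scale * mu           # the canonical model
        transforms[key] = (scale, offset)
    return (canon_mu, canon_sd), transforms

def normalize(frames, transform):
    """Apply a bubble's feature transform before decoding with the canonical model."""
    scale, offset = transform
    return frames * scale + offset

# Toy usage with random data standing in for MFCC frames:
rng = np.random.default_rng(0)
utts = [("m", 0.98, rng.normal(1.0, 2.0, (50, 4))),
        ("f", 1.04, rng.normal(-0.5, 0.5, (60, 4)))]
bubbles = split_into_bubbles(utts)
canon, transforms = compact_bubble_models(bubbles)
print(sorted(transforms))    # [('f', 1.04), ('m', 0.98)]
```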

  32. Results • About 0.5% WER reduction • Doubling the model size => 0.3% WER

  33. Conclusion (Bubble) • Gain: 0.5% WER • Extension of SAT model compaction • VTLN implementation more efficient

  34. Open questions • Baseline SAT does not work? • Speaker definition? • Best splitting strategy? (One per warp) • Best decoding strategy? (Closest warp) • Best bubble training? (MAP/MLLR) • MMIE

  35. Conclusion • What do we do with all this data? • Syllables + bubble splitting: two narrowly explored paths among many • Promising results, but nothing breathtaking • Not ambitious enough?

  36. System setup • RT03 eval • 6x RT • Same parameters as the RT-03S eval system • Word-internal (WI) triphones, gender-dependent, MMI • 2-pass • Global MLLU + 7-class MLLR • 39 MFCC + non-causal CMS (2 s) • 192k Gaussians, 3400 mixtures • 128 Gaussians / mixture => merged
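For the front-end line "39 MFCC + non-causal CMS (2 s)", here is a small sketch of non-causal (centred-window) cepstral mean subtraction; the 10 ms frame shift, and hence 200 frames per 2 s window, is an assumption about the front end that the slide does not state.

```python
# Non-causal cepstral mean subtraction: subtract the mean over a window
# centred on the current frame (assumed 200 frames ~= 2 s at a 10 ms shift).

import numpy as np

def noncausal_cms(cepstra, window_frames=200):
    """cepstra: (T, D) array of per-frame features; returns the normalised copy."""
    half = window_frames // 2
    out = np.empty_like(cepstra)
    for t in range(len(cepstra)):
        lo, hi = max(0, t - half), min(len(cepstra), t + half + 1)
        out[t] = cepstra[t] - cepstra[lo:hi].mean(axis=0)
    return out

# Toy usage: 300 frames of 39-dimensional features with a constant offset.
feats = np.random.randn(300, 39) + 5.0
normed = noncausal_cms(feats)
print(np.abs(normed.mean(axis=0)).max() < 0.5)   # roughly zero mean afterwards
```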
