Large Models for Large Corpora: preliminary findings. Patrick Nguyen, Ambroise Mutel, Jean-Claude Junqua. Panasonic Speech Technology Laboratory (PSTL)
… from RT-03S workshop • Lots of data helps • Standard training can be done in reasonable time with current resources • 10k hours: coming soon • Promises: • Change the paradigm • Use data more efficiently • Keep it simple
The Dawn of a New Era? • Merely increasing the model size is insufficient • HMMs will live on • Layering above HMM classifiers will do • Change the topology of training • Large models with no impact on decoding • Data-greedy algorithms • Only meaningful with large amounts of data
Two approaches: changing topology • Greedy models • Syllable units • Same model size, but consume more info • Increase data / parameter ratio • Add linguistic info • Factorize training: increase model size • Bubble splitting (generalized SAT) • Almost no penalty in decoding • Split according to acoustics
Syllable units • Supra-segmental info • Pronunciation modeling (subword units) • Literature blames lack of data • TDT+ coverage is limited by construction (all words are in the decoding lexicon) • Better alternative to n-phones
What syllables? • NIST / Bill Fisher tsyl software (ignores ambi-syllabic phenomena) • “Planned speech” mode • Always a schwa in the lexicon (e.g. “little”) • Phone del/sub/ins + supra-segmental info is good • Syllable structure: onset + rhyme (peak + coda), alternatively body + tail (illustrative sketch below)
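As an illustrative sketch only (not the tsyl tool), the split of one syllable's phone string into onset, peak (nucleus) and coda; the vowel inventory and the example syllable taken from "abduction" are assumptions:

# Illustrative sketch: split one syllable (list of phones) into
# onset / peak (nucleus) / coda.  The vowel set and the example are
# assumptions, not output of the NIST tsyl tool.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er",
          "ey", "ih", "iy", "ow", "oy", "uh", "uw"}

def split_syllable(phones):
    """Return (onset, peak, coda) for one syllable's phone list."""
    for i, p in enumerate(phones):
        if p in VOWELS:
            return phones[:i], [p], phones[i + 1:]
    return phones, [], []          # no vowel: treat everything as onset

onset, peak, coda = split_syllable(["d", "ah", "k"])   # "duc" in ab-duc-tion
print(onset, peak, coda)           # ['d'] ['ah'] ['k']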
Syllable units: facts • Hybrid (syl + phones) • [Ganapathiraju, Goel, Picone, Corrada, Doddington, Kirchhoff, Ordowski, Wheatley; 1997] • Seeding with CD-phones works • [Sethy, Narayanan; 2003] • State tying works [PSTL] • Position-dependent syllables work [PSTL] • CD-syllables kind of works [PSTL] • ROVER should work • [Wu, Kingsbury, Morgan, Greenberg; 1998] • Full embedded re-estimation does not work [PSTL]
Coverage and large corpus • Warning: biased by construction • In the lexicon: 15k syllables • About 15M syllable tokens total (10M words) • Total: 1600h / 950h filtered • [Figure: per-syllable example counts (1, 14, and 127 examples annotated)]
Hybrid: backing off • Cannot train all syllables => back off to a phone sequence • “Context breaks”: the context-dependency chain breaks at the backed-off syllable • Two kinds of back-off (example word: “abduction”) • True sequence: ae_b d_ah_k sh_ih_n • [Sethy+]: ae_b d_ah_k sh+ih sh-ih+n ih-n • [Doddington+]: ae_b d_ah_k k-sh+ih sh-ih+n ih-n • Tricky • Not a concern here: with state tying we almost never back off (sketch below)
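A minimal sketch, assuming a hypothetical training-count threshold and helper name (this is not PSTL's implementation): an under-trained syllable is backed off to context-dependent phones, with "??" marking the broken context at the syllable edge:

# Minimal sketch only: MIN_EXAMPLES and the function name are assumptions.
# An under-trained syllable unit is replaced by its phone sequence; the
# contexts at the syllable edges are unknown ("??").
MIN_EXAMPLES = 100

def expand_unit(syllable, count, left_phone=None, right_phone=None):
    """Return the whole-syllable unit, or a backed-off triphone list."""
    if count >= MIN_EXAMPLES:
        return [syllable]                      # enough data: keep the syllable
    phones = syllable.split("_")               # e.g. "sh_ih_n" -> [sh, ih, n]
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else left_phone
        right = phones[i + 1] if i < len(phones) - 1 else right_phone
        units.append(f"{left or '??'}-{p}+{right or '??'}")
    return units

print(expand_unit("sh_ih_n", count=12))
# ['??-sh+ih', 'sh-ih+n', 'ih-n+??']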
Seeding • Copy CD models instead of flat-starting • Problem at the syllable boundary (context break) • CI < seeded syllables < CD • Imposes constraints on topology • Example (unknown cross-boundary contexts): ??-sh+ih sh-ih+n ih-n+??
Seeding (results) • Mono-Gaussian models • Trend continues even with iterative split • CI: 69% WER • CD: 26% WER • Syl-flat: 41% WER • Syl-seed: 31% WER (CD init)
State-tying • Data-driven approach • Backing-off is a problem => train all syllables • too many states/distributions • too little data (skewed distribution) • Same strategy as CD-phone: entropy merge • Can add info (pos, CD) w/o worrying about explosion in # of states
State-tying (2) • Compression w/o performance loss • Phone-internal, state-internal bottom-up merge to limit computation • Count about 10 states per syllable (3.3 phones) • Pos-dep CI syllables • 6000-syllable model, 59k Gaussians: 38.7% WER • Merged to 6k states (6k Gaussians): 38.6% WER • Trend continues with iterative split (merge-criterion sketch below)
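For illustration only, a sketch of an entropy-based bottom-up merge of the kind described above, assuming discrete state distributions with occupancy counts; real systems pool Gaussian statistics, and PSTL restricts the merge to phone-internal, state-internal pairs, neither of which is shown here:

# Hedged sketch: greedy bottom-up merge driven by the weighted entropy
# increase of pooling two state distributions.  States are (distribution,
# occupancy count) pairs; all names are assumptions.
import math
from itertools import combinations

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def merge_cost(s1, s2):
    """Weighted entropy increase caused by pooling two states."""
    (p1, n1), (p2, n2) = s1, s2
    pooled = [(n1 * a + n2 * b) / (n1 + n2) for a, b in zip(p1, p2)]
    return (n1 + n2) * entropy(pooled) - n1 * entropy(p1) - n2 * entropy(p2)

def tie_states(states, target):
    """Greedily merge the cheapest pair until `target` states remain."""
    states = list(states)
    while len(states) > target:
        i, j = min(combinations(range(len(states)), 2),
                   key=lambda ij: merge_cost(states[ij[0]], states[ij[1]]))
        (p1, n1), (p2, n2) = states[i], states[j]
        merged = ([(n1 * a + n2 * b) / (n1 + n2) for a, b in zip(p1, p2)],
                  n1 + n2)
        states = [s for k, s in enumerate(states) if k not in (i, j)] + [merged]
    return states

# Toy example: three states tied down to two
tied = tie_states([([0.7, 0.2, 0.1], 50),
                   ([0.6, 0.3, 0.1], 40),
                   ([0.1, 0.2, 0.7], 60)], target=2)
print(len(tied))    # 2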
Position-dependent syllables • Word-boundary info (cf. triphones) • Examples: • Worthiness: _w_er dh_iy n_ih_s • The: _dh_iy_ • Missing from [Sethy+] and [Dod.+] • Results (about 3% absolute gain at every split): • Pos-indep (6k syl): 39.2% WER (2dps) • Pos-dep: 35.7% WER (2dps)
CD-Syllables • Inter-syllable context breaks • Context = next phone • Next syl? Next vowel? (Nucleus/peak) • CD(phone)-syl >= CD triphones • Small gains! • All GD results • CI-syl (6k syl): 19.0% WER • CD-syl (20k syl): 18.5% WER • CD-phones: 18.9% WER
Segmentation • Word and Subword units give poor segmentation • Speaker-adapted overgrown CD-phones are always better • Problem for: MMI and adaptation • Results: (ML) • Word-internal: 21.8% WER • Syl-internal: 19.9% WER
MMI/ADP didn’t work well • MMI: time-constrained to +/- 3ms within word boundary • Blame it on the segmentation (Word-int)
ROVER • Two different systems can be combined: syllable models and adapted CD-phones • Two-pass “transatlantic” ROVER architecture • CD-phones align, phonetic classes • No gain (broken confidence), deletions • MMI+adp: 15.5% (CDp) and 16.0% (SY) • Best ROVER: 15.5% WER (4-pass, 2-way)
Summary: architecture • [Architecture diagram: CD-phones; POS-CI syllables → Merged (6k) → POS-CD syllables → GD / MMI → Merged (3k); Decode; Adapt+decode; ROVER]
Conclusion (Syllable) • Observed similar effects as literature • Added some observations (state tying, CD, pos, ADP/MMI) • Performance does not beat CD-phones yet • CD phones: 15.5% WER ; syl: 16.0% WER • Some assumptions might cancel the benefit of syllable modeling
Open questions • Is syllabification (grouping) better than random grouping? Is the syllable the right unit? • Planned vs. spontaneous speech? • Did we oversimplify? • Why do subword units resist auto-segmentation? • Why didn’t CD-syl work better? • Language-dependent effects
Bubble Splitting • Outgrowth of SAT • Increase model size 15-fold w/o computational penalty in train/decode • Also covers the VTLN implementation • Basic idea: • Split training into locally homogeneous regions (bubbles), and then apply SAT
SAT vs Bubble Splitting • SAT relies on locally linearly compactable variabilities • Each bubble has local variability • Simple acoustic factorization • [Diagram: Bubble, SAT, Adaptation (MLLR)]
TDT and speaker labels • TDT is not speaker-labeled • Hub4 has 2400 named speakers • Use decoding clustering (show-internal clusters) • Males: 33k speakers • Females: 18k speakers • Shows: 2137 (TDT) + 288 (Hub4)
Bubble-Splitting: Overview • [Flow diagram: input speech → male / female → normalize → split → adapt → maximum-likelihood multiplex → decoded words; Compact Bubble Models (CBM); training data: TDT]
VTLN implementation • VTLN is used for clustering • VTLN is a linear feature transformation (almost) • Finding the best warp
VTLN: Linear equivalence • According to [Pitz, ICSLP 2000], VTLN is equivalent to a linear transformation in the cepstral domain • The relationship between a cepstral coefficient ck and its warped (stretched or compressed) counterpart is linear (reconstructed form below) • The authors did not take the Mel scale into account; there is no closed-form solution in that case • Energy, filter banks, and cepstral liftering imply non-linear effects
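A reconstructed general form (notation assumed) of the [Pitz, ICSLP 2000] relationship: each warped cepstral coefficient is a fixed linear combination of the unwarped ones, so the warp amounts to a matrix multiply on the cepstra,

\tilde{c}_k(\alpha) = \sum_{n} A_{kn}(\alpha)\, c_n , \qquad \text{i.e.} \quad \tilde{\mathbf{c}} = A(\alpha)\,\mathbf{c} ,

where A(α) depends only on the warp factor α, not on the speech.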
VTLN is linear • Decoding algorithm: decode with Ai and λ • [Diagram: input speech → decoded words] • Experimental results: (figure)
Statistical multiplex • GD-mode • Faster than Brent search • 3 times faster than exhaustive search • Based on the prior distribution of alpha • Test 0.98, 1.00, and 1.02 • If 0.98 wins, continue in that direction (search sketch below) • [Figure: N1.00 = 3 Q evaluations; N0.98 = N1.02 = 4 Q evaluations]
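A hedged sketch of such a prior-guided search; score_warp (the Q likelihood of the data under a candidate warp), the step size, and the warp range are assumptions:

# Sketch only: evaluate Q at 0.98 / 1.00 / 1.02, then keep stepping in the
# winning direction until the score stops improving.  `score_warp`, the
# step, and the [lo, hi] range are assumptions.
def find_best_warp(score_warp, step=0.02, lo=0.88, hi=1.12):
    """Return the warp factor with the highest score."""
    warps = [1.00 - step, 1.00, 1.00 + step]
    scores = {a: score_warp(a) for a in warps}          # 3 Q evaluations
    best = max(scores, key=scores.get)
    direction = -step if best < 1.00 else (step if best > 1.00 else 0.0)
    while direction:
        nxt = round(best + direction, 2)
        if not lo <= nxt <= hi:
            break
        scores[nxt] = score_warp(nxt)                   # one more Q evaluation
        if scores[nxt] <= scores[best]:
            break                                       # passed the peak
        best = nxt
    return best

# Toy usage: a score peaked near 0.96
print(find_best_warp(lambda a: -(a - 0.96) ** 2))       # 0.96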
Bubble Splitting: Principle • Separate conditions • VTLN • Train bubble model • Compact using SAT • Feature-space SAT • SAT works on homogeneous conditions (split sketch below) • [Diagram: training speakers grouped into bubbles Bi; partial center: SAT model λi]
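As a minimal illustration of the split itself, assuming one bubble per VTLN warp (one of the splitting strategies listed under the open questions) and hypothetical speaker labels; training the bubble models and the SAT compaction are not shown:

# Minimal sketch: group training speakers into bubbles keyed by their VTLN
# warp factor.  Speaker ids and warps are hypothetical.
from collections import defaultdict

def split_into_bubbles(speaker_warps):
    """Map each VTLN warp to the list of speakers assigned to that bubble."""
    bubbles = defaultdict(list)
    for speaker, warp in speaker_warps.items():
        bubbles[warp].append(speaker)
    return dict(bubbles)

print(split_into_bubbles({"spk01": 0.98, "spk02": 1.02,
                          "spk03": 0.98, "spk04": 1.00}))
# {0.98: ['spk01', 'spk03'], 1.02: ['spk02'], 1.0: ['spk04']}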
Results • About 0.5% WER reduction • Doubling the model size => 0.3% WER
Conclusion (Bubble) • Gain: 0.5% WER • Extension of SAT model compaction • VTLN implementation more efficient
Open questions • Baseline SAT does not work? • Speaker definition? • Best splitting strategy? (One per warp) • Best decoding strategy? (Closest warp) • Best bubble training? (MAP/MLLR) • MMIE
Conclusion • What do we do with all of these data? • Syllable + bubble splitting • Two narrowly explored paths among many • Promising results but nothing breathtaking • Not ambitious enough?
System setup • RT03 eval • 6x RT • Same parameters as the RT-03S eval system • WI triphones, gender-dependent, MMI • 2-pass • Global MLLU + 7-class MLLR • 39 MFCC + non-causal CMS (2 s) • 192k Gaussians, 3400 mixtures • 128 Gaussians / mixture => merged