Articulatory Feature-Based Speech Recognition

JHU WS06 Final team presentation, August 17, 2006 [Figure: multistream DBN with variables word; ind1, ind2, ind3; U1, U2, U3; S1, S2, S3; sync1,2 and sync2,3]

Project Participants Team members: Karen Livescu (MIT) Arthur Kantor (UIUC) Özgür Çetin (ICSI) Partha Lal (Edinburgh) Mark Hasegawa-Johnson (UIUC) Lisa Yung (JHU) Simon King (Edinburgh) Ari Bezman (Dartmouth) Nash Borges (DoD, JHU) Stephen Dawson-Haggerty (Harvard) Chris Bartels (UW) Bronwyn Woods (Swarthmore) Satellite members/advisors: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Ghinwa Choueiter (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Erik McDermott (NTT), Daryush Mehta (MIT), Florian Metze (Deutsche Telekom), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)

Why are we here? • Why articulatory feature-based ASR? • Improved modeling of co-articulation • Application to audio-visual and multilingual ASR • Potential savings in training data • Compatibility with more recent theories of phonology (autosegmental phonology, articulatory phonology) • Improved ASR performance with feature-based observation models in some conditions [e.g. Kirchhoff ‘02, Soltau et al. ‘02] • Improved lexical access in experiments with oracle feature transcriptions [Livescu & Glass ’04, Livescu ‘05] • Why now? • A number of sites working on complementary aspects of this idea: U. Edinburgh (King et al.), UIUC (Hasegawa-Johnson et al.), MIT (Livescu et al.) • Recently developed tools (e.g. GMTK) for systematic exploration of the model space

A brief history • Many have argued for replacing the single phone stream with multiple sub-phonetic feature streams [Rose et al. ‘95, Ostendorf ‘99, ‘00, Nock ‘00, ‘02, Niyogi et al. ’99] • Many have worked on parts of the problem • AF classification/recognition [Kirchhoff, King, Frankel, Wester, Richmond, Hasegawa-Johnson, Borys, Metze, Fosler-Lussier, Greenberg, Chang, Saenko, ...] • Pronunciation modeling [Livescu & Glass, Bates] • Many have combined AF classifiers with phone-based recognizers [Kirchhoff, King, Metze, Soltau, ...] • Some have built HMMs by combining AF states into product states [Deng et al., Richardson and Bilmes] • Only very recently, work has begun on end-to-end recognition with multiple streams of AF states [Hasegawa-Johnson et al. ‘04, Livescu ’05] • No prior work on AF-based models for AVSR

A (partial) taxonomy of design issues [Figure: decision tree branching on: factored state (multistream structure)?; state asynchrony (free within unit, soft asynchrony within word, cross-word soft asynchrony, coupled state transitions); factored obs model?; obs model (GM, SVM, NN); CD (Y/N). Cells cite [Deng ’97, Richardson ’00], [Kirchhoff ’96, Wester et al. ‘04], [Metze ’02], [Kirchhoff ’02], [Juneja ’04], [Livescu ‘04], [Livescu ’05], [WS04], FHMMs, and CHMMs; many cells are marked ???] (Not to mention choice of feature sets... same in pronunciation and observation models?)

Definitions: Pronunciation and observation modeling • Language model P(w): w = “makes sense...” • Pronunciation model P(q|w): q = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ] • Observation model P(o|q): o = [figure: acoustic observations]
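The three models on this slide combine into the usual decoding criterion. The following is a sketch in the slide's notation, assuming the observations o depend on the word sequence w only through the state sequence q; in practice the sum over q is often replaced by a max (the Viterbi approximation).

```latex
w^{*} \;=\; \arg\max_{w} P(w \mid o)
      \;=\; \arg\max_{w} \; P(w) \sum_{q} P(q \mid w)\, P(o \mid q)
```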

Project goals: Building complete AF-based recognizers and understanding the design issues involved. A world of areas to explore... • Comparisons of • Observation models: Gaussian mixtures over acoustic features, hybrid models [Morgan & Bourlard 1995], tandem models [Ellis et al. 2001] • Pronunciation models: Articulatory asynchrony and substitution models • Analysis of articulatory phenomena: Dependence on context, speaker, speaking rate, speaking style, ... • Application of AFSR to audio-visual speech recognition • All require some resources... • Feature sets • Manual and automatic AF alignments • Tools

That was the vision... At WS06, we focused on... • AF-based observation models in the context of phone-based recognizers • AF-based pronunciation models with Gaussian mixture-based observation models • AF-based audio-visual speech recognition • Resources • Feature sets • Manual AF alignments • Tools: tying, visualization. We did not focus on... • Combination of AF-based pronunciation models with different observation models • Analysis of feature alignment data

Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines (Karen, Simon) • Hybrid observation models (Simon) • Tandem observation models (Özgür, Arthur) • Multistream AF-based pronunciation models (Karen, Chris, Nash, Lisa, Bronwyn) • AF-based audio-visual speech recognition (Mark, Partha) • Analysis (Nash, Lisa, Ari) • BREAK • Structure learning (Steve) • Student proposals (Arthur, Chris, Partha, Bronwyn?) • Summary, conclusions, future work (Karen)


Dynamic Bayesian networks (DBNs) • BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times • Useful for modeling time series (e.g. speech!) [Figure: a BN over variables A, B, C, D repeated in frame i-1, frame i, frame i+1]

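Representations of HMMs as DBNs • FSN: states q = 1, 2, 3 with self-loop probabilities .7, .8, 1 and forward transition probabilities .3, .2 • DBN: variables Qi-1, Qi, Qi+1 with observations obsi-1, obsi, obsi+1 in frame i-1, frame i, frame i+1; distributions P(qi|qi-1) and P(obsi | qi) • P(qi|qi-1), with rows qi-1 = 1, 2, 3 and columns qi = 1, 2, 3: [ .7 .3 0 ; 0 .8 .2 ; 0 0 1 ] • Notation: in the DBN, nodes are variables and edges are dependencies; in the FSN, nodes are states and edges are allowed transitions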
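The two views of the same HMM can be made concrete with the transition probabilities shown on this slide. The sketch below is only illustrative: the helper names `allowed_transitions` and `unroll` are hypothetical, and starting the chain in state 1 is an assumption not stated on the slide.

```python
# Transition probabilities P(q_i | q_{i-1}) read off the slide's 3-state
# left-to-right HMM (rows: previous state, columns: current state).
A = [
    [0.7, 0.3, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
]


def allowed_transitions(trans):
    """FSN view: arcs are the (from, to) pairs with nonzero probability (1-based)."""
    return [
        (i + 1, j + 1)
        for i, row in enumerate(trans)
        for j, p in enumerate(row)
        if p > 0
    ]


def unroll(trans, start, n_frames):
    """DBN view: the same table is reused in every frame to get P(Q_i)."""
    n = len(trans)
    dist = list(start)
    history = [dist]
    for _ in range(n_frames - 1):
        dist = [sum(dist[i] * trans[i][j] for i in range(n)) for j in range(n)]
        history.append(dist)
    return history


arcs = allowed_transitions(A)
# Assumption: the chain starts in state 1 with probability 1.
marginals = unroll(A, start=[1.0, 0.0, 0.0], n_frames=3)
```

The FSN draws one node per state value, while the DBN draws one node per variable per frame and stores the whole matrix `A` as that variable's conditional probability table.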

A phone HMM-based recognizer [Figure: DBN unrolled over frame 0, ..., frame i, ..., last frame] • Variables and their values: • word: {“one”, “two”, ...} • wordTransition: {0,1} • subWordState: {0,1,2,...} • stateTransition: {0,1} • phoneState: {w1, w2, w3, s1, s2, s3, ...} • observation
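One way to read the variables on this slide is as bookkeeping around a per-word sequence of phone states. The sketch below is a plain-Python illustration of that reading, not GMTK code: the two-word `LEXICON` and all function names are hypothetical, and the exact deterministic relationships in the workshop models may differ.

```python
# Hypothetical lexicon: each word is a fixed sequence of phone states.
LEXICON = {
    "one": ["w1", "w2", "w3", "ah1", "ah2", "ah3", "n1", "n2", "n3"],
    "two": ["t1", "t2", "t3", "uw1", "uw2", "uw3"],
}


def phone_state(word, sub_word_state):
    """phoneState is determined by the word and the position within it."""
    return LEXICON[word][sub_word_state]


def word_transition(word, sub_word_state, state_transition):
    """wordTransition is 1 only when leaving the word's last sub-word state."""
    last = len(LEXICON[word]) - 1
    return int(state_transition == 1 and sub_word_state == last)


def next_sub_word_state(sub_word_state, state_transition, word_trans):
    """subWordState resets at a word boundary, otherwise advances on a state transition."""
    if word_trans:
        return 0
    return sub_word_state + state_transition


def trace(word, state_transitions):
    """Follow one word through a given sequence of stateTransition values."""
    sub = 0
    frames = []
    for st in state_transitions:
        wt = word_transition(word, sub, st)
        frames.append((sub, phone_state(word, sub), wt))
        sub = next_sub_word_state(sub, st, wt)
    return frames


frames = trace("two", [0, 1, 1, 1, 1, 1, 1])
```

Only stateTransition is random in this sketch; everything else follows from it, which is what keeps the model equivalent to an ordinary phone HMM.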

Inference • Definition: Computation of the probability of one subset of the variables given another subset • Inference is a subroutine of: • Viterbi decoding: q* = argmax_q p(q|obs) • Maximum-likelihood parameter estimation: θ* = argmax_θ p(obs|θ) • For WS06, all models were implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes 2002]
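For a plain HMM, Viterbi decoding reduces to a short dynamic program. The sketch below is not how GMTK performs inference on general DBNs; it is a minimal stand-alone illustration that reuses the 3-state transition matrix from the HMM slide, with a hypothetical discrete emission table `B` and an assumed start in state 1.

```python
def viterbi(obs, start, trans, emit):
    """Return the most likely state sequence (1-based) for a discrete-output HMM."""
    n = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n)]
    backpointers = []
    for o in obs[1:]:
        prev = delta
        # Best predecessor for each current state j.
        ptr = [max(range(n), key=lambda i: prev[i] * trans[i][j]) for j in range(n)]
        delta = [prev[ptr[j]] * trans[ptr[j]][j] * emit[j][o] for j in range(n)]
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]


A = [[0.7, 0.3, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]
# Hypothetical emission probabilities over three symbols.
B = [
    {"a": 0.8, "b": 0.1, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.1, "b": 0.1, "c": 0.8},
]
best_path = viterbi(["a", "a", "b", "c"], [1.0, 0.0, 0.0], A, B)
```

For long utterances a real implementation would work with log probabilities to avoid underflow; plain products are used here only to keep the sketch short.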

Articulatory feature sets • We use separate feature sets for pronunciation and observation modeling • Why? • For observation modeling, want features that are acoustically distinguishable • For pronunciation modeling, want features that can be modeled as independent streams

Feature set for pronunciation modeling • Based on articulatory phonology [Browman & Goldstein ‘90], adapted for pronunciation modeling [Livescu ’05] • Features: LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS • Under some simplifying assumptions, can combine into 3 streams

Feature set for observation modeling

Manual feature transcriptions • Purpose: Testing of AF classifiers, automatic alignments, NOT training • Main transcription guideline: Should correspond to what we would like our AF classifiers to detect

Manual feature transcriptions • Main transcription guideline: The output should correspond to what we would like our AF classifiers to detect • Details • 2 transcribers: phonetician (Lisa Lavoie), PhD student in speech group (Xuemin Chi) • 78 SVitchboard utterances • 9 utterances from Switchboard Transcription Project for comparison • Multipass transcription using WaveSurfer (KTH) • 1st pass: Phone-feature hybrid • 2nd pass: All-feature • 3rd pass: Discussion, error-correction • Some basic statistics • Overall speed ~1000 x real-time • High inter-transcriber agreement (93% avg. agreement, 85% avg. string accuracy) • First use to date of human-labeled articulatory data for classifier/recognizer testing

SIMON: SVitchboard, baselines, gmtkTie MLPs, hybrid models

OZGUR & ARTHUR: Tandem models intro Our models & results

KAREN, CHRIS, NASH, LISA, BRONWYN: Multistream AF-based pronunciation models

Multi-stream AF-based pronunciation models • Phone-based: q (phonetic state), o (observation vector) • AF-based: qi (state of AF i), o (obs vector)

Motivation: Pronunciation variation [From data of Greenberg et al. ‘96] • sense: baseform s eh n s; surface (actual): (1) s eh n t s, (1) s ih t s • everybody: baseform eh v r iy b ah d iy; surface (actual): (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy • don’t: baseform d ow n t; surface (actual): (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (3) n ax, (2) d ax n, (2) ax, (1) n uw, ... • probably: baseform p r aa b ax b l iy; surface (actual): (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy

Pronunciation variation and ASR performance • Automatic speech recognition (ASR) is strongly affected by pronunciation variation • Words produced non-canonically are more likely to be mis-recognized [Fosler-Lussier ‘99] • Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. ‘96]

Phone-based pronunciation modeling • Address pronunciation variation issue by substituting, inserting, or deleting segments: e.g. a [t] insertion rule maps the dictionary form sense / s eh n s / to [ s eh n t s ] • Suffer from low coverage of conversational pronunciations, sparse data, and increased inter-word confusability • Partial changes are not well described [Saraclar et al. ‘03]
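The segment-level view of pronunciation variation can be illustrated by aligning a baseform with a surface form and reading off the substitutions, insertions, and deletions. This is a generic Levenshtein alignment over phone strings, offered as a sketch; the function name `align` and the unit edit costs are assumptions, not something taken from the workshop systems.

```python
def align(base, surface):
    """Minimum-edit alignment of two phone sequences; returns the list of edits."""
    n, m = len(base), len(surface)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(base[i - 1] != surface[j - 1])
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )
    # Trace back from the end to recover the edit operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + int(
            base[i - 1] != surface[j - 1]
        ):
            if base[i - 1] != surface[j - 1]:
                ops.append(("sub", base[i - 1], surface[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            ops.append(("ins", None, surface[j - 1]))
            j -= 1
        else:
            ops.append(("del", base[i - 1], None))
            i -= 1
    ops.reverse()
    return ops


baseform = "s eh n s".split()
variant_1 = align(baseform, "s eh n t s".split())
variant_2 = align(baseform, "s ih t s".split())
```

For the two surface forms of "sense" listed earlier, this yields a single [t] insertion for the first and two whole-segment substitutions for the second, which is exactly the kind of coarse description that cannot express partial changes.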

Revisiting examples [Tables: feature values over time for “sense”] • dictionary: phone s eh n s; GLO: open, critical, open; VEL: closed, open, closed; TB: mid / uvular, mid / palatal, mid / uvular; TT: critical / alveolar, mid / alveolar, closed / alveolar, critical / alveolar • surface variant #1: phone s eh n t s; GLO: open, critical, open; VEL: closed, open, closed; TB: mid / uvular, mid / palatal, mid / uvular; TT: critical / alveolar, mid / alveolar, closed / alveolar, critical / alveolar • surface variant #2: phone s ih t s n; GLO: open, critical, open; VEL: closed, open, closed; TB: mid / uvular, mid-nar / palatal, mid / uvular; TT: critical / alveolar, mid-nar / alveolar, closed / alveolar, critical / alveolar

A more complex example everybody [ eh r uw ay ] (INSERT DIFFERENT EXAMPLE USING ARI’S TOOL)

Can we take advantage of these intuitions? • In lexical access experiments with oracle feature alignments, yes: • Lexical access accuracy improves from ?? to ?? using articulatory model with asynchrony and context-independent substitutions [Livescu & Glass ’04] • Scaling up to a complete recognizer—issues: • Computational complexity • Noisy observations

Reminder: phone-based model [Figure: DBN unrolled over frame 0, ..., frame i, ..., last frame] • Variables and their values: • word: {“one”, “two”, ...} • wordTransition: {0,1} • subWordState: {0,1,2,...} • stateTransition: {0,1} • phoneState: {w1, w2, w3, s1, s2, s3, ...} • observation (Note: missing pronunciation variants)

Multistream pronunciation models [Figure: DBN with shared variables word, wordTransition, async; L stream: wordTransitionL, subWordStateL, stateTransitionL, phoneStateL, L; T stream: wordTransitionT, subWordStateT, stateTransitionT, phoneStateT, T] (differences from actual model: 3rd feature stream, pronunciation variants, word transition bookkeeping)

A first attempt: 1-state monofeat • Analogous to 1-state monophone with minimum duration of 3 frames • All three states of each phone map to the same feature values • INSERT PART OF phoneState2feat TABLE HERE • One state of asynchrony allowed between L and T, and between G and {L,T}
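The asynchrony constraint in the last bullet can be sketched as a test on the streams' sub-word state indices. This is an illustration under stated assumptions: the function name `allowed` is hypothetical, and reading "between G and {L,T}" as the distance between G's index and the mean of the L and T indices is one plausible interpretation, not something the slide specifies.

```python
from itertools import product


def allowed(i_l, i_t, i_g, max_async=1):
    """True if the three stream positions are within the allowed asynchrony.

    Assumption: asynchrony between G and {L, T} is measured against the
    mean of the L and T indices.
    """
    within_lt = abs(i_l - i_t) <= max_async
    within_g = abs(i_g - (i_l + i_t) / 2) <= max_async
    return within_lt and within_g


# Enumerate the joint configurations that survive for a 3-state word.
n_states = 3
configs = [c for c in product(range(n_states), repeat=3) if allowed(*c)]
```

Under this reading, 15 of the 27 joint index configurations remain for a 3-state word, which gives a feel for how much the constraint shrinks the joint state space relative to fully free streams.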

A first attempt: 1-state monofeat • (INSERT EXAMPLE USING ARI’s TOOL)

Results: 1-state monofeat • Much higher WER than monophone—possible remedies • Improved modeling with the same structure—concerted effort at WS06 • Alternative structures—begun to explore • Cross-word asynchrony • Context-dependent asynchrony • Substitutions

CHRIS, NASH, LISA, BRONWYN: Multistream AF-based pronunciation models

Improving the Model • Design Challenges • Multiple states per feature • Initialization • Tying • Silence synchronization

Design Challenges • Optimal parameters vary widely for different models • Number of components • Language model scale • Language model penalty

Design Challenges • Experimentation time grows with model complexity • Adding features to monofeat graph:

3-State Monofeat • Monofeat usually maps all states of a phone to the same feature value

3-State Monofeat • 3-state monofeat assigns a unique feature value to each phone state

3-State Monofeat • Why? • Forces a sequence of states • Models context
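The difference between the two mappings can be sketched as two lookup functions from a phone state to a feature value. Everything here is illustrative: the `PHONE_TO_FEAT` entries are a hypothetical fragment (values borrowed from the "sense" example), and the naming of the 3-state values by appending the state index is an assumption about how unique values might be formed.

```python
# Hypothetical fragment of a phone-to-feature table.
PHONE_TO_FEAT = {
    "s": {"TT": "critical/alveolar", "GLO": "open"},
    "n": {"TT": "closed/alveolar", "GLO": "critical"},
}


def split_phone_state(phone_state):
    """Split e.g. 's2' into the phone 's' and the state index 2."""
    phone = phone_state.rstrip("0123456789")
    return phone, int(phone_state[len(phone):])


def monofeat_1state(phone_state, feature):
    """1-state monofeat: all states of a phone share one feature value."""
    phone, _ = split_phone_state(phone_state)
    return PHONE_TO_FEAT[phone][feature]


def monofeat_3state(phone_state, feature):
    """3-state monofeat: each phone state gets its own feature value."""
    phone, index = split_phone_state(phone_state)
    return f"{PHONE_TO_FEAT[phone][feature]}_{index}"


shared = {monofeat_1state(s, "TT") for s in ("s1", "s2", "s3")}
distinct = {monofeat_3state(s, "TT") for s in ("s1", "s2", "s3")}
```

With distinct values per state, each feature stream has to pass through its begin, middle, and end values in order, which is the sense in which the 3-state variant forces a sequence of states.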

3-State Monofeat

Initialization • Problem: • Low occupancy leads to many poorly trained Gaussians • Potential solution: • Better initialization • Train through 8 components per mixture with asynchrony parameters clamped at: • p(synchronous)=0.5, p(asynchronous)=0.5 • p(synchronous)=0.6, p(asynchronous)=0.4

Initialization [Figure panels: Original initialization; 0.5/0.5 initialization]

Initialization

Tying • Problem: • Low occupancy leads to many poorly trained Gaussians • Potential Solution • Parameter tying • Uses gmtkTie
