Articulatory Feature-Based Speech Recognition

JHU WS06 Final team presentation, August 17, 2006
[Title-slide figure: graphical model over the variables word, ind1 to ind3, U1 to U3, sync1,2, sync2,3, and S1 to S3]

Project Participants
• Team members: Karen Livescu (MIT), Arthur Kantor (UIUC), Özgür Çetin (ICSI), Partha Lal (Edinburgh), Mark Hasegawa-Johnson (UIUC), Lisa Yung (JHU), Simon King (Edinburgh), Ari Bezman (Dartmouth), Nash Borges (DoD, JHU), Stephen Dawson-Haggerty (Harvard), Chris Bartels (UW), Bronwyn Woods (Swarthmore)
• Advisors/satellite members: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Daryush Mehta (MIT), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)

Why are we here?
• Why articulatory feature-based ASR?
  • Improved modeling of co-articulation
  • Potential savings in training data
  • Compatibility with more recent theories of phonology (autosegmental phonology, articulatory phonology)
  • Application to audio-visual and multilingual ASR
  • Improved ASR performance with feature-based observation models in some conditions [e.g. Kirchhoff ’02, Soltau et al. ’02]
  • Improved lexical access in experiments with oracle feature transcriptions [Livescu & Glass ’04, Livescu ’05]
• Why now?
  • A number of sites working on complementary aspects of this idea: U. Edinburgh (King et al.), UIUC (Hasegawa-Johnson et al.), MIT (Livescu et al.)
  • Recently developed tools (e.g. GMTK) for systematic exploration of the model space

A brief history
• Many have argued for replacing the single phone stream with multiple sub-phonetic feature streams [Rose et al., Ostendorf, Nock, Niyogi et al.]
• Many have worked on parts of the problem
  • AF classification/recognition [Kirchhoff, King, Frankel, Wester, Richmond, Hasegawa-Johnson, Borys, Metze, Fosler-Lussier, Greenberg, Chang, Saenko, ...]
  • Pronunciation modeling [Livescu & Glass, Bates]
• Some have combined AF classifiers with phone-based recognizers [Kirchhoff, King, Metze, Soltau, ...]
• Some have built HMMs by combining AF states into product states [Deng et al., Richardson and Bilmes]
• Only very recently, work has begun on end-to-end recognition with multiple streams of AF states [Hasegawa-Johnson et al., Livescu]
• No prior work on AF-based models for AVSR

A (partial) taxonomy of design issues
[Figure: decision tree over the design choices below, with cited systems at the leaves]
• Factored state (multistream structure)? Yes / No
• State asynchrony: free within unit; soft asynchrony within word; cross-word soft asynchrony; coupled state transitions
• Factored obs model? Y / N
• Obs model: GM, SVM, NN
• CD? Y / N
• Systems placed in the tree: [Deng ’97, Richardson ’00], [Kirchhoff ’96, Wester et al. ’04], [Metze ’02], [Kirchhoff ’02], [Juneja ’04], [Livescu ’04], [Livescu ’05], [WS04], FHMMs, CHMMs; many leaves are marked ???
• ... plus, a variety of feature sets!

Definitions: Pronunciation and observation modeling
• Language model P(w): w = “makes sense...”
• Pronunciation model P(q|w): q = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ]
• Observation model P(o|q): o = [figure: sequence of acoustic observations]
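The three models P(w), P(q|w), and P(o|q) combine in the usual decoding criterion. The following is a sketch in the slide's notation, assuming that the observations o depend on the words w only through the state sequence q; the symbol \hat{w} for the recognized word sequence is introduced here for illustration.

```latex
\hat{w} \;=\; \arg\max_{w} P(w \mid o)
        \;=\; \arg\max_{w} \; P(w) \sum_{q} P(q \mid w)\, P(o \mid q)
```

In Viterbi decoding the sum over q is commonly replaced by a maximum over q.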

Project goals
Building complete AF-based recognizers and understanding the design issues involved
A world of areas to explore...
• Comparisons of
  • Observation models: Gaussian mixtures over acoustic features, hybrid models, tandem models
  • Pronunciation models: articulatory asynchrony and substitution models
• Analysis of articulatory phenomena: dependence on context, speaker, speaking rate, speaking style, ...
• Application of AFSR to audio-visual speech recognition
• Resources
  • Feature sets
  • Manual and automatic AF alignments
  • Tools

That was the vision...
At WS06, we focused on
• AF-based observation models in the context of phone-based recognizers
• AF-based pronunciation models with Gaussian mixture-based observation models
• AF-based audio-visual speech recognition
• Resources
  • Manual feature alignments
  • Tools: tying, visualization, parallel training and decoding
We did not focus on
• Integration of AF-based pronunciation models with different observation models
• Large-scale analysis of articulatory phenomena

Outline
• Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie
• Multistream AF-based pronunciation models
  • For audio-only recognition
  • For audio-visual recognition
• AF-based observation models
  • Hybrid
  • Tandem
BREAK
• Analysis of classifiers and recognizer alignments
• Student proposals for future work
• Summary and future work

Dynamic Bayesian networks (DBNs)
• BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times
• Useful for modeling time series (e.g. speech!)
[Figure: a network over variables A, B, C, D repeated in frame i-1, frame i, frame i+1]

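Representations of HMMs as DBNs
[Figure: the same 3-state HMM drawn as a finite-state network (FSN) and as a DBN over frames i-1, i, i+1. Notation: in the FSN, nodes are states and arcs are allowed transitions; in the DBN, nodes are the variables Qi and obsi and arcs are dependencies]
Transition probabilities P(qi | qi-1):
  qi-1     qi=1   qi=2   qi=3
  1        .7     .3     0
  2        0      .8     .2
  3        0      0      1
Each state q = 1, 2, 3 has its own observation distribution P(obsi | qi)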
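As an illustration of how this HMM assigns a probability to an observation sequence, here is a minimal sketch of the forward recursion in Python. The transition matrix is the one on the slide; the start distribution `pi` and the discrete emission table `B` are toy assumptions made only for this example (the workshop systems use Gaussian mixtures over MFCC/PLP vectors), and states are indexed from 0 in the code.

```python
import numpy as np

# Transition matrix from the slide: A[i, j] = P(q_t = j | q_{t-1} = i), states 0..2.
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])

# Assumptions for illustration only (not on the slide):
pi = np.array([1.0, 0.0, 0.0])      # always start in the first state
B = np.array([[0.9, 0.1],           # B[q, o] = P(obs = o | state = q), two toy symbols
              [0.5, 0.5],
              [0.1, 0.9]])


def forward_likelihood(obs, A, B, pi):
    """Return p(obs) by summing over all state sequences (forward algorithm)."""
    alpha = pi * B[:, obs[0]]               # alpha[q] = p(obs_0, q_0 = q)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]       # propagate one frame, then emit
    return float(alpha.sum())


print(forward_likelihood([0, 0, 1, 1], A, B, pi))
```

The loop body is exactly one "frame" of the DBN: a transition dependency followed by an observation dependency.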

A phone HMM-based recognizer
• Standard phone HMM-based recognizer with bigram language model
[Figure: DBN over frame 0, ..., frame i, ..., last frame, with the following variables in each frame]
  variable name      values
  word               {“one”, “two”, ...}
  wordTransition     {0,1}
  subWordState       {0,1,2,...}
  stateTransition    {0,1}
  phoneState         {w1, w2, w3, s1, s2, s3, ...}
  observation        vector (MFCCs, PLPs)

Inference
• Definition:
  • Computation of the probability of one subset of the variables given another subset
• Inference is a subroutine of:
  • Viterbi decoding: argmax p(word, subWordState, phoneState, ... | obs)
  • Maximum-likelihood parameter estimation: θ* = argmax_θ p(obs | θ)
• For WS06, all models implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes ’02]
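To make the Viterbi decoding bullet concrete, the following is a minimal Python sketch for a plain HMM. It is not GMTK code: the function name `viterbi`, the start distribution, and the toy discrete emission table are assumptions for illustration, and the transition matrix is the 3-state example from the HMM slide with states indexed from 0.

```python
import numpy as np


def viterbi(obs, A, B, pi):
    """Return the most probable state path and its log joint probability."""
    with np.errstate(divide="ignore"):      # log(0) = -inf is intended here
        logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    T, N = len(obs), A.shape[0]
    delta = logpi + logB[:, obs[0]]         # best log score of paths ending in each state
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA      # scores[i, j]: best path into i, then i -> j
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):           # follow back-pointers from the last frame
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(delta.max())


# Toy model (assumed values, except the transition matrix).
A = np.array([[0.7, 0.3, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
pi = np.array([1.0, 0.0, 0.0])

best_path, best_logp = viterbi([0, 0, 1, 1], A, B, pi)
print(best_path, best_logp)
```

In the recognizers described here the "state" is the joint value of several variables (word, subWordState, phoneState, ...), but the argmax has the same form.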

Articulatory feature sets • We use separate feature sets for pronunciation and observation modeling • Why? • For observation modeling, want features that are acoustically distinguishable • For pronunciation modeling, want features that can be modeled as independent streams

Feature set for pronunciation modeling
• Based on articulatory phonology [Browman & Goldstein ’90], adapted for pronunciation modeling [Livescu ’05]
• Features: LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS
• Under some simplifying assumptions, can combine into 3 streams

Feature set for observation modeling

SVitchboard presenter: Simon King

Data: SVitchboard - Small Vocabulary Switchboard (SVitchboard = SVB)
• SVitchboard [King, Bartels & Bilmes, 2005] is a collection of small-vocabulary tasks extracted from Switchboard 1
• Closed vocabulary: no OOV issues
• Various tasks of increasing vocabulary sizes: 10, … 500 words
• Pre-defined train/validation/test sets and 5-fold cross-validation scheme
• Utterance fragments extracted from SWB 1
  • always surrounded by silence
• Word alignments available (msstate)
• Whole word HMM baselines already built

SVitchboard: amount of data

SVitchboard: word frequency distributions

SVitchboard: number of words per utterance

SVitchboard: example utterances
• 10 word task
  • oh
  • right
  • oh really
  • so
  • well the
• 500 word task
  • oh how funny
  • oh no
  • i feel like they need a big home a nice place where someone can have the time to play with them and things but i can't give them up
  • oh
  • oh i know it's like the end of the world
  • i know i love mine too

SVitchboard: isn’t it too easy (or too hard)? • No (no). • Results on the 500 word task test set using a recent SRI system: • SVitchboard data included in the training set for this system • SRI system has 50k vocab • System not tuned to SVB in any way

SVitchboard: what is the point of a 10 word task?
• Originally designed for debugging purposes
• However, results on the 10 and 500 word tasks obtained in this workshop show good correlation between WERs on the two tasks
[Figure: scatter plot "WER on 500 word task vs 10 word task"; x-axis: WER (%) on the 10 word task, 15 to 29; y-axis: WER (%) on the 500 word task, 50 to 85]

SVitchboard: pre-existing baseline word error rates • Whole word HMMs trained on SVitchboard • these results are from [King, Bartels & Bilmes, 2005] • Built with HTK • Use MFCC observations

SVitchboard: experimental technique
• We only ran task 1 of SVitchboard (the first of the 5 cross-validation folds)
  • Training set is known as “ABC”
  • Validation set is known as “D”
  • Test set is known as “E”
• SVitchboard defines cross-validation sets
  • But these were too big for the very large number of experiments we ran
• We mainly used a fixed, randomly chosen 500-utterance subset of “D”, which we call the small validation set
• All validation set results reported today are on this set, unless stated otherwise

SVitchboard: experimental technique • SVitchboard includes word alignments • We found that using these made training significantly faster and gave improved results in most cases • Word alignments are only ever used during training • Results above are for a monophone HMM with PLP observations

SVitchboard: workshop baseline word error rates • Monophone HMMs trained on SVitchboard • PLP observations

SVitchboard: workshop baseline word error rates • Triphone HMMs trained on SVitchboard • PLP observations • 500 word task only • (GMTK system was trained without word alignments)

SVitchboard: baseline word error rates summary • Test set word error rates

gmtkTie presenter: Simon King

gmtkTie
• General parameter clustering and tying tool for GMTK
• Written for this workshop
• Currently most developed parts:
  • Decision-tree clustering of Gaussians, using same technique as HTK
  • Bottom-up agglomerative clustering
• Decision-tree tying was tested in this workshop on various observation models using Gaussians
  • Conventional triphone models
  • Tandem models, including with factored observation streams
  • Feature-based models
• Can tie based on values of any variables in the graph, not just the phone state (e.g. feature values)
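For readers unfamiliar with bottom-up agglomerative clustering, here is a minimal generic sketch in Python. It is not gmtkTie code and does not reflect its interface or its distance measures: the function name `agglomerative_cluster`, the use of Euclidean distance between Gaussian means, and the occupancy-weighted centroid update are all assumptions chosen to keep the example short.

```python
import numpy as np


def agglomerative_cluster(means, counts, num_clusters):
    """Greedily merge the two clusters whose centroids are closest.

    means:  one mean vector per Gaussian
    counts: occupancy count per Gaussian, used to weight merged centroids
    Returns a list of clusters, each a sorted list of Gaussian indices.
    """
    clusters = [([i], np.asarray(m, dtype=float), float(c))
                for i, (m, c) in enumerate(zip(means, counts))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(clusters[a][1] - clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        ids_a, mu_a, n_a = clusters[a]
        ids_b, mu_b, n_b = clusters[b]
        merged = (ids_a + ids_b,
                  (n_a * mu_a + n_b * mu_b) / (n_a + n_b),   # weighted centroid
                  n_a + n_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [sorted(ids) for ids, _, _ in clusters]


# Four toy Gaussians forming two obvious groups.
toy_means = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]]
toy_counts = [10, 10, 10, 10]
print(agglomerative_cluster(toy_means, toy_counts, num_clusters=2))
```

All Gaussians that end up in the same cluster would then share (tie) one set of parameters.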

gmtkTie • gmtkTie is more general than HTK HHEd • HTK asks questions about previous/next phone identity • HTK clusters states only within the same phone • gmtkTie can ask user-supplied questions about user-supplied features: no assumptions about states, triphones, or anything else • gmtkTie clusters user-defined groups of parameters, not just states • gmtkTie can compute cluster sizes and centroids in lots of different ways • GMTK/gmtkTie triphone system built in this workshop is at least as good as HTK system

gmtkTie: conclusions
• It works!
• Triphone performance at least as good as HTK
• Can cluster arbitrary groups of parameters, asking questions about any feature the user can supply
  • Later in this presentation, we will see an example of separately clustering the Gaussians for two observation streams
• Opens up new possibilities for clustering
• Much to explore:
  • Building different decision trees for various factorings of the acoustic observation vector
  • Asking questions about other contextual factors

Multistream AF-based pronunciation models presenters: Karen Livescu, Chris Bartels, Nash Borges, Bronwyn Woods

Multi-stream AF-based pronunciation models
• Phone-based: q (phonetic state) → o (observation vector)
• AF-based: qi (state of AF i), one per feature stream → o (obs vector)

Motivation: Pronunciation variation [From data of Greenberg et al. ’96]
Each entry gives the word, its baseform, and its surface (actual) pronunciations, with counts in parentheses.
• probably: baseform p r aa b ax b l iy; surface: (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy
• don’t: baseform d ow n t; surface: (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (3) n ax, (2) d ax n, (2) ax, (1) n uw, ...
• sense: baseform s eh n s; surface: (1) s eh n t s, (1) s ih t s
• everybody: baseform eh v r iy b ah d iy; surface: (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy

Pronunciation variation and ASR performance • Automatic speech recognition (ASR) is strongly affected by pronunciation variation • Words produced non-canonically are more likely to be mis-recognized [Fosler-Lussier ‘99] • Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. ‘96]

Phone-based pronunciation modeling
• Address the pronunciation variation issue by substituting, inserting, or deleting segments
  • Example: dictionary / s eh n s / → [t] insertion rule → sense [ s eh n t s ]
• Suffers from low coverage of conversational pronunciations and sparse data
• Partial changes are not well described [Saraclar et al. ’03]
• Increased inter-word confusability
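The [t] insertion example can be sketched as a simple rewrite rule over phone strings. This is a hypothetical illustration, not the rule formalism of any particular system: the function `apply_insertion_rule` and the extra dictionary entry for "cents" are assumptions added to show how a rule-generated variant can collide with another word's baseform.

```python
def apply_insertion_rule(phones, left, right, inserted):
    """Insert `inserted` between every adjacent (left, right) phone pair."""
    out = []
    for i, phone in enumerate(phones):
        out.append(phone)
        if phone == left and i + 1 < len(phones) and phones[i + 1] == right:
            out.append(inserted)
    return out


# "sense" is the baseform from the slide; "cents" is an illustrative assumption.
dictionary = {
    "sense": ["s", "eh", "n", "s"],
    "cents": ["s", "eh", "n", "t", "s"],
}

# [t] insertion between n and s.
surface = apply_insertion_rule(dictionary["sense"], "n", "s", "t")

# Words whose baseform is identical to the new surface variant of "sense".
confusable = [word for word, base in dictionary.items()
              if word != "sense" and base == surface]
print(surface, confusable)
```

Adding the variant covers one more pronunciation of "sense" but makes it indistinguishable from another word at the phone level, which is the confusability cost named on the slide.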

Revisiting examples
Feature values are listed per stream in time order; streams change value at different times, so the number of values differs across streams.
• Dictionary: phones s eh n s
  • GLO: open, critical, open
  • VEL: closed, open, closed
  • TB: mid / uvular, mid / palatal, mid / uvular
  • TT: critical / alveolar, mid / alveolar, closed / alveolar, critical / alveolar
• Surface variant #1 (example of feature asynchrony): phones s eh n t s
  • GLO: open, critical, open
  • VEL: closed, open, closed
  • TB: mid / uvular, mid / palatal, mid / uvular
  • TT: critical / alveolar, mid / alveolar, closed / alveolar, critical / alveolar
• Surface variant #2 (example of feature asynchrony + substitution): phones s ih_n t s
  • GLO: open, critical, open
  • VEL: closed, open, closed
  • TB: mid / uvular, mid-nar / palatal, mid / uvular
  • TT: critical / alveolar, mid-nar / alveolar, closed / alveolar, critical / alveolar

A more complex example
• everybody [ eh r uw ay ]
[Figure: the phonetic transcription aligned with the variables subWordStateL, L, Lsurface and subWordStateT, T, Tsurface]

Can we take advantage of these intuitions? • In lexical access experiments with oracle feature alignments, yes: • Lexical access accuracy improves significantly using articulatory model with asynchrony and context-independent substitutions [Livescu & Glass ’04] • WS06 goal: Scale up to a complete recognizer • Challenges • Computational complexity • Modeling the relationship between features and noisy acoustic observations

Reminder: phone-based model
[Figure: DBN over frame 0, ..., frame i, ..., last frame, with the following variables in each frame]
  variable name      values
  word               {“one”, “two”, ...}
  wordTransition     {0,1}
  subWordState       {0,1,2,...}
  stateTransition    {0,1}
  phoneState         {w1, w2, w3, s1, s2, s3, ...}
  observation
(Note: missing pronunciation variants)

Recognition with a multistream pronunciation model
[Figure: DBN with shared variables word and wordTransition, an async variable, and one group of variables per feature stream: wordTransitionL, subWordStateL, stateTransitionL, phoneStateL, L and wordTransitionT, subWordStateT, stateTransitionT, phoneStateT, T]
• Degree of asynchrony ≡ |subWordStateL - subWordStateT|
• Forces synchronization at word boundaries
• Allows only asynchrony, no substitutions
• Differences from implemented model:
  • Additional feature stream (G)
  • Pronunciation variants
  • Word transition bookkeeping
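The asynchrony constraint between the L and T streams can be sketched in a few lines of Python. This is a simplified illustration, not the GMTK model: it uses a hard cap `max_async` where the model has an async variable with its own distribution, and it encodes synchronization at word boundaries as "both streams are in the final sub-word state". All function names and the cap are assumptions.

```python
from itertools import product


def degree_of_asynchrony(sub_l, sub_t):
    """|subWordStateL - subWordStateT| for one frame."""
    return abs(sub_l - sub_t)


def allowed_joint_states(num_states, max_async):
    """All (subWordStateL, subWordStateT) pairs within the asynchrony cap."""
    return [(l, t) for l, t in product(range(num_states), repeat=2)
            if degree_of_asynchrony(l, t) <= max_async]


def can_end_word(sub_l, sub_t, num_states):
    """Assumed boundary rule: both streams must be in the last sub-word state."""
    last = num_states - 1
    return sub_l == last and sub_t == last


# A word with 4 sub-word states per stream, streams at most one state apart.
pairs = allowed_joint_states(num_states=4, max_async=1)
print(len(pairs), pairs)
```

With `max_async=0` the two streams move in lockstep and the model reduces to a single synchronous state sequence; larger caps enlarge the joint state space, which is one source of the computational cost mentioned earlier.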
