Articulatory Feature-Based Speech Recognition
JHU WS06 Final team presentation
August 17, 2006
[Title-slide figure: a DBN fragment with word, ind1–3, U1–3, sync1,2, sync2,3, and S1–3 nodes]
Project Participants Team members: Karen Livescu (MIT) Arthur Kantor (UIUC) Özgür Çetin (ICSI) Partha Lal (Edinburgh) Mark Hasegawa-Johnson (UIUC) Lisa Yung (JHU) Simon King (Edinburgh) Ari Bezman (Dartmouth) Nash Borges (DoD, JHU) Stephen Dawson-Haggerty (Harvard) Chris Bartels (UW) Bronwyn Woods (Swarthmore) Advisors/satellite members: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Daryush Mehta (MIT), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)
Why are we here? • Why articulatory feature-based ASR? • Improved modeling of co-articulation • Potential savings in training data • Compatibility with more recent theories of phonology (autosegmental phonology, articulatory phonology) • Application to audio-visual and multilingual ASR • Improved ASR performance with feature-based observation models in some conditions [e.g. Kirchhoff ‘02, Soltau et al. ‘02] • Improved lexical access in experiments with oracle feature transcriptions [Livescu & Glass ’04, Livescu ‘05] • Why now? • A number of sites are working on complementary aspects of this idea: U. Edinburgh (King et al.), UIUC (Hasegawa-Johnson et al.), MIT (Livescu et al.) • Recently developed tools (e.g. GMTK) for systematic exploration of the model space
A brief history • Many have argued for replacing the single phone stream with multiple sub-phonetic feature streams [Rose et al., Ostendorf, Nock, Niyogi et al.] • Many have worked on parts of the problem • AF classification/recognition [Kirchhoff, King, Frankel, Wester, Richmond, Hasegawa-Johnson, Borys, Metze, Fosler-Lussier, Greenberg, Chang, Saenko, ...] • Pronunciation modeling [Livescu & Glass, Bates] • Some have combined AF classifiers with phone-based recognizers [Kirchhoff, King, Metze, Soltau, ...] • Some have built HMMs by combining AF states into product states [Deng et al., Richardson and Bilmes] • Only very recently has work begun on end-to-end recognition with multiple streams of AF states [Hasegawa-Johnson et al., Livescu] • No prior work on AF-based models for AVSR
A (partial) taxonomy of design issues
• Factored state (multistream structure)? Factored observation model? Coupled state transitions?
• State asynchrony: within unit, within word, cross-word, or free; soft asynchrony
• Observation model: GM, SVM, or NN; context-dependent (CD) or not
• Example points in this space: FHMMs [Kirchhoff ’96, Wester et al. ‘04], CHMMs [Deng ’97, Richardson ’00], [Livescu ’04], [Livescu ’05], [Metze ’02], [Kirchhoff ’02], [Juneja ’04], [WS04]
• ... plus, a variety of feature sets!
[Slide figure: a decision tree over these design choices, locating each cited system]
Definitions: Pronunciation and observation modeling
• Language model P(w): w = “makes sense...”
• Pronunciation model P(q|w): q = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ]
• Observation model P(o|q): o = the acoustic observation vectors
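Combining the three components, the standard decoding criterion these definitions imply (a textbook restatement rather than a formula from the slide) is:

```latex
w^* = \arg\max_{w} P(w) \sum_{q} P(q \mid w)\, P(o \mid q)
\;\approx\; \arg\max_{w} \max_{q} P(w)\, P(q \mid w)\, P(o \mid q) \quad \text{(Viterbi approximation)}
```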
Project goals
Building complete AF-based recognizers and understanding the design issues involved. A world of areas to explore...
• Comparisons of observation models (Gaussian mixtures over acoustic features, hybrid models, tandem models) and pronunciation models (articulatory asynchrony and substitution models)
• Analysis of articulatory phenomena: dependence on context, speaker, speaking rate, speaking style, ...
• Application of AFSR to audio-visual speech recognition
• Resources: feature sets, manual and automatic AF alignments, tools
That was the vision... At WS06, we focused on
• AF-based observation models in the context of phone-based recognizers
• AF-based pronunciation models with Gaussian mixture-based observation models
• AF-based audio-visual speech recognition
• Resources: manual feature alignments; tools for tying, visualization, and parallel training and decoding
We did not focus on
• Integration of AF-based pronunciation models with different observation models
• Large-scale analysis of articulatory phenomena
Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work
Bayesian networks (BNs)
• Directed acyclic graph (DAG) with a one-to-one correspondence between nodes and variables X1, X2, ..., XN
• Node Xi with parents pa(Xi) has a “local” probability function p(Xi | pa(Xi))
• Joint probability = product of local probabilities: p(x1, ..., xN) = ∏i p(xi | pa(xi))
• Example (the slide’s four-node network A → B, B → C, B → D, C → D): p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)
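As a small worked illustration of this factorization (toy binary variables and made-up probability tables, not from the slides), a Python sketch:

```python
# Minimal sketch: the BN joint p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)
# with hypothetical binary variables and made-up local probability tables.

p_a = {0: 0.6, 1: 0.4}                                        # p(a)
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}      # p(b|a)
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}      # p(c|b)
p_d_given_bc = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.4, 1: 0.6},
                (1, 0): {0: 0.3, 1: 0.7}, (1, 1): {0: 0.1, 1: 0.9}}  # p(d|b,c)

def joint(a, b, c, d):
    """Joint probability as the product of the local probabilities."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] * p_d_given_bc[(b, c)][d]

# Sanity check: the joint sums to 1 over all assignments.
total = sum(joint(a, b, c, d) for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1))
print(joint(1, 0, 1, 0), total)   # total should be 1.0
```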
Dynamic Bayesian networks (DBNs)
• BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times
• Useful for modeling time series (e.g. speech!)
[Slide figure: the A, B, C, D network from the previous slide, repeated over frames i-1, i, i+1]
Representations of HMMs as DBNs
• FSN view: states 1, 2, 3 with allowed transitions 1→1 (0.7), 1→2 (0.3), 2→2 (0.8), 2→3 (0.2), 3→3 (1.0), each state emitting an observation (obs)
• DBN view: per-frame variables qi and obsi (frame i-1, frame i, frame i+1, ...), with dependencies P(qi | qi-1) and P(obsi | qi)
• Transition table P(qi | qi-1): from q=1: .7 .3 0; from q=2: 0 .8 .2; from q=3: 0 0 1
• Notation in the figures: variable, state, dependency, allowed transition
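A tiny sketch that encodes the slide's transition structure as the local probability P(qi | qi-1) and scores a state path; the probabilities come from the slide, but the code itself is only an illustration (not workshop code):

```python
import numpy as np

# Left-to-right HMM transitions from the slide, as the DBN local probability P(q_i | q_{i-1}).
A = np.array([[0.7, 0.3, 0.0],   # from state 1
              [0.0, 0.8, 0.2],   # from state 2
              [0.0, 0.0, 1.0]])  # from state 3 (self-loop only)

def path_prob(states):
    """Probability of a state path (1-indexed, as on the slide), ignoring the observations."""
    p = 1.0
    for prev, nxt in zip(states[:-1], states[1:]):
        p *= A[prev - 1, nxt - 1]
    return p

print(path_prob([1, 1, 2, 2, 3]))   # = 0.7 * 0.3 * 0.8 * 0.2 = 0.0336
```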
A phone HMM-based recognizer
• Standard phone HMM-based recognizer with bigram language model
• Per-frame variables (frame 0, ..., frame i, ..., last frame) and their values:
• word: {“one”, “two”, ...}
• wordTransition: {0, 1}
• subWordState: {0, 1, 2, ...}
• stateTransition: {0, 1}
• phoneState: {w1, w2, w3, s1, s2, s3, ...}
• observation: acoustic feature vector (MFCCs, PLPs)
Inference
• Definition: computation of the probability of one subset of the variables given another subset
• Inference is a subroutine of:
• Viterbi decoding: argmax p(word, subWordState, phoneState, ... | obs)
• Maximum-likelihood parameter estimation: θ* = argmaxθ p(obs | θ)
• For WS06, all models were implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes ‘02]
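GMTK performs this inference internally; for intuition, here is a generic Viterbi sketch for a plain HMM. It is not GMTK's implementation or the workshop's models, and the toy numbers are made up:

```python
import numpy as np

def viterbi(pi, A, B):
    """Most likely state path for an HMM: pi = initial probs (N,),
    A = transition matrix (N, N), B = per-frame state likelihoods (T, N)."""
    T, N = B.shape
    logA = np.log(np.maximum(A, 1e-300))            # floor zeros to avoid log(0)
    delta = np.log(np.maximum(pi, 1e-300)) + np.log(B[0])
    psi = np.zeros((T, N), dtype=int)               # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + logA              # scores[i, j]: best path ending in i, then i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[t])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy usage with made-up numbers (3 states, 4 frames).
pi = np.array([1.0, 1e-6, 1e-6])
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1, 0.1],
              [0.6, 0.4, 0.1],
              [0.2, 0.7, 0.2],
              [0.1, 0.3, 0.8]])
print(viterbi(pi, A, B))
```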
Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work
Articulatory feature sets • We use separate feature sets for pronunciation and observation modeling • Why? • For observation modeling, want features that are acoustically distinguishable • For pronunciation modeling, want features that can be modeled as independent streams
Feature set for pronunciation modeling
• Features: LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS
• Based on articulatory phonology [Browman & Goldstein ‘90], adapted for pronunciation modeling [Livescu ’05]
• Under some simplifying assumptions, these can be combined into 3 streams
Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work
SVitchboard presenter: Simon King
Data: SVitchboard - Small Vocabulary Switchboard • SVitchboard [King, Bartels & Bilmes, 2005] is a collection of small-vocabulary tasks extracted from Switchboard 1 • Closed vocabulary: no OOV issues • Various tasks of increasing vocabulary sizes: 10, … 500 words • Pre-defined train/validation/test sets • and 5-fold cross-validation scheme • Utterance fragments extracted from SWB 1 • always surrounded by silence • Word alignments available (msstate) • Whole word HMM baselines already built SVitchboard = SVB
SVitchboard: example utterances • 10 word task • oh • right • oh really • so • well the • 500 word task • oh how funny • oh no • i feel like they need a big home a nice place where someone can have the time to play with them and things but i can't give them up • oh • oh i know it's like the end of the world • i know i love mine too
SVitchboard: isn’t it too easy (or too hard)? • No (no). • Results on the 500 word task test set using a recent SRI system: • SVitchboard data included in the training set for this system • SRI system has 50k vocab • System not tuned to SVB in any way
SVitchboard: what is the point of a 10 word task?
• Originally designed for debugging purposes
• However, results on the 10 and 500 word tasks obtained in this workshop show good correlation between the WERs on the two tasks (a sketch of the computation follows)
[Slide figure: scatter plot of WER (%) on the 500 word task (roughly 50–85%) vs. WER (%) on the 10 word task (roughly 15–29%)]
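For concreteness, the kind of correlation shown in the scatter plot can be computed with a simple Pearson correlation over matched systems; the WER values below are hypothetical placeholders, not workshop results:

```python
import numpy as np

# Hypothetical (10-word task, 500-word task) WER pairs for several systems,
# illustrating how the correlation behind the scatter plot would be computed.
wer_10  = np.array([16.0, 18.5, 21.0, 24.0, 27.5])   # made-up values
wer_500 = np.array([52.0, 58.0, 63.0, 71.0, 80.0])   # made-up values
r = np.corrcoef(wer_10, wer_500)[0, 1]
print(f"Pearson r = {r:.3f}")
```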
SVitchboard: pre-existing baseline word error rates • Whole word HMMs trained on SVitchboard • these results are from [King, Bartels & Bilmes, 2005] • Built with HTK • Use MFCC observations
SVitchboard: experimental technique • We only performed task 1 of SVitchboard (the first of the 5 cross-fold sets) • The training set is known as “ABC” • The validation set is known as “D” • The test set is known as “E” • SVitchboard defines cross-validation sets • But these were too big for the very large number of experiments we ran • We mainly used a fixed 500-utterance randomly-chosen subset of “D”, which we call the small validation set • All validation set results reported today are on this set, unless stated otherwise
SVitchboard: experimental technique • SVitchboard includes word alignments • We found that using these made training significantly faster, and gave improved results in most cases • Word alignments are only ever used during training • The results above are for a monophone HMM with PLP observations
SVitchboard: workshop baseline word error rates • Monophone HMMs trained on SVitchboard • PLP observations
SVitchboard: workshop baseline word error rates • Triphone HMMs trained on SVitchboard • PLP observations • 500 word task only • (GMTK system was trained without word alignments)
SVitchboard: baseline word error rates summary • Test set word error rates
gmtkTie presenter: Simon King
gmtkTie • General parameter clustering and tying tool for GMTK • Written for this workshop • Currently most developed parts: • Decision-tree clustering of Gaussians, using same technique as HTK • Bottom-up agglomerative clustering • Decision-tree tying was tested in this workshop on various observation models using Gaussians • Conventional triphone models • Tandem models, including with factored observation streams • Feature based models • Can tie based on values of any variables in the graph, not just the phone state (e.g. feature values)
gmtkTie • gmtkTie is more general than HTK HHEd • HTK asks questions about previous/next phone identity • HTK clusters states only within the same phone • gmtkTie can ask user-supplied questions about user-supplied features: no assumptions about states, triphones, or anything else • gmtkTie clusters user-defined groups of parameters, not just states • gmtkTie can compute cluster sizes and centroids in lots of different ways • GMTK/gmtkTie triphone system built in this workshop is at least as good as HTK system
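The deck doesn't show gmtkTie's internals, so as a rough illustration of the HTK-style criterion it references (splitting a cluster of Gaussian states by the user-supplied question that maximizes the occupancy-weighted log-likelihood gain), here is a minimal Python sketch. The data layout, function names, and the context feature in the example are assumptions for illustration, not gmtkTie's actual API or file formats.

```python
import numpy as np

# Hypothetical data layout: each state to cluster is a tuple
# (context, occupancy, mean, variance), with diagonal-Gaussian statistics;
# questions are boolean functions of the context.

def cluster_loglike(items):
    """Occupancy-weighted log-likelihood (up to constants) of pooling a set of
    diagonal-Gaussian states into one cluster, as in HTK-style tree building."""
    occ = sum(o for _, o, _, _ in items)
    mean = sum(o * m for _, o, m, _ in items) / occ
    var = sum(o * (v + m ** 2) for _, o, m, v in items) / occ - mean ** 2
    return -0.5 * occ * float(np.sum(np.log(var) + 1.0))

def best_split(items, questions):
    """Pick the question whose yes/no split gives the largest log-likelihood gain."""
    base = cluster_loglike(items)
    best_q, best_gain = None, 0.0
    for q in questions:
        yes = [it for it in items if q(it[0])]
        no = [it for it in items if not q(it[0])]
        if not yes or not no:
            continue
        gain = cluster_loglike(yes) + cluster_loglike(no) - base
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain

# Tiny made-up example: two states distinguished by a user-supplied context feature.
items = [({"TT-LOC": "alveolar"}, 10.0, np.array([0.0, 1.0]), np.array([1.0, 1.0])),
         ({"TT-LOC": "palatal"},   5.0, np.array([2.0, 0.0]), np.array([1.5, 0.5]))]
questions = [lambda ctx: ctx["TT-LOC"] == "alveolar"]
q, gain = best_split(items, questions)
print(gain)
```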
gmtkTie: conclusions • It works! • Triphone performance at least as good as HTK • Can cluster arbitrary groups of parameters, asking questions about any feature the user can supply • Later in this presentation, we will see an example of separately clustering the Gaussians for two observation streams • Opens up new possibilities for clustering • Much to explore: • Building different decision trees for various factorings of the acoustic observation vector • Asking questions about other contextual factors
Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work
Multistream AF-based pronunciation models presenters: Karen Livescu, Chris Bartels, Nash Borges, Bronwyn Woods
Multi-stream AF-based pronunciation models
• Phone-based: a single stream of phonetic states q generates the observation vector o
• AF-based: multiple streams of feature states qi (state of AF i) jointly generate the observation vector o
Motivation: Pronunciation variation [from data of Greenberg et al. ‘96]
Baseforms vs. surface (actual) pronunciations, with counts:
• probably (baseform: p r aa b ax b l iy): (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy
• don’t (baseform: d ow n t): (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (2) n ax, (2) d ax n, (1) ax, (1) n uw, ...
• sense (baseform: s eh n s): (1) s eh n t s, (1) s ih t s
• everybody (baseform: eh v r iy b ah d iy): (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy
Pronunciation variation and ASR performance • Automatic speech recognition (ASR) is strongly affected by pronunciation variation • Words produced non-canonically are more likely to be mis-recognized [Fosler-Lussier ‘99] • Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. ‘96]
Phone-based pronunciation modeling
• Address pronunciation variation by substituting, inserting, or deleting segments: e.g. a [t]-insertion rule maps the dictionary form / s eh n s / of “sense” to the surface form [ s eh n t s ]
• Such rules suffer from low coverage of conversational pronunciations and sparse data
• Partial changes are not well described [Saraclar et al. ‘03]
• Added variants increase inter-word confusability
Revisiting the examples: feature values for “sense”
Dictionary (phones: s eh n s):
• GLO: open | critical | open
• VEL: closed | open | closed
• TB: mid/uvular | mid/palatal | mid/uvular
• TT: critical/alveolar | mid/alveolar | closed/alveolar | critical/alveolar
Surface variant #1 (phones: s eh n t s; example of feature asynchrony):
• GLO: open | critical | open
• VEL: closed | open | closed
• TB: mid/uvular | mid/palatal | mid/uvular
• TT: critical/alveolar | mid/alveolar | closed/alveolar | critical/alveolar
Surface variant #2 (phones: s ih t s; example of feature asynchrony + substitution):
• GLO: open | critical | open
• VEL: closed | open | closed
• TB: mid/uvular | mid-nar/palatal | mid/uvular
• TT: critical/alveolar | mid-nar/alveolar | closed/alveolar | critical/alveolar
A more complex example: everybody → [ eh r uw ay ]
[Slide figure: alignment of the phonetic transcription against two feature streams, with variables subWordStateL, L, Lsurface and subWordStateT, T, Tsurface]
Can we take advantage of these intuitions? • In lexical access experiments with oracle feature alignments, yes: • Lexical access accuracy improves significantly using articulatory model with asynchrony and context-independent substitutions [Livescu & Glass ’04] • WS06 goal: Scale up to a complete recognizer • Challenges • Computational complexity • Modeling the relationship between features and noisy acoustic observations
Reminder: phone-based model
• Per-frame variables (frame 0, ..., frame i, ..., last frame) and their values:
• word: {“one”, “two”, ...}
• wordTransition: {0, 1}
• subWordState: {0, 1, 2, ...}
• stateTransition: {0, 1}
• phoneState: {w1, w2, w3, s1, s2, s3, ...}
• observation
(Note: missing pronunciation variants)
Recognition with a multistream pronunciation model
[Slide figure: the phone-based DBN duplicated into two feature streams, L and T, each with its own subWordState, stateTransition, phoneState, and wordTransition variables, coupled by an async variable]
• Degree of asynchrony ≡ |subWordStateL − subWordStateT| (see the sketch below)
• Forces synchronization at word boundaries
• Allows only asynchrony, no substitutions
• Differences from the implemented model: additional feature stream (G), pronunciation variants, word transition bookkeeping
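A minimal sketch of the asynchrony constraint defined above, reusing the slide's variable names; the hard threshold max_async and the boolean check are illustrative assumptions, not the exact GMTK implementation:

```python
# The two feature streams' sub-word state indices may drift apart by at most
# max_async states, and must resynchronize at word boundaries. Illustration only.

def allowed(subWordStateL, subWordStateT, wordTransition, max_async=1):
    degree = abs(subWordStateL - subWordStateT)
    if wordTransition:                 # force synchronization at word boundaries
        return degree == 0
    return degree <= max_async

print(allowed(3, 2, wordTransition=False))   # True: within the allowed degree of asynchrony
print(allowed(4, 2, wordTransition=False))   # False: streams too far apart
print(allowed(3, 2, wordTransition=True))    # False: must be synchronized at the boundary
```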