
Articulatory Feature-Based Speech Recognition



Presentation Transcript


  1. Articulatory Feature-Based Speech Recognition. JHU WS06 final team presentation, August 17, 2006. [Title-slide figure: a multistream DBN fragment with variables word, ind1–ind3, U1–U3, S1–S3, and sync1,2 / sync2,3.]

  2. Project Participants. Team members: Karen Livescu (MIT), Arthur Kantor (UIUC), Özgür Çetin (ICSI), Partha Lal (Edinburgh), Mark Hasegawa-Johnson (UIUC), Lisa Yung (JHU), Simon King (Edinburgh), Ari Bezman (Dartmouth), Nash Borges (DoD, JHU), Stephen Dawson-Haggerty (Harvard), Chris Bartels (UW), Bronwyn Woods (Swarthmore). Advisors/satellite members: Jeff Bilmes (UW), Nancy Chen (MIT), Xuemin Chi (MIT), Trevor Darrell (MIT), Edward Flemming (MIT), Eric Fosler-Lussier (OSU), Joe Frankel (Edinburgh/ICSI), Jim Glass (MIT), Katrin Kirchhoff (UW), Lisa Lavoie (Elizacorp, Emerson), Mathew Magimai (ICSI), Daryush Mehta (MIT), Kate Saenko (MIT), Janet Slifka (MIT), Stefanie Shattuck-Hufnagel (MIT), Amar Subramanya (UW)

  3. Why are we here? • Why articulatory feature-based ASR? • Improved modeling of co-articulation • Potential savings in training data • Compatibility with more recent theories of phonology (autosegmental phonology, articulatory phonology) • Application to audio-visual and multilingual ASR • Improved ASR performance with feature-based observation models in some conditions [e.g. Kirchhoff ‘02, Soltau et al. ‘02] • Improved lexical access in experiments with oracle feature transcriptions [Livescu & Glass ’04, Livescu ‘05] • Why now? • A number of sites working on complementary aspects of this idea: U. Edinburgh (King et al.), UIUC (Hasegawa-Johnson et al.), MIT (Livescu et al.) • Recently developed tools (e.g. GMTK) for systematic exploration of the model space

  4. A brief history • Many have argued for replacing the single phone stream with multiple sub-phonetic feature streams [Rose et al., Ostendorf, Nock, Niyogi et al.] • Many have worked on parts of the problem • AF classification/recognition [Kirchhoff, King, Frankel, Wester, Richmond, Hasegawa-Johnson, Borys, Metze, Fosler-Lussier, Greenberg, Chang, Saenko, ...] • Pronunciation modeling [Livescu & Glass, Bates] • Some have combined AF classifiers with phone-based recognizers [Kirchhoff, King, Metze, Soltau, ...] • Some have built HMMs by combining AF states into product states [Deng et al., Richardson and Bilmes] • Only very recently, work has begun on end-to-end recognition with multiple streams of AF states [Hasegawa-Johnson et al., Livescu] • No prior work on AF-based models for AVSR

  5. A (partial) taxonomy of design issues. [Figure: a decision tree over model design choices, with example systems at the leaves.] The axes of the taxonomy: factored state (multistream structure)? factored observation model? state asynchrony (coupled state transitions; soft asynchrony within unit; soft asynchrony within word; free cross-word asynchrony); observation model type (Gaussian mixtures, SVM, NN); context-dependent (CD) or not. Example systems placed in this space include factorial HMMs [Deng ’97, Richardson ’00], coupled HMMs, [Livescu ’04], [Livescu ’05], [Metze ’02], [Kirchhoff ’96, Kirchhoff ’02], [Juneja ’04], [Wester et al. ‘04], and [WS04]; several cells remain unexplored (marked ??? in the figure). ... plus, a variety of feature sets!

  6. Definitions: Pronunciation and observation modeling. Language model P(w): w = “makes sense...”. Pronunciation model P(q|w): q = [ m m m ey1 ey1 ey2 k1 k1 k1 k2 k2 s ... ]. Observation model P(o|q): o = the acoustic observations (shown as a figure on the slide).

  7. Project goals: Building complete AF-based recognizers and understanding the design issues involved. A world of areas to explore... • Comparisons of observation models (Gaussian mixtures over acoustic features, hybrid models, tandem models) and pronunciation models (articulatory asynchrony and substitution models) • Analysis of articulatory phenomena: dependence on context, speaker, speaking rate, speaking style, ... • Application of AFSR to audio-visual speech recognition • Resources: feature sets, manual and automatic AF alignments, tools

  8. That was the vision... At WS06, we focused on: • AF-based observation models in the context of phone-based recognizers • AF-based pronunciation models with Gaussian mixture-based observation models • AF-based audio-visual speech recognition • Resources: manual feature alignments; tools for tying, visualization, and parallel training and decoding. We did not focus on: • Integration of AF-based pronunciation models with different observation models • Large-scale analysis of articulatory phenomena

  9. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work

  10. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work

  11. Bayesian networks (BNs) • Directed acyclic graph (DAG) with one-to-one correspondence between nodes and variables X1, X2, ..., XN • Node Xi with parents pa(Xi) has a “local” probability function p(Xi | pa(Xi)) • Joint probability = product of local probabilities: p(x1, ..., xN) = ∏i p(xi | pa(xi)) • Example (see the sketch below): DAG A → B → C, with D depending on B and C; local probabilities p(a), p(b|a), p(c|b), p(d|b,c), so p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)
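A minimal sketch (not from the slides) of the factorization on slide 11: the joint probability of the A → B → C, {B,C} → D example computed as a product of local tables. The numeric tables are made up for illustration.

    # Joint probability of a BN as a product of local probabilities, for the
    # example DAG A -> B -> C, with D depending on B and C.
    # The numeric tables below are made up for illustration.

    p_a = {0: 0.6, 1: 0.4}                                     # p(a)
    p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p(b|a)
    p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p(c|b)
    p_d_given_bc = {(0, 0): {0: 0.5, 1: 0.5},                  # p(d|b,c)
                    (0, 1): {0: 0.3, 1: 0.7},
                    (1, 0): {0: 0.8, 1: 0.2},
                    (1, 1): {0: 0.1, 1: 0.9}}

    def joint(a, b, c, d):
        """p(a,b,c,d) = p(a) p(b|a) p(c|b) p(d|b,c)"""
        return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] * p_d_given_bc[(b, c)][d]

    # The joint sums to 1 over all assignments, as it must for a valid BN.
    total = sum(joint(a, b, c, d)
                for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
    print(joint(1, 0, 1, 1), total)   # total == 1.0 (up to floating point)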

  12. Dynamic Bayesian networks (DBNs). [Figure: the A, B, C, D network repeated over frames i-1, i, i+1.] • BNs consisting of a structure that repeats an indefinite (i.e. dynamic) number of times • Useful for modeling time series (e.g. speech!)

  13. Notation: Representations of HMMs as DBNs. [Figure: a 3-state left-to-right HMM drawn two ways: as a finite-state network (FSN) with states 1, 2, 3, self-loop probabilities .7, .8, 1 and forward-transition probabilities .3, .2, each state with its own observation distribution; and as a DBN with variables qi-1, qi, qi+1 and obsi-1, obsi, obsi+1 over frames i-1, i, i+1, with dependencies P(qi | qi-1) and P(obsi | qi). Legend: variable, state, dependency, allowed transition.] The transition matrix P(qi | qi-1), with rows qi-1 = 1, 2, 3 and columns qi = 1, 2, 3, is [.7 .3 0; 0 .8 .2; 0 0 1]. (A toy version of this model appears in the sketch below.)
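A sketch of slide 13's 3-state left-to-right HMM written as the per-frame quantities of a DBN. The transition matrix is taken from the slide; the 1-D Gaussian emission parameters are made up, and states are 0-indexed here (slide states 1, 2, 3 correspond to 0, 1, 2).

    import numpy as np

    A = np.array([[0.7, 0.3, 0.0],    # P(q_i | q_{i-1}): rows q_{i-1}, columns q_i
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
    emission_means = np.array([-1.0, 0.0, 1.0])   # hypothetical per-state Gaussian means
    emission_stds = np.array([0.5, 0.5, 0.5])

    rng = np.random.default_rng(0)

    def sample(num_frames, start_state=0):
        """Unroll the per-frame DBN template: q_i ~ P(.|q_{i-1}), obs_i ~ P(.|q_i)."""
        states, observations = [start_state], []
        for i in range(num_frames):
            q = states[-1]
            observations.append(rng.normal(emission_means[q], emission_stds[q]))
            if i + 1 < num_frames:
                states.append(rng.choice(3, p=A[q]))
        return states, observations

    print(sample(10))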

  14. A phone HMM-based recognizer. • Standard phone HMM-based recognizer with bigram language model • Variables (name: values), repeated from frame 0 to the last frame: word: {“one”, “two”, ...}; wordTransition: {0,1}; subWordState: {0,1,2,...}; stateTransition: {0,1}; phoneState: {w1, w2, w3, s1, s2, s3, ...}; observation: acoustic vector (MFCCs, PLPs). (A toy sketch of the deterministic bookkeeping among these variables follows below.)
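A sketch (not GMTK code) of the deterministic per-frame bookkeeping in a recognizer DBN like the one on slide 14: phoneState is looked up from the current word and subWordState, and the word can end only when the last sub-word state fires a stateTransition. The tiny lexicon is hypothetical and collapses the word-to-phone-to-state expansion of the real model into one table.

    # Hypothetical 3-state word models; a real lexicon expands words into phones
    # and phones into HMM states.
    LEXICON = {"one": ["w1", "w2", "w3"],
               "two": ["t1", "t2", "t3"]}

    def next_frame(word, sub_word_state, state_transition):
        """Advance (word, subWordState) given this frame's stateTransition bit.
        Returns (word, next subWordState, phoneState, wordTransition)."""
        phone_state = LEXICON[word][sub_word_state]
        if not state_transition:                      # stay in the same sub-word state
            return word, sub_word_state, phone_state, 0
        if sub_word_state + 1 < len(LEXICON[word]):   # move to the next sub-word state
            return word, sub_word_state + 1, phone_state, 0
        # wordTransition = 1: in the full DBN, the (bigram) language model
        # chooses the next word at the following frame.
        return word, 0, phone_state, 1

    print(next_frame("one", 2, True))   # ('one', 0, 'w3', 1)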

  15. Inference • Definition: computation of the probability of one subset of the variables given another subset • Inference is a subroutine of: • Viterbi decoding: argmax p(word, subWordState, phoneState, ... | obs) • Maximum-likelihood parameter estimation: θ* = argmaxθ p(obs | θ) • For WS06, all models were implemented, trained, and tested using the Graphical Models Toolkit (GMTK) [Bilmes ‘02]. (A toy Viterbi decoder for the HMM of slide 13 is sketched below.)
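To make the Viterbi computation concrete, here is a toy decoder for the 3-state HMM of slide 13, i.e. the argmax over hidden state sequences given the observations. GMTK performs this kind of inference on the full DBN; the emission Gaussians below are made up, and states are 0-indexed.

    import numpy as np

    A = np.array([[0.7, 0.3, 0.0],
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
    means, stds = np.array([-1.0, 0.0, 1.0]), np.array([0.5, 0.5, 0.5])

    def log_emission(obs):
        """log P(obs | q) for each state q, under the toy Gaussian model."""
        return -0.5 * ((obs - means) / stds) ** 2 - np.log(stds * np.sqrt(2 * np.pi))

    def viterbi(observations):
        T, S = len(observations), len(means)
        log_delta = np.full((T, S), -np.inf)
        backptr = np.zeros((T, S), dtype=int)
        log_delta[0] = log_emission(observations[0]) + np.log([1.0, 1e-12, 1e-12])  # start in state 0
        logA = np.log(np.maximum(A, 1e-12))
        for t in range(1, T):
            scores = log_delta[t - 1][:, None] + logA    # scores[i, j]: best path ending i -> j
            backptr[t] = scores.argmax(axis=0)
            log_delta[t] = scores.max(axis=0) + log_emission(observations[t])
        path = [int(log_delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t][path[-1]]))
        return path[::-1]

    print(viterbi([-1.1, -0.9, 0.1, 0.2, 1.3]))   # [0, 0, 1, 1, 2]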

  16. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work

  17. Articulatory feature sets • We use separate feature sets for pronunciation and observation modeling • Why? • For observation modeling, want features that are acoustically distinguishable • For pronunciation modeling, want features that can be modeled as independent streams

  18. Feature set for pronunciation modeling. Features: LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS. • Based on articulatory phonology [Browman & Goldstein ‘90], adapted for pronunciation modeling [Livescu ’05] • Under some simplifying assumptions, these can be combined into 3 streams (one possible grouping is sketched below)
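The slide does not spell out the grouping into 3 streams, so the sketch below is an assumption on our part, following the stream names L, T, and G that appear later in the pronunciation-model slides (lips, tongue, glottis/velum).

    # Assumed grouping of the eight pronunciation-modeling features into the
    # three streams (L, T, G) used later in this presentation; the exact
    # grouping is our assumption, not stated on this slide.
    FEATURE_STREAMS = {
        "L": ["LIP-LOC", "LIP-OP"],                      # lip constriction location/degree
        "T": ["TT-LOC", "TT-OP", "TB-LOC", "TB-OP"],     # tongue tip and tongue body
        "G": ["GLOTTIS", "VELUM"],                       # glottal and velic state
    }

    def stream_of(feature):
        """Return the stream a given articulatory feature belongs to."""
        for stream, features in FEATURE_STREAMS.items():
            if feature in features:
                return stream
        raise KeyError(feature)

    print(stream_of("TB-OP"))   # 'T'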

  19. Feature set for observation modeling

  20. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work

  21. SVitchboard presenter: Simon King

  22. Data: SVitchboard - Small Vocabulary Switchboard • SVitchboard [King, Bartels & Bilmes, 2005] is a collection of small-vocabulary tasks extracted from Switchboard 1 • Closed vocabulary: no OOV issues • Various tasks of increasing vocabulary sizes: 10, … 500 words • Pre-defined train/validation/test sets • and 5-fold cross-validation scheme • Utterance fragments extracted from SWB 1 • always surrounded by silence • Word alignments available (msstate) • Whole word HMM baselines already built SVitchboard = SVB

  23. SVitchboard: amount of data

  24. SVitchboard: amount of data

  25. SVitchboard: word frequency distributions

  26. SVitchboard: number of words per utterance

  27. SVitchboard: example utterances • 10 word task • oh • right • oh really • so • well the • 500 word task • oh how funny • oh no • i feel like they need a big home a nice place where someone can have the time to play with them and things but i can't give them up • oh • oh i know it's like the end of the world • i know i love mine too

  28. SVitchboard: isn’t it too easy (or too hard)? • No (no). • Results on the 500 word task test set using a recent SRI system: • SVitchboard data included in the training set for this system • SRI system has 50k vocab • System not tuned to SVB in any way

  29. SVitchboard: what is the point of a 10 word task? • Originally designed for debugging purposes • However, results on the 10 and 500 word tasks obtained in this workshop show good correlation between WERs on the two tasks. [Scatter plot: WER (%) on the 500 word task (roughly 50–85%) vs WER (%) on the 10 word task (roughly 15–29%).]

  30. SVitchboard: pre-existing baseline word error rates • Whole word HMMs trained on SVitchboard • these results are from [King, Bartels & Bilmes, 2005] • Built with HTK • Use MFCC observations

  31. SVitchboard: experimental technique • We only performed task 1 of SVitchboard (the first of 5 cross-fold sets) • Training set is known as “ABC” • Validation set is known as “D” • Test set is known as “E” • SVitchboard defines cross-validation sets • But these were too big for the very large number of experiments we ran • We mainly used a fixed 500 utterance randomly-chosen subset of “D”, which we call the small validation set • All validation set results reported today are on this set, unless stated otherwise

  32. SVitchboard: experimental technique • SVitchboard includes word alignments • We found that using these made training significantly faster, and gave improved results in most cases • Word alignments are only ever used during training • The results above are for a monophone HMM with PLP observations

  33. SVitchboard: workshop baseline word error rates • Monophone HMMs trained on SVitchboard • PLP observations

  34. SVitchboard: workshop baseline word error rates • Triphone HMMs trained on SVitchboard • PLP observations • 500 word task only • (GMTK system was trained without word alignments)

  35. SVitchboard: baseline word error rates summary • Test set word error rates

  36. gmtkTie presenter: Simon King

  37. gmtkTie • General parameter clustering and tying tool for GMTK • Written for this workshop • Currently most developed parts: • Decision-tree clustering of Gaussians, using same technique as HTK • Bottom-up agglomerative clustering • Decision-tree tying was tested in this workshop on various observation models using Gaussians • Conventional triphone models • Tandem models, including with factored observation streams • Feature based models • Can tie based on values of any variables in the graph, not just the phone state (e.g. feature values)
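gmtkTie's decision-tree clustering is described as using the same technique as HTK; the sketch below shows that HTK-style likelihood split criterion for diagonal-covariance Gaussians with occupancy counts. It is not gmtkTie code, and the items and question shown are hypothetical.

    import numpy as np

    # Each item to be clustered is summarised by an occupancy count, a mean vector,
    # and a diagonal variance vector, plus user-supplied tags that questions can
    # ask about (gmtkTie can ask about any such values, not just phone context).

    def pooled_stats(items):
        """Occupancy-weighted pooled mean and diagonal variance of a set of Gaussians."""
        occ = np.array([it["occ"] for it in items])
        means = np.array([it["mean"] for it in items])
        variances = np.array([it["var"] for it in items])
        total = occ.sum()
        mean = (occ[:, None] * means).sum(axis=0) / total
        var = (occ[:, None] * (variances + means ** 2)).sum(axis=0) / total - mean ** 2
        return total, mean, var

    def cluster_log_likelihood(items):
        """Approximate log-likelihood of the cluster's data under one pooled Gaussian."""
        total, _, var = pooled_stats(items)
        return -0.5 * total * np.sum(np.log(2 * np.pi) + 1.0 + np.log(var))

    def split_gain(items, question):
        """Likelihood gain from splitting the cluster with a yes/no question."""
        yes = [it for it in items if question(it)]
        no = [it for it in items if not question(it)]
        if not yes or not no:
            return -np.inf
        return (cluster_log_likelihood(yes) + cluster_log_likelihood(no)
                - cluster_log_likelihood(items))

    # Hypothetical items tagged with a user-supplied feature value ("voiced").
    items = [
        {"occ": 120, "mean": [0.0, 1.0], "var": [1.0, 1.0], "voiced": True},
        {"occ":  80, "mean": [0.2, 1.1], "var": [1.1, 0.9], "voiced": True},
        {"occ": 150, "mean": [2.0, -1.0], "var": [1.0, 1.2], "voiced": False},
    ]
    print(split_gain(items, lambda it: it["voiced"]))   # positive: a useful split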

  38. gmtkTie • gmtkTie is more general than HTK HHEd • HTK asks questions about previous/next phone identity • HTK clusters states only within the same phone • gmtkTie can ask user-supplied questions about user-supplied features: no assumptions about states, triphones, or anything else • gmtkTie clusters user-defined groups of parameters, not just states • gmtkTie can compute cluster sizes and centroids in lots of different ways • GMTK/gmtkTie triphone system built in this workshop is at least as good as HTK system

  39. gmtkTie: conclusions • It works! • Triphone performance at least as good as HTK • Can cluster arbitrary groups of parameters, asking questions about any feature the user can supply • Later in this presentation, we will see an example of separately clustering the Gaussians for two observation streams • Opens up new possibilities for clustering • Much to explore: • Building different decision trees for various factorings of the acoustic observation vector • Asking questions about other contextual factors

  40. Outline • Preliminaries: Dynamic Bayesian networks, feature sets, data, baselines, gmtkTie • Multistream AF-based pronunciation models • For audio-only recognition • For audio-visual recognition • AF-based observation models • Hybrid • Tandem BREAK • Analysis of classifiers and recognizer alignments • Student proposals for future work • Summary and future work

  41. Multistream AF-based pronunciation models presenters: Karen Livescu, Chris Bartels, Nash Borges, Bronwyn Woods

  42. Multi-stream AF-based pronunciation models. • Phone-based: a single phonetic state q generates the observation vector o. • AF-based: multiple states qi (the state of AF i) jointly generate the observation vector o.

  43. Motivation: Pronunciation variation [from data of Greenberg et al. ‘96]. Baseforms and observed surface (actual) pronunciations, with counts:
  probably — baseform p r aa b ax b l iy; surface: (2) p r aa b iy, (1) p r ay, (1) p r aw l uh, (1) p r ah b iy, (1) p r aa l iy, (1) p r aa b uw, (1) p ow ih, (1) p aa iy, (1) p aa b uh b l iy, (1) p aa ah iy
  don’t — baseform d ow n t; surface: (37) d ow n, (16) d ow, (6) ow n, (4) d ow n t, (3) d ow t, (3) d ah n, (3) ow, (2) n ax, (2) d ax n, (1) ax, (1) n uw, ...
  sense — baseform s eh n s; surface: (1) s eh n t s, (1) s ih t s
  everybody — baseform eh v r iy b ah d iy; surface: (1) eh v r ax b ax d iy, (1) eh v er b ah d iy, (1) eh ux b ax iy, (1) eh r uw ay, (1) eh b ah iy

  44. Pronunciation variation and ASR performance • Automatic speech recognition (ASR) is strongly affected by pronunciation variation • Words produced non-canonically are more likely to be mis-recognized [Fosler-Lussier ‘99] • Conversational speech is recognized at twice the error rate of read speech [Weintraub et al. ‘96]

  45. Phone-based pronunciation modeling • Address the pronunciation variation issue by substituting, inserting, or deleting segments: e.g. a [t] insertion rule maps the dictionary form / s eh n s / to the surface form [ s eh n t s ] (a toy sketch follows below) • Such rules suffer from low coverage of conversational pronunciations and sparse data • Partial changes are not well described [Saraclar et al. ‘03] • Increased inter-word confusability
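A toy sketch of the kind of segment-level rewrite illustrated on this slide: a [t] inserted between [n] and [s]. The rule representation is made up for illustration; real systems compile such rules into the dictionary or a finite-state transducer.

    def apply_t_insertion(phones):
        """Insert 't' between every adjacent 'n' and 's' pair."""
        out = []
        for i, p in enumerate(phones):
            out.append(p)
            if p == "n" and i + 1 < len(phones) and phones[i + 1] == "s":
                out.append("t")
        return out

    print(apply_t_insertion("s eh n s".split()))   # ['s', 'eh', 'n', 't', 's']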

  46. Revisiting the examples at the feature level (feature values per tier; in the original table, cells with the same value span adjacent phones):
  Dictionary form, phones s eh n s — GLO: open, critical, open; VEL: closed, open, closed; TB: mid/uvular, mid/palatal, mid/uvular; TT: critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar.
  Surface variant #1, phones s eh n t s (an example of feature asynchrony) — GLO: open, critical, open; VEL: closed, open, closed; TB: mid/uvular, mid/palatal, mid/uvular; TT: critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar.
  Surface variant #2, phones s ih t s (an example of feature asynchrony + substitution) — GLO: open, critical, open; VEL: closed, open, closed; TB: mid/uvular, mid-nar/palatal, mid/uvular; TT: critical/alveolar, mid-nar/alveolar, closed/alveolar, critical/alveolar.

  47. A more complex example: everybody → [ eh r uw ay ]. [Figure: the phonetic transcription aligned against a DBN with, for each feature stream, variables subWordStateL, L, Lsurface and subWordStateT, T, Tsurface.]

  48. Can we take advantage of these intuitions? • In lexical access experiments with oracle feature alignments, yes: • Lexical access accuracy improves significantly using articulatory model with asynchrony and context-independent substitutions [Livescu & Glass ’04] • WS06 goal: Scale up to a complete recognizer • Challenges • Computational complexity • Modeling the relationship between features and noisy acoustic observations

  49. Reminder: phone-based model (note: pronunciation variants not shown). Variables (name: values), repeated from frame 0 to the last frame: word: {“one”, “two”, ...}; wordTransition: {0,1}; subWordState: {0,1,2,...}; stateTransition: {0,1}; phoneState: {w1, w2, w3, s1, s2, s3, ...}; observation.

  50. Recognition with a multistream pronunciation model. [Figure: a two-stream DBN with word, wordTransition, async, and per-stream variables wordTransitionL, subWordStateL, stateTransitionL, phoneStateL, L and wordTransitionT, subWordStateT, stateTransitionT, phoneStateT, T.] • Degree of asynchrony ≡ |subWordStateL − subWordStateT| • Forces synchronization at word boundaries • Allows only asynchrony, no substitutions • Differences from the implemented model: additional feature stream (G), pronunciation variants, word transition bookkeeping. (A sketch of the asynchrony constraint follows below.)
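A sketch of the asynchrony constraint described on this slide: the two streams' sub-word state indices may differ by at most a bound within a word, and the streams must resynchronize at word boundaries. The bound value and the function names are ours; the workshop model additionally has a third stream, pronunciation variants, and word-transition bookkeeping.

    def async_allowed(sub_word_state_1, sub_word_state_2, max_async=1):
        """Degree of asynchrony = |subWordState_1 - subWordState_2|, bounded within a word."""
        return abs(sub_word_state_1 - sub_word_state_2) <= max_async

    def word_transition_allowed(stream_states, final_state):
        """Streams must reach the final sub-word state together before the word can end."""
        return all(s == final_state for s in stream_states)

    print(async_allowed(3, 2), async_allowed(4, 2))   # True False
    print(word_transition_allowed([5, 5], 5))         # True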
