Feature-based Pronunciation Modeling Using Dynamic Bayesian Networks
Karen Livescu
JHU Workshop Planning Meeting, April 16, 2004
Joint work with Jim Glass
Preview
• The problem of pronunciation variation for automatic speech recognition (ASR)
• Traditional methods: phone-based pronunciation modeling
• Proposed approach: pronunciation modeling via multiple sequences of linguistic features
• A natural framework: dynamic Bayesian networks (DBNs)
• A feature-based pronunciation model using DBNs
• Proof-of-concept experiments
• Ongoing/future work
• Integration with SVM feature classifiers
The problem of pronunciation variation
• Conversation excerpts from the Switchboard speech database (audio examples):
  • "neither one of them"
  • "decided"
  • "never really"
  • "probably"
• Noted as an obstacle for ASR (e.g., [McAllester et al. 1998])
The problem of pronunciation variation (2)
• More acute in casual/conversational than in read speech. Observed pronunciations of "probably", with counts:

  p r aa b iy        2
  p r ay             1
  p r aw l uh        1
  p r ah b iy        1
  p r aa lg iy       1
  p r aa b uw        1
  p ow ih            1
  p aa iy            1
  p aa b uh b l iy   1
  p aa ah iy         1
Traditional solution: phone-based pronunciation modeling
• Transformation rules are typically of the form p1 → p2 / p3 __ p4 (where pi may be null)
• E.g. Ø → p / m __ {non-labial}: the [p] insertion rule, which maps dictionary /w ao r m th/ ("warmth") to surface [w ao r m p th]
• Rules are derived from
  • Linguistic knowledge (e.g. [Hazen et al. 2002])
  • Data (e.g. [Riley & Ljolje 1996])
• Powerful, but:
  • Sparse data issues
  • Increased inter-word confusability
  • Some pronunciation changes not well described
  • Limited success in recognition experiments
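The [p] insertion rule above can be sketched in a few lines. This is an illustration only: the function name and the particular non-labial phone subset are assumptions, not the rule machinery used in the systems cited.

```python
# Sketch of a phone rewrite rule p1 -> p2 / p3 __ p4, here the [p] insertion
# rule 0 -> p / m __ {non-labial}. NON_LABIAL is an assumed small subset for
# illustration, not a complete phone class.

NON_LABIAL = {"th", "t", "d", "s", "k"}

def apply_p_insertion(phones):
    """Insert [p] between [m] and a following non-labial phone."""
    out = []
    for i, ph in enumerate(phones):
        out.append(ph)
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if ph == "m" and nxt in NON_LABIAL:
            out.append("p")
    return out

# "warmth": /w ao r m th/ -> [w ao r m p th]
print(apply_p_insertion(["w", "ao", "r", "m", "th"]))
```

Real rule sets are applied as cascades (often compiled into finite-state transducers), with probabilities attached to each optional rule.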
A feature-based approach
• Speech can alternatively be described using sub-phonetic features
• Feature set (from the slide's vocal-tract diagram): LIP-OP, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, VELUM, VOICING
• (This feature set is based on articulatory phonology [Browman & Goldstein 1990])
Feature-based pronunciation modeling
• warmth [w ao r m p th]: dictionary feature tracks (voicing V V V V !V; velum Clo Clo Clo Op Clo; lip opening Nar Mid Mid Clo Mid; …), with lips & velum desynchronizing
• instruments [ih_n s ch em ih_n n s]
• wants [w aa_n t s] -- Phone deletion??
• several [s eh r v ax l] -- Exchange of two phones???
• everybody [eh r uw ay]
Related work
• Much work on classifying features:
  • [King et al. 1998]
  • [Kirchhoff 2002]
  • [Chang, Greenberg, & Wester 2001]
  • [Juneja & Espy-Wilson 2003]
  • [Omar & Hasegawa-Johnson 2002]
  • [Niyogi & Burges 2002]
• Less work on the "non-phonetic" relationship between words and features:
  • [Deng et al. 1997], [Richardson & Bilmes 2000]: "fully-connected" state space via hidden Markov model
  • [Kirchhoff 1996]: features independent, except for synchronization at syllable boundaries
  • [Carson-Berndsen 1998]: bottom-up, constraint-based approach
• Goal: Develop a general feature-based pronunciation model
  • Capable of using known independence assumptions
  • Without overly strong assumptions
Approach: Main Ideas ([HLT/NAACL-2004])
• Begin with the usual assumption: each word has one or more underlying pronunciations, given by a dictionary. E.g. "warmth", as per-frame underlying feature values:

  index        0    1    2    3    4
  voicing      V    V    V    V    !V
  velum        Off  Off  Off  On   Off
  lip opening  Nar  Mid  Mid  Clo  Mid
  ...          ...  ...  ...  ...  ...

• Surface (actual) feature values can stray from underlying values via:
  • Substitution – modeled by confusion matrices P(s|u)
  • Asynchrony
    • Assign an index (counter) to each feature, and allow index values to differ
    • Apply constraints on the difference between the mean indices of feature subsets
• Natural to implement using graphical models, in particular dynamic Bayesian networks (DBNs)
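The asynchrony mechanism can be sketched directly from the description above: each feature stream keeps its own index into the underlying pronunciation, and a configuration is permitted only if the mean indices of two feature subsets differ by at most some bound. The feature grouping and the bound value here are assumptions for illustration; in the actual model the constraint is a probability distribution over asynchrony degrees.

```python
# Sketch of the mean-index asynchrony constraint. Feature names, the grouping
# into subsets, and max_async=1.0 are illustrative assumptions.

def async_degree(indices, group_a, group_b):
    """Absolute difference between the mean indices of two feature subsets."""
    mean_a = sum(indices[f] for f in group_a) / len(group_a)
    mean_b = sum(indices[f] for f in group_b) / len(group_b)
    return abs(mean_a - mean_b)

def allowed(indices, group_a, group_b, max_async=1.0):
    """Is this configuration of per-feature indices permitted?"""
    return async_degree(indices, group_a, group_b) <= max_async

# Lips lagging one position behind the tongue/glottis features:
idx = {"lip-open": 2, "tt-open": 3, "voicing": 3}
print(allowed(idx, ["lip-open"], ["tt-open", "voicing"]))  # True
```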
Aside: Dynamic Bayesian networks
• Bayesian network (BN): directed-graph representation of a distribution over a set of variables
  • Graph node ↔ variable + its distribution given parents
  • Graph edge ↔ "dependency"
• Dynamic Bayesian network (DBN): BN with a repeating structure, one copy per frame (frame i-1, frame i, ...)
• Example: an HMM, with state S and observation O repeated in each frame
• (Slide figures: a toy BN over the variables speaking rate, # questions, lunchtime; an unrolled HMM with S and O per frame)
• Uniform algorithms for (among other things)
  • Finding the most likely values of a subset of the variables, given the rest (analogous to the Viterbi algorithm for HMMs)
  • Learning model parameters via EM
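Since the HMM is the simplest DBN, the "most likely values given the rest" inference can be illustrated with a minimal Viterbi pass. Toy model, toy numbers; this stands in for, but is not, the general DBN inference done by a toolkit like GMTK.

```python
# Minimal Viterbi decoding for an HMM viewed as a two-variable-per-frame DBN
# (hidden state S, observation O). States and probabilities are made up.

def viterbi(obs, states, start, trans, emit):
    """Return the most likely hidden state sequence for an observation list."""
    delta = {s: start[s] * emit[s][obs[0]] for s in states}  # best path prob ending in s
    back = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * trans[r][s])
            delta[s] = prev[best] * trans[best][s] * emit[s][o]
            ptr[s] = best
        back.append(ptr)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

states = ("A", "B")
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
print(viterbi(["x", "y"], states, start, trans, emit))  # ['A', 'B']
```

In a feature-based DBN the same dynamic-programming idea applies, but the per-frame state factors into several feature variables instead of one S.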
Approach: A DBN-based Model
• Example DBN using 3 features; a dictionary variable encodes baseform pronunciations
• (Simplified to show important properties! Implemented model has additional variables.)
• Example substitution confusion matrix P(s|u) (rows = underlying value u, columns = surface value s):

        CLO  CRI  NAR  N-M  MID  …
  CLO   .7   .2   .1   0    0    …
  CRI   0    .7   .2   .1   0    …
  NAR   0    0    .7   .2   .1   …
  …     …    …    …    …    …    …
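The substitution model is just a table lookup: the probability of a surface feature value given the underlying one. A minimal sketch using the lip-opening matrix shown on the slide (function and dictionary names are mine):

```python
# Surface-given-underlying substitution model P(s|u) for one feature,
# using the confusion-matrix values from the slide (truncated to 3 rows).

P_SURFACE = {
    "CLO": {"CLO": 0.7, "CRI": 0.2, "NAR": 0.1, "N-M": 0.0, "MID": 0.0},
    "CRI": {"CLO": 0.0, "CRI": 0.7, "NAR": 0.2, "N-M": 0.1, "MID": 0.0},
    "NAR": {"CLO": 0.0, "CRI": 0.0, "NAR": 0.7, "N-M": 0.2, "MID": 0.1},
}

def p_surface_given_underlying(s, u):
    """P(surface value s | underlying value u)."""
    return P_SURFACE[u][s]

print(p_surface_given_underlying("NAR", "CLO"))  # 0.1: a closure realized as a narrow opening
```

Note the matrix is banded: mass stays on the underlying value and its neighbors, so a feature can only "reduce" to nearby configurations. In training, these rows are exactly the parameters EM re-estimates.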
Approach: A DBN-based Model (2)
• "Unrolled" DBN: the per-frame structure repeated across frames
• Parameter learning via Expectation Maximization (EM)
• Training data:
  • Articulatory databases
  • Detailed phonetic transcriptions
A proof-of-concept experiment
• Task: classify an isolated word from the Switchboard corpus, given a detailed phonetic transcription (from ICSI Berkeley, [Greenberg et al. 1996])
  • Convert the transcription into feature vectors Si, one per 10 ms
  • For each word w in a 3k+ word vocabulary, compute P(w|Si)
  • Output w* = argmaxw P(w|Si)
• Used GMTK [Bilmes & Zweig 2002] for inference and EM parameter training
• Note: the ICSI transcription is somewhere between phones and features—not ideal, but as good as we have
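The experimental loop above reduces to scoring every vocabulary word against the observed frame sequence and taking the argmax. A sketch, with a toy stand-in for the DBN score (the real P(Si|w) computation needs the full model and was done with GMTK):

```python
# Isolated-word classification as argmax over the vocabulary.
# toy_score and CANONICAL are illustrative stand-ins for the DBN likelihood.

def classify(frames, vocabulary, score_word):
    """Return the word maximizing the supplied score function."""
    return max(vocabulary, key=lambda w: score_word(frames, w))

CANONICAL = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}

def toy_score(frames, w):
    """Fraction of frames matching a canonical pronunciation (toy score)."""
    ref = CANONICAL[w]
    return sum(f == r for f, r in zip(frames, ref)) / max(len(ref), len(frames))

print(classify(["y", "eh", "s"], ["yes", "no"], toy_score))  # yes
```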
Results (development set)

  Model                                  Error rate (%)   Failure rate (%)
  Baseforms only (1.7 prons/word)        63.6             61.2
  + phonological rules (4 prons/word)    50.3             47.9
  Synchronous feature-based              35.2             24.8
  Asynchronous feature-based             29.7             16.4
  Asynch. + segmental constraint         32.7             19.4
  Asynch. + segmental constraint + EM    27.8             19.4

• What didn't work?
  • Some deletions ([ax], [t])
  • Vowel retroflexion
  • Alveolar + [y] → palatal
  • (Cross-word effects)
  • (Speech/transcription errors…)
• When did asynchrony matter?
  • Vowel nasalization & rounding
  • Nasal + stop → nasal
  • Some schwa deletions
  • instruments [ih_n s ch em ih_n n s]
  • everybody [eh r uw ay]
Sample Viterbi path: everybody [ eh r uw ay ] (figure)
Ongoing/future work • Trainable synchrony constraints ([ICSLP 2004?]) • Context-dependent distributions for underlying (Ui) and surface (Si) feature values • Extension to more complex tasks (multi-word sequences, larger vocabularies) • Implementation in a complete recognizer (cf. [Eurospeech 2003]) • Articulatory databases for parameter learning/testing • Can we use such a model to learn something about speech?
Integration with feature classifier outputs
• Use (hard) classifier decisions as observations for Si
• Convert classifier scores to posterior probabilities and use as "soft evidence" for Si (connected to the rest of the model)
• Landmark-based classifier outputs to DBN Si's:
  • Convert landmark-based features to one feature vector/frame
  • (Possibly) convert from the SVM feature set to the DBN feature set
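The "soft evidence" option needs classifier scores turned into a distribution over feature values. A minimal sketch using a softmax; the temperature is an assumption, and calibrated sigmoids (e.g. Platt scaling) would be the more standard choice for SVM outputs:

```python
# Convert raw per-value classifier scores into posteriors usable as
# virtual (soft) evidence on a surface feature variable S_i.
# Softmax with a temperature; both are illustrative assumptions.

import math

def scores_to_posteriors(scores, temperature=1.0):
    """Map {value: raw score} to {value: posterior probability}."""
    exps = {v: math.exp(s / temperature) for v, s in scores.items()}
    z = sum(exps.values())
    return {v: e / z for v, e in exps.items()}

post = scores_to_posteriors({"CLO": 2.0, "NAR": 0.5, "MID": -1.0})
print(post)  # most of the mass on CLO
```

The resulting distribution scales the DBN's own P(Si) during inference, rather than clamping Si to a single value as hard decisions would.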
Acknowledgment • Jeff Bilmes, U. Washington
Background: Continuous Speech Recognition
• Given a waveform with acoustic features A, find the most likely word string W*. By Bayes' rule, summing over possible pronunciations U (typically phone strings):

  W* = argmaxW P(W|A) = argmaxW ΣU P(A|U) P(U|W) P(W)

  where P(A|U) is the acoustic model, P(U|W) the pronunciation model, and P(W) the language model
• Assuming U* is much more likely than all other U:

  W* ≈ argmaxW maxU P(A|U) P(U|W) P(W)
Example: "warmth" → "warmpth"
• Phone-based view:
  Brain: Give me a []!
  Lips, tongue, velum, glottis: Right on it, sir!
• (Articulatory) feature-based view:
  Brain: Give me a []!
  Lips: Huh?
  Tongue: Umm…yeah, OK.
  Velum, glottis: Right on it, sir!
Graphical models for hidden feature modeling
• Most ASR approaches use hidden Markov models (HMMs) and/or finite-state transducers (FSTs)
  • Efficient and powerful, but limited: only one state variable per time frame
• Graphical models (GMs) allow for
  • Arbitrary numbers of variables and dependencies
  • Standard algorithms over large classes of models
  • Straightforward mapping between feature-based models and GMs
  • Potentially large reduction in the number of parameters
• GMs for ASR:
  • Zweig (e.g. PhD thesis, 1998), Bilmes (e.g. PhD thesis, 1999), Stephenson (e.g. Eurospeech 2001)
  • Feature-based ASR with GMs suggested by Zweig, but not previously investigated
Background • Brief intro to ASR • Words written in terms of sub-word units, acoustic models compute probability of acoustic (spectral) features given sub-word units or vice versa • Pronunciation model: mapping between words and strings of sub-word units
Possible solution? • Allow every pronunciation in some large database • Unreliable probability estimation due to sparse data • Unseen words • Increased confusability
Phone-based pronunciation modeling (2) • Generalize across words • But: • Data still sparse • Still increased confusability • Some pronunciation changes not well described by phonetic rules • Limited gains in speech recognition experiments
Approach • Begin with usual assumption that each word has one or more “target” pronunciations, given by the dictionary • Model the evolution of multiple feature streams, allowing for: • Feature changes on a frame-by-frame basis • Feature desynchronization • Control of asynchrony—more “synchronous” feature configurations are preferable • Dynamic Bayesian networks (DBNs): Efficient parameterization and computation when state can be factored