Acoustic Landmarks and Articulatory Phonology for Automatic Speech Recognition
Mark Hasegawa-Johnson
Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois), Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT), Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)
What are Landmarks?
• Time-frequency regions of high mutual information between phone and signal, i.e. maxima of I(q; X(t,f)) (sketched below)
• Acoustic events with similar importance in all languages and across all speaking styles
• Acoustic events that can be detected even in extremely noisy environments
Where do these things happen?
• Syllable onset ≈ consonant release
• Syllable nucleus ≈ vowel center
• Syllable coda ≈ consonant closure
(I(q; X(t,f)) experiment: Hasegawa-Johnson, 2000)
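Below is a minimal numpy sketch of the mutual-information criterion behind this definition (not the original experiment's code): estimate I(q; x) from paired samples of a phone label q and one scalar spectrogram cell x, then scan for the (t, f) cells where the estimate peaks. All names are illustrative.

```python
import numpy as np

def mutual_information(q, x, n_bins=16):
    """Estimate I(q; x) in bits from paired samples of a discrete phone
    label q (int array) and a scalar acoustic measurement x, e.g. the
    energy in one (t, f) spectrogram cell across many aligned tokens."""
    edges = np.histogram_bin_edges(x, bins=n_bins)
    xb = np.digitize(x, edges)                 # quantize the measurement
    joint = np.zeros((q.max() + 1, n_bins + 2))
    np.add.at(joint, (q, xb), 1.0)             # joint histogram of (q, x)
    joint /= joint.sum()
    pq = joint.sum(axis=1, keepdims=True)      # marginal over x
    px = joint.sum(axis=0, keepdims=True)      # marginal over q
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pq * px)[nz])))
```

A landmark is then a local maximum of this quantity over time and frequency, measured relative to the landmark's anchor point (release, center, or closure).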
Landmark-Based Speech Recognition
[Figure: a lattice hypothesis "… backed up …" with words, times, and scores; candidate pronunciation variants ("backed up", "backtup", "back up", "backt ihp", "wackt ihp") are aligned to syllable structure: ONSET, NUCLEUS, CODA.]
Outline
• Scientific and technological goals
• Acoustic modeling
  • Speech data and acoustic features
  • Landmark detection
  • Estimation of real-valued "distinctive features" using support vector machines (SVMs)
• Pronunciation modeling
  • Dynamic Bayesian network (DBN)
  • Integration of SVM probability estimates with the DBN
• Technological evaluation
  • Word lattice output from an HMM-based recognizer
  • Rescoring
  • Results so far
• Future plans
Scientific and Technological Goals
• Acoustic: learn precise and generalizable models of the acoustic boundary associated with each distinctive feature…
  • …in an acoustic feature space including representative samples of spectral, phonetic, and auditory features,
  • …with regularized learners that trade off training corpus error against estimated generalization error in a very-high-dimensional model space.
• Phonological: represent a large number of pronunciation variants, in a controlled fashion, by factoring the pronunciation model into distinct articulatory gestures…
  • …by integrating pseudo-probabilistic soft evidence into a Bayesian network.
• Technological: a lattice-rescoring pass that reduces the word error rate of a speech recognizer.
Acoustic and Auditory Features
• MFCCs (see the sketch after this list)
  • 5 ms skip, 25 ms window (standard ASR features)
  • 1 ms skip, 4 ms window (equivalent to computing energy, spectral tilt, and spectral compactness once per millisecond)
• Formant frequencies, once per 5 ms
  • ESPS LPC-based formant frequencies and bandwidths
  • Zheng MUSIC-based formant frequencies, amplitudes, and bandwidths
• Espy-Wilson acoustic parameters: sub-band aperiodicity, sonorancy, and other targeted measures
• Seneff auditory model: mean rate and synchrony
• Shamma rate-place-sweep auditory parameters
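As a concrete (assumed) recipe for the two MFCC configurations above, here is a librosa sketch at a 16 kHz sampling rate; the file name is hypothetical, and librosa is one of several equivalent front ends.

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file

# Standard ASR configuration: 25 ms window, 5 ms skip.
mfcc_coarse = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13, n_fft=512,
    win_length=int(0.025 * sr), hop_length=int(0.005 * sr))

# High-rate configuration: 4 ms window, 1 ms skip, capturing energy,
# spectral tilt, and spectral compactness once per millisecond.
mfcc_fine = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13, n_fft=256,
    win_length=int(0.004 * sr), hop_length=int(0.001 * sr))
```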
What are Distinctive Features? (An Engineer's Definition)
• Distinctive feature = a binary partition of the phonemes
• Landmark = a change in the value of an "articulator-free feature" (a.k.a. manner feature); a toy example follows this list
  • +speech to –speech, –speech to +speech
  • consonantal, continuant, sonorant, syllabic
• "Articulator-bound features" (place and voicing): SVMs are trained only at landmarks
  • Primary articulator: lips, tongue blade, or tongue body
  • Features of the primary articulator: anterior, strident
  • Features of the secondary articulator: voiced
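A toy illustration of this definition (feature values taken from textbook phonology, not from the project's lexicon): each phone is a vector of binary features, and a landmark falls wherever a manner feature flips between adjacent phones.

```python
FEATURES = ["consonantal", "continuant", "sonorant", "syllabic", "voiced"]
MANNER = FEATURES[:4]    # the articulator-free (manner) features

PHONES = {
    #       cons  cont  son   syll  voiced
    "b":  ( +1,   -1,   -1,   -1,   +1),
    "s":  ( +1,   +1,   -1,   -1,   -1),
    "m":  ( +1,   -1,   +1,   -1,   +1),
    "iy": ( -1,   +1,   +1,   +1,   +1),
}

def landmarks_between(prev, cur):
    """Manner features whose value changes at the phone boundary."""
    return [f for f in MANNER
            if PHONES[prev][FEATURES.index(f)] != PHONES[cur][FEATURES.index(f)]]

# E.g. landmarks_between("s", "iy") flags consonantal, sonorant, and
# syllabic changes: a fricative-to-vowel (consonant release) landmark.
```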
Landmark Detection using Support Vector Machines (SVMs)
[Figure: false acceptance vs. false rejection errors, TIMIT, per 10 ms frame.]
An SVM stop-release detector achieves half the error of an HMM:
1. Delta-energy ("Deriv"): equal error rate = 0.2%
2. HMM: false rejection error = 0.3%
3. Linear SVM: EER = 0.15%
4. Radial basis function SVM: EER = 0.13%
(Niyogi & Burges, 1999, 2002)
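A sketch of such a frame-level detector using scikit-learn, in the spirit of the Niyogi and Burges result (this is not their implementation, and the data file names are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

# One acoustic feature vector per 10 ms frame; label 1 at hand-marked
# stop-release frames, 0 elsewhere.
X_train, y_train = np.load("frames_train.npy"), np.load("labels_train.npy")
X_test = np.load("frames_test.npy")

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# Sweeping a threshold on the decision function trades false acceptances
# against false rejections; the equal error rate is read off where the
# two error curves cross.
scores = rbf_svm.decision_function(X_test)
detections = scores > 0.0   # shift the threshold to move along the curve
```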
SVM-detected landmarks are "smoothed" with dynamic programming
• Maximize ∏_i p(features(t_i) | X(t_i)) · p(t_{i+1} − t_i | features(t_i)); see the DP sketch below
• Forced-alignment mode: computes p(word | acoustics); this is a speech recognizer!
• Soft-decision "smoothing" mode: computes p(acoustics | landmarks), which is fed to the pronunciation model
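A minimal dynamic-programming sketch of that objective, assuming each candidate frame's feature posterior has already been collapsed to one log-probability per frame (the actual system decodes feature values jointly):

```python
import numpy as np

def smooth_landmarks(log_obs, log_dur, max_gap):
    """Pick landmark times maximizing sum_i [log p(features(t_i) | X(t_i))
    + log p(t_{i+1} - t_i | features(t_i))], the log of the slide's product.

    log_obs[t] : log-posterior of the hypothesized features at frame t
    log_dur[d] : log-probability of a gap of d frames, d = 0 .. max_gap
    """
    T = len(log_obs)
    best = np.full(T, -np.inf)
    back = np.full(T, -1, dtype=int)
    best[0] = log_obs[0]
    for t in range(1, T):
        for d in range(1, min(t, max_gap) + 1):
            cand = best[t - d] + log_dur[d] + log_obs[t]
            if cand > best[t]:
                best[t], back[t] = cand, t - d
    path, t = [], T - 1          # trace back from the final frame
    while t >= 0:
        path.append(t)
        t = back[t]
    return path[::-1]
```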
Place of Articulation
Cued by the WHOLE PATTERN of spectral change over time within 150 ms of a landmark.
Soft-Decision Distinctive Feature Probabilities
• Kernel: transform to an infinite-dimensional Hilbert space
• The SVM extracts a discriminant dimension: argmin(error(margin) + 1/width(margin))
• Posterior PDF = sigmoid model in the discriminant dimension (Niyogi & Burges, 2002)
• An equivalent model: likelihoods = Gaussian in the discriminant dimension
Soft Decisions Once per 5 ms
• p(manner feature d(t) | Y(t))
• p(place feature d(t) | Y(t), t is a landmark)
Processing chain: 2000-dimensional acoustic feature vector → SVM discriminant y_i(t) → sigmoid or histogram → posterior probability of the distinctive feature, p(d_i(t) = 1 | y_i(t))
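The sigmoid mapping named here is standard Platt scaling; a compact sketch of fitting it to held-out SVM discriminant outputs (scipy used purely for convenience):

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(y_scores, labels):
    """Fit p(d = 1 | y) = 1 / (1 + exp(A*y + B)) to SVM discriminant
    outputs y_scores with binary labels in {0, 1} (Platt scaling)."""
    def nll(params):
        A, B = params
        p = np.clip(1.0 / (1.0 + np.exp(A * y_scores + B)), 1e-12, 1 - 1e-12)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return minimize(nll, x0=[-1.0, 0.0]).x   # returns fitted (A, B)
```

At test time, each frame's discriminant value y_i(t) passes through this sigmoid to give p(d_i(t) = 1 | y_i(t)).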
Distinctive-Feature Based Lexicon
• Merger of the English Switchboard and Callhome dictionaries
• Converted to landmarks using Hasegawa-Johnson's Perl transcription tools
Example entries (landmark segments separated by "|"; each segment lists manner-change landmarks followed by place and voicing features):
AGO (0.441765): +syllabic +reduced +back AX | +–continuant +–sonorant +velar +voiced G closure | –+continuant –+sonorant +velar +voiced G release | +syllabic –low –high +back +round +tense OW
AGO (0.294118): +syllabic +reduced –back IX | –+continuant –+sonorant +velar +voiced G closure | –+continuant –+sonorant +velar +voiced G release | +syllabic –low –high +back +round +tense OW
DBN Model: A Bit More Detail
• word_t: word ID at frame t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i,j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator i
• S_t^i: surface setting of articulator i
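To make the variable roles concrete, here is a heavily simplified, assumed scoring function for one frame, with the conditional probability tables represented as plain Python containers (the real model learns these as DBN parameters):

```python
import numpy as np

def frame_log_prob(canonical, ind, surface, asyncs, p_subst, p_async):
    """Schematic per-frame score for the articulatory DBN.

    canonical[i][k] : k-th gesture of articulator i in the word model
    ind[i]          : gesture index articulator i implements (ind_t^i)
    surface[i]      : surface setting of articulator i (S_t^i)
    asyncs[(i, j)]  : degree of asynchrony of streams i, j (async_t^{i,j})
    p_subst[u][s]   : prob. of realizing canonical setting u as surface s
    p_async[a]      : prob. of asynchrony degree a
    """
    lp = 0.0
    for i, k in enumerate(ind):
        u = canonical[i][k]                    # canonical setting U_t^i
        lp += np.log(p_subst[u][surface[i]])   # substitution / reduction
    for pair, a in asyncs.items():
        lp += np.log(p_async[a])               # soft penalty on desynchrony
    return lp
```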
DBN-SVM Hybrid Model
[Figure: for the word sequence "LIKE A", a canonical form (tongue closed → tongue mid → tongue front → tongue open) maps to a surface form (tongue front, semi-closed → tongue front → tongue open), which determines manner (glide → front vowel) and place (palatal) features; these connect to SVM outputs p(g_PGR(x) | palatal glide release) and p(g_GR(x) | glide release), where x is a multi-frame observation including spectrum, formants, and auditory model.]
Acoustic Feature Selection: MFCCs, Formants, Rate-Scale
1. Accuracy per frame, stop releases only, NTIMIT
2. Word error rate, lattice rescoring, RT03-devel, one talker (warning: this talker is atypical)
   • Baseline: 15.0% (113/755)
   • Rescoring, place based on MFCCs + formant-based parameters: 14.6% (110/755)
   • Rescoring, place based on rate-scale + formant-based parameters: 14.3% (108/755)
Evaluation: Lattice Rescoring
1. The first-pass system (a hybrid ANN-HMM at SRI) produces a word lattice.
2. The landmark-based recognizer computes a score for each word in the lattice.
3. First-pass and second-pass scores are linearly combined, and the best path is computed.
[Figure: word hypotheses and times from the first-pass ASR (e.g., "oh_and" vs. "our", "landmark based best") are scored by the landmark acoustic model and the DBN pronunciation model, yielding landmark-based acoustic scores and pronunciation probabilities, which are combined to produce the recognized words.]
Lattice Rescoring
• Basic rescoring method (see the sketch below):
  word_score = a · (HMM acoustic model score) + b · (N-gram language model score) + c · (duration scores, word insertion penalties, etc.) + d · (landmark-based recognition score)
• Recognized word sequence = the word sequence with the highest Σ_i word_score_i
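A direct transcription of this combination into Python (the dictionary keys are hypothetical; in practice a, b, c, d are tuned on held-out lattices):

```python
def word_score(hyp, a, b, c, d):
    """Linear score combination from the slide, for one word hypothesis."""
    return (a * hyp["hmm_acoustic"]
            + b * hyp["ngram_lm"]
            + c * hyp["duration_and_penalties"]
            + d * hyp["landmark_score"])

def best_path(lattice_paths, weights):
    """Recognized word sequence = path with the highest summed score."""
    return max(lattice_paths,
               key=lambda path: sum(word_score(w, *weights) for w in path))
```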
Maximum Entropy Estimation of Stream Weights for Lattice Rescoring
The λ (stream weight) parameters are estimated so that:
1. The expected feature values implied by the distribution p(hyp | obs) match the observed frequencies in the training data.
2. The entropy of p(hyp | obs) is maximized, subject to constraint 1.
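These two constraints define a conditional maximum-entropy (log-linear) model, whose training gradient is exactly "observed minus expected feature values"; a sketch, assuming each pinched lattice segment offers N competing hypotheses described by K stream scores:

```python
import numpy as np

def maxent_posteriors(stream_scores, lam):
    """p(hyp | obs) proportional to exp(sum_k lambda_k * f_k(hyp, obs)).
    stream_scores: (N, K) array, one row per hypothesis, one column per
    knowledge stream (HMM acoustic, LM, landmark score, ...)."""
    logits = stream_scores @ lam
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def log_likelihood_gradient(stream_scores, lam, true_idx):
    """Observed minus expected feature values; zero exactly when the
    matching constraint on this slide is satisfied."""
    p = maxent_posteriors(stream_scores, lam)
    return stream_scores[true_idx] - p @ stream_scores
```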
Word Error Rate: Current Results
• Training lattices: half of the RT03 development test corpus (DARPA/NIST Rich Transcription task, 2003)
  • Baseline word error rate, pinched lattices: 23.5%
  • Rescored word error rate: 20.0%
  • WER reduction: 3.5% absolute (about 15% relative)
• Development test lattices: the other half of RT03-devel
  • Baseline word error rate, pinched lattices: 24.1%
  • Rescored word error rate: 24.1%
  • WER reduction: 12 words total (not quite 0.1%)
Conclusions
• Acoustic modeling
  • Target problem: a 2000-dimensional observation space
  • Method: a regularized learner (SVM) that explicitly controls the tradeoff between training error and generalization error
  • Resulting constraints:
    • The choice of binary distinctions is important: choose distinctive features
    • The choice of time alignment is important: train place SVMs at landmarks
• Lexical modeling
  • Target problem: increase the flexibility of the pronunciation model without over-generating pronunciation variants
  • Method: factor the probability of pronunciation variants into misalignment and reduction probabilities of 5 hidden articulators
  • Resulting constraints:
    • The choice of factors is important: choose articulatory factors
    • Integration of SVMs into the Bayesian model is an interesting problem
• Lattice rescoring
  • Target problem: integrate word-level side information into a lattice
  • Method: maximum entropy optimization of stream weights
Future Plans
• Further refinement of the SVM-DBN hybrid system
• Systems intermediate between HMM and SVM-DBN will be developed and tested (e.g., hybrid SVM-HMM systems)
• Progressively improved acoustic classifiers will be tested in both the MaxEnt and the DBN+SVM systems
• Maximum entropy lattice rescoring will be tested with prosodic, syntactic, and other word-level side information
• Mathematical analysis will study DBN+SVM integration in both training and test
Lattice Pinching & Word Alignment
Pinch to the MAP path: convert the "lattice rescoring problem" into a "choose one of N" problem.
[Figure: a word lattice containing the competing hypotheses "landmark based best" and "oh_and our" is pinched into aligned segments. Segment 1; Segment 2: onset +lateral? two syllables? coda +body?; Segment 3: nucleus +high?]