Acoustic Landmarks and Articulatory Phonology for Automatic Speech Recognition
Mark Hasegawa-Johnson
Research performed in collaboration with James Baker (Carnegie Mellon), Sarah Borys (Illinois), Ken Chen (Illinois), Emily Coogan (Illinois), Steven Greenberg (Berkeley), Amit Juneja (Maryland), Katrin Kirchhoff (Washington), Karen Livescu (MIT), Srividya Mohan (Johns Hopkins), Jen Muller (Dept. of Defense), Kemal Sonmez (SRI), and Tianyu Wang (Georgia Tech)
What are Landmarks?
• Time-frequency regions of high mutual information between phone and signal, i.e. maxima of I(q; X(t,f)) (sketched below)
• Acoustic events with similar importance in all languages and across all speaking styles
• Acoustic events that can be detected even in extremely noisy environments
Where do these things happen?
• Syllable onset ≈ consonant release
• Syllable nucleus ≈ vowel center
• Syllable coda ≈ consonant closure
(I(q; X(t,f)) experiment: Hasegawa-Johnson, 2000)
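Below is a minimal numpy sketch of the mutual-information criterion behind this definition (not the original experiment's code): estimate I(q; x) from paired samples of a phone label q and one scalar spectrogram cell x, then scan for the (t, f) cells where the estimate peaks. All names are illustrative.

```python
import numpy as np

def mutual_information(q, x, n_bins=16):
    """Estimate I(q; x) in bits from paired samples of a discrete phone
    label q (int array) and a scalar acoustic measurement x, e.g. the
    energy in one (t, f) spectrogram cell across many aligned tokens."""
    edges = np.histogram_bin_edges(x, bins=n_bins)
    xb = np.digitize(x, edges)                 # quantize the measurement
    joint = np.zeros((q.max() + 1, n_bins + 2))
    np.add.at(joint, (q, xb), 1.0)             # joint histogram of (q, x)
    joint /= joint.sum()
    pq = joint.sum(axis=1, keepdims=True)      # marginal over x
    px = joint.sum(axis=0, keepdims=True)      # marginal over q
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pq * px)[nz])))
```

A landmark is then a local maximum of this quantity over time and frequency, measured relative to the landmark's anchor point (release, center, or closure).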
Landmark-Based Speech Recognition
[Figure: a lattice hypothesis "… backed up …" with words, times, and scores; candidate pronunciation variants ("backed up", "backtup", "back up", "backt ihp", "wackt ihp") are aligned to syllable structure: ONSET, NUCLEUS, CODA.]
Outline
• Scientific and technological goals
• Acoustic modeling
  • Speech data and acoustic features
  • Landmark detection
  • Estimation of real-valued "distinctive features" using support vector machines (SVMs)
• Pronunciation modeling
  • Dynamic Bayesian network (DBN)
  • Integration of SVM probability estimates with the DBN
• Technological evaluation
  • Word lattice output from an HMM-based recognizer
  • Rescoring
  • Results so far
• Future plans
Scientific and Technological Goals
• Acoustic: learn precise and generalizable models of the acoustic boundary associated with each distinctive feature…
  • …in an acoustic feature space including representative samples of spectral, phonetic, and auditory features,
  • …with regularized learners that trade off training corpus error against estimated generalization error in a very-high-dimensional model space.
• Phonological: represent a large number of pronunciation variants, in a controlled fashion, by factoring the pronunciation model into distinct articulatory gestures…
  • …by integrating pseudo-probabilistic soft evidence into a Bayesian network.
• Technological: a lattice-rescoring pass that reduces the word error rate of a speech recognizer.
Acoustic and Auditory Features
• MFCCs (see the sketch after this list)
  • 5 ms skip, 25 ms window (standard ASR features)
  • 1 ms skip, 4 ms window (equivalent to computing energy, spectral tilt, and spectral compactness once per millisecond)
• Formant frequencies, once per 5 ms
  • ESPS LPC-based formant frequencies and bandwidths
  • Zheng MUSIC-based formant frequencies, amplitudes, and bandwidths
• Espy-Wilson acoustic parameters: sub-band aperiodicity, sonorancy, and other targeted measures
• Seneff auditory model: mean rate and synchrony
• Shamma rate-place-sweep auditory parameters
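As a concrete (assumed) recipe for the two MFCC configurations above, here is a librosa sketch at a 16 kHz sampling rate; the file name is hypothetical, and librosa is one of several equivalent front ends.

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file

# Standard ASR configuration: 25 ms window, 5 ms skip.
mfcc_coarse = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13, n_fft=512,
    win_length=int(0.025 * sr), hop_length=int(0.005 * sr))

# High-rate configuration: 4 ms window, 1 ms skip, capturing energy,
# spectral tilt, and spectral compactness once per millisecond.
mfcc_fine = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13, n_fft=256,
    win_length=int(0.004 * sr), hop_length=int(0.001 * sr))
```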
What are Distinctive Features? (An Engineer's Definition)
• Distinctive feature = a binary partition of the phonemes
• Landmark = a change in the value of an "articulator-free feature" (a.k.a. manner feature); a toy example follows this list
  • +speech to –speech, –speech to +speech
  • consonantal, continuant, sonorant, syllabic
• "Articulator-bound features" (place and voicing): SVMs are trained only at landmarks
  • Primary articulator: lips, tongue blade, or tongue body
  • Features of the primary articulator: anterior, strident
  • Features of the secondary articulator: voiced
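A toy illustration of this definition (feature values taken from textbook phonology, not from the project's lexicon): each phone is a vector of binary features, and a landmark falls wherever a manner feature flips between adjacent phones.

```python
FEATURES = ["consonantal", "continuant", "sonorant", "syllabic", "voiced"]
MANNER = FEATURES[:4]    # the articulator-free (manner) features

PHONES = {
    #       cons  cont  son   syll  voiced
    "b":  ( +1,   -1,   -1,   -1,   +1),
    "s":  ( +1,   +1,   -1,   -1,   -1),
    "m":  ( +1,   -1,   +1,   -1,   +1),
    "iy": ( -1,   +1,   +1,   +1,   +1),
}

def landmarks_between(prev, cur):
    """Manner features whose value changes at the phone boundary."""
    return [f for f in MANNER
            if PHONES[prev][FEATURES.index(f)] != PHONES[cur][FEATURES.index(f)]]

# E.g. landmarks_between("s", "iy") flags consonantal, sonorant, and
# syllabic changes: a fricative-to-vowel (consonant release) landmark.
```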
Landmark Detection using Support Vector Machines (SVMs)
[Figure: false acceptance vs. false rejection errors, TIMIT, per 10 ms frame.]
An SVM stop-release detector achieves half the error of an HMM:
1. Delta-energy ("Deriv"): equal error rate = 0.2%
2. HMM: false rejection error = 0.3%
3. Linear SVM: EER = 0.15%
4. Radial basis function SVM: EER = 0.13%
(Niyogi & Burges, 1999, 2002)
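A sketch of such a frame-level detector using scikit-learn, in the spirit of the Niyogi and Burges result (this is not their implementation, and the data file names are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

# One acoustic feature vector per 10 ms frame; label 1 at hand-marked
# stop-release frames, 0 elsewhere.
X_train, y_train = np.load("frames_train.npy"), np.load("labels_train.npy")
X_test = np.load("frames_test.npy")

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# Sweeping a threshold on the decision function trades false acceptances
# against false rejections; the equal error rate is read off where the
# two error curves cross.
scores = rbf_svm.decision_function(X_test)
detections = scores > 0.0   # shift the threshold to move along the curve
```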
SVM-detected landmarks are "smoothed" with dynamic programming
• Maximize ∏_i p(features(t_i) | X(t_i)) · p(t_{i+1} − t_i | features(t_i)); see the DP sketch below
• Forced-alignment mode: computes p(word | acoustics); this is a speech recognizer!
• Soft-decision "smoothing" mode: computes p(acoustics | landmarks), which is fed to the pronunciation model
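A minimal dynamic-programming sketch of that objective, assuming each candidate frame's feature posterior has already been collapsed to one log-probability per frame (the actual system decodes feature values jointly):

```python
import numpy as np

def smooth_landmarks(log_obs, log_dur, max_gap):
    """Pick landmark times maximizing sum_i [log p(features(t_i) | X(t_i))
    + log p(t_{i+1} - t_i | features(t_i))], the log of the slide's product.

    log_obs[t] : log-posterior of the hypothesized features at frame t
    log_dur[d] : log-probability of a gap of d frames, d = 0 .. max_gap
    """
    T = len(log_obs)
    best = np.full(T, -np.inf)
    back = np.full(T, -1, dtype=int)
    best[0] = log_obs[0]
    for t in range(1, T):
        for d in range(1, min(t, max_gap) + 1):
            cand = best[t - d] + log_dur[d] + log_obs[t]
            if cand > best[t]:
                best[t], back[t] = cand, t - d
    path, t = [], T - 1          # trace back from the final frame
    while t >= 0:
        path.append(t)
        t = back[t]
    return path[::-1]
```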
Place of Articulation
Cued by the WHOLE PATTERN of spectral change over time within 150 ms of a landmark.
Soft-Decision Distinctive Feature Probabilities
• Kernel: transform to an infinite-dimensional Hilbert space
• The SVM extracts a discriminant dimension: argmin(error(margin) + 1/width(margin))
• Posterior PDF = sigmoid model in the discriminant dimension (Niyogi & Burges, 2002)
• An equivalent model: likelihoods = Gaussian in the discriminant dimension
Soft Decisions Once per 5 ms
• p(manner feature d(t) | Y(t))
• p(place feature d(t) | Y(t), t is a landmark)
Processing chain: 2000-dimensional acoustic feature vector → SVM discriminant y_i(t) → sigmoid or histogram → posterior probability of the distinctive feature, p(d_i(t) = 1 | y_i(t))
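The sigmoid mapping named here is standard Platt scaling; a compact sketch of fitting it to held-out SVM discriminant outputs (scipy used purely for convenience):

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt_sigmoid(y_scores, labels):
    """Fit p(d = 1 | y) = 1 / (1 + exp(A*y + B)) to SVM discriminant
    outputs y_scores with binary labels in {0, 1} (Platt scaling)."""
    def nll(params):
        A, B = params
        p = np.clip(1.0 / (1.0 + np.exp(A * y_scores + B)), 1e-12, 1 - 1e-12)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return minimize(nll, x0=[-1.0, 0.0]).x   # returns fitted (A, B)
```

At test time, each frame's discriminant value y_i(t) passes through this sigmoid to give p(d_i(t) = 1 | y_i(t)).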
Distinctive-Feature Based Lexicon
• Merger of the English Switchboard and Callhome dictionaries
• Converted to landmarks using Hasegawa-Johnson's Perl transcription tools
Example entries (landmark segments separated by "|"; each segment lists manner-change landmarks followed by place and voicing features):
AGO (0.441765): +syllabic +reduced +back AX | +–continuant +–sonorant +velar +voiced G closure | –+continuant –+sonorant +velar +voiced G release | +syllabic –low –high +back +round +tense OW
AGO (0.294118): +syllabic +reduced –back IX | –+continuant –+sonorant +velar +voiced G closure | –+continuant –+sonorant +velar +voiced G release | +syllabic –low –high +back +round +tense OW
DBN Model: A Bit More Detail
• word_t: word ID at frame t
• wdTr_t: word transition?
• ind_t^i: which gesture, from the canonical word model, should articulator i be trying to implement?
• async_t^{i,j}: how asynchronous are articulators i and j?
• U_t^i: canonical setting of articulator i
• S_t^i: surface setting of articulator i
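To make the variable roles concrete, here is a heavily simplified, assumed scoring function for one frame, with the conditional probability tables represented as plain Python containers (the real model learns these as DBN parameters):

```python
import numpy as np

def frame_log_prob(canonical, ind, surface, asyncs, p_subst, p_async):
    """Schematic per-frame score for the articulatory DBN.

    canonical[i][k] : k-th gesture of articulator i in the word model
    ind[i]          : gesture index articulator i implements (ind_t^i)
    surface[i]      : surface setting of articulator i (S_t^i)
    asyncs[(i, j)]  : degree of asynchrony of streams i, j (async_t^{i,j})
    p_subst[u][s]   : prob. of realizing canonical setting u as surface s
    p_async[a]      : prob. of asynchrony degree a
    """
    lp = 0.0
    for i, k in enumerate(ind):
        u = canonical[i][k]                    # canonical setting U_t^i
        lp += np.log(p_subst[u][surface[i]])   # substitution / reduction
    for pair, a in asyncs.items():
        lp += np.log(p_async[a])               # soft penalty on desynchrony
    return lp
```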
DBN-SVM Hybrid Model
[Figure: for the word sequence "LIKE A", a canonical form (tongue closed → tongue mid → tongue front → tongue open) maps to a surface form (tongue front, semi-closed → tongue front → tongue open), which determines manner (glide → front vowel) and place (palatal) features; these connect to SVM outputs p(g_PGR(x) | palatal glide release) and p(g_GR(x) | glide release), where x is a multi-frame observation including spectrum, formants, and auditory model.]
Acoustic Feature Selection: MFCCs, Formants, Rate-Scale
1. Accuracy per frame, stop releases only, NTIMIT
2. Word error rate, lattice rescoring, RT03-devel, one talker (warning: this talker is atypical)
   • Baseline: 15.0% (113/755)
   • Rescoring, place based on MFCCs + formant-based parameters: 14.6% (110/755)
   • Rescoring, place based on rate-scale + formant-based parameters: 14.3% (108/755)
Evaluation: Lattice Rescoring
1. The first-pass system (a hybrid ANN-HMM at SRI) produces a word lattice.
2. The landmark-based recognizer computes a score for each word in the lattice.
3. First-pass and second-pass scores are linearly combined, and the best path is computed.
[Figure: word hypotheses and times from the first-pass ASR (e.g., "oh_and" vs. "our", "landmark based best") are scored by the landmark acoustic model and the DBN pronunciation model, yielding landmark-based acoustic scores and pronunciation probabilities, which are combined to produce the recognized words.]
Lattice Rescoring
• Basic rescoring method (see the sketch below):
  word_score = a · (HMM acoustic model score) + b · (N-gram language model score) + c · (duration scores, word insertion penalties, etc.) + d · (landmark-based recognition score)
• Recognized word sequence = the word sequence with the highest Σ_i word_score_i
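A direct transcription of this combination into Python (the dictionary keys are hypothetical; in practice a, b, c, d are tuned on held-out lattices):

```python
def word_score(hyp, a, b, c, d):
    """Linear score combination from the slide, for one word hypothesis."""
    return (a * hyp["hmm_acoustic"]
            + b * hyp["ngram_lm"]
            + c * hyp["duration_and_penalties"]
            + d * hyp["landmark_score"])

def best_path(lattice_paths, weights):
    """Recognized word sequence = path with the highest summed score."""
    return max(lattice_paths,
               key=lambda path: sum(word_score(w, *weights) for w in path))
```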
Maximum Entropy Estimation of Stream Weights for Lattice Rescoring
The λ (stream weight) parameters are estimated so that:
1. The expected feature values implied by the distribution p(hyp | obs) match the observed frequencies in the training data.
2. The entropy of p(hyp | obs) is maximized, subject to constraint 1.
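These two constraints define a conditional maximum-entropy (log-linear) model, whose training gradient is exactly "observed minus expected feature values"; a sketch, assuming each pinched lattice segment offers N competing hypotheses described by K stream scores:

```python
import numpy as np

def maxent_posteriors(stream_scores, lam):
    """p(hyp | obs) proportional to exp(sum_k lambda_k * f_k(hyp, obs)).
    stream_scores: (N, K) array, one row per hypothesis, one column per
    knowledge stream (HMM acoustic, LM, landmark score, ...)."""
    logits = stream_scores @ lam
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def log_likelihood_gradient(stream_scores, lam, true_idx):
    """Observed minus expected feature values; zero exactly when the
    matching constraint on this slide is satisfied."""
    p = maxent_posteriors(stream_scores, lam)
    return stream_scores[true_idx] - p @ stream_scores
```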
Word Error Rate: Current Results
• Training lattices: half of the RT03 development test corpus (DARPA/NIST Rich Transcription task, 2003)
  • Baseline word error rate, pinched lattices: 23.5%
  • Rescored word error rate: 20.0%
  • WER reduction: 3.5% absolute (about 15% relative)
• Development test lattices: the other half of RT03-devel
  • Baseline word error rate, pinched lattices: 24.1%
  • Rescored word error rate: 24.1%
  • WER reduction: 12 words total (not quite 0.1%)
Conclusions
• Acoustic modeling
  • Target problem: a 2000-dimensional observation space
  • Method: a regularized learner (SVM) that explicitly controls the tradeoff between training error and generalization error
  • Resulting constraints:
    • The choice of binary distinctions is important: choose distinctive features
    • The choice of time alignment is important: train place SVMs at landmarks
• Lexical modeling
  • Target problem: increase the flexibility of the pronunciation model without over-generating pronunciation variants
  • Method: factor the probability of pronunciation variants into misalignment and reduction probabilities of 5 hidden articulators
  • Resulting constraints:
    • The choice of factors is important: choose articulatory factors
    • Integration of SVMs into the Bayesian model is an interesting problem
• Lattice rescoring
  • Target problem: integrate word-level side information into a lattice
  • Method: maximum entropy optimization of stream weights
Future Plans
• Further refinement of the SVM-DBN hybrid system
• Systems intermediate between HMM and SVM-DBN will be developed and tested (e.g., hybrid SVM-HMM systems)
• Progressively improved acoustic classifiers will be tested in both the MaxEnt and the DBN+SVM systems
• Maximum entropy lattice rescoring will be tested with prosodic, syntactic, and other word-level side information
• Mathematical analysis will study DBN+SVM integration in both training and test
Lattice Pinching & Word Alignment
Pinch to the MAP path: convert the "lattice rescoring problem" into a "choose one of N" problem.
[Figure: a word lattice containing the competing hypotheses "landmark based best" and "oh_and our" is pinched into aligned segments. Segment 1; Segment 2: onset +lateral? two syllables? coda +body?; Segment 3: nucleus +high?]