Incorporating Acoustic Context in Automatic Speech Recognition Nelson Morgan, ICSI and UC Berkeley
Talk Outline • What I mean by “Acoustic Context” • Dynamic Features • Temporal Filtering • Multi-frame analysis • Multi-stream analysis • Hidden Dynamic Models • Conclusions
1952 Bell Labs Digits • Possibly the first word (digit) recognizer • Approximated energy in formants over word • Insensitive to amplitude, timing variation • Clumsy technology
Digit Patterns. Figure: the spoken digit feeds two parallel paths, an HP filter (1 kHz) and an LP filter (800 Hz), each followed by a limiting amplifier and an axis-crossing counter; the resulting counts form digit patterns plotted on a plane with the low-band axis spanning roughly 200-800 Hz and the high-band axis extending to about 3 kHz.
Processing the Streams • Hard limiting and counting for each • Simple form of acoustic context: evaluating features for entire word • Replaced by framewise analysis
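A minimal sketch of the axis-crossing idea in modern terms (the filter orders, cutoffs, and scipy-based implementation are my assumptions, not the original circuit): split the signal into low and high bands, hard-limit each, and count axis crossings over the whole word.

```python
import numpy as np
from scipy.signal import butter, lfilter

def axis_crossings(x):
    """Count zero (axis) crossings of a hard-limited signal."""
    s = np.sign(x)
    s[s == 0] = 1                          # treat exact zeros as positive
    return int(np.sum(s[1:] != s[:-1]))

def digit_features(signal, fs=8000):
    """Two word-level features: axis crossings in a low band (<800 Hz)
    and a high band (>1 kHz), loosely following the 1952 design."""
    b_lo, a_lo = butter(2, 800, btype='low', fs=fs)
    b_hi, a_hi = butter(2, 1000, btype='high', fs=fs)
    low = lfilter(b_lo, a_lo, signal)
    high = lfilter(b_hi, a_hi, signal)
    return axis_crossings(low), axis_crossings(high)

# Example: a synthetic "vowel" with energy near 300 Hz and 2.3 kHz
fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t)
print(digit_features(x, fs))
```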
Framewise Analysis of Speech. Figure: the waveform is divided into successive frames (Frame 1, Frame 2, ...), each mapped to a feature vector (X1, X2, ...).
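As a rough sketch of framewise analysis (the 25 ms frame length, 10 ms hop, and toy features are illustrative assumptions), each frame is mapped to one feature vector:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Slice a waveform into overlapping frames (one frame per row)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def frame_features(frames):
    """One toy feature vector per frame: log energy and zero-crossing rate."""
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr], axis=1)     # shape (n_frames, 2)

fs = 16000
x = np.random.randn(fs)                        # one second of stand-in "speech"
frames = frame_signal(x, fs)                   # shape (n_frames, 400)
feats = frame_features(frames)                 # one feature vector per frame
```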
Acoustic Context • Evaluating costs based on observations over time intervals greater than the "typical" frame (20-30 ms) • Not the same as phonetic context (models for sound units in particular contexts)
Statistical ASR • i_best = argmax_i P(M_i | X) = argmax_i P(X | M_i) P(M_i) • P(X | M_i) ≈ P(X | Q_i) [Viterbi approx.], where Q_i is the best state sequence in M_i • P(X | Q_i) is approximated by a product of local likelihoods (Markov, conditional independence assumptions)
Markov model (states q_1, q_2): P(x_1, x_2 | q_1, q_2) = P(q_1) P(x_1 | q_1) P(q_2 | q_1) P(x_2 | q_2)
Markov model (graphical form). Figure: a chain of hidden states q_1 -> q_2 -> q_3 -> q_4, each state emitting one observation x_1 ... x_4.
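A minimal sketch of the factorization above, assuming a toy discrete-observation HMM with invented parameters: the likelihood decomposes into per-frame transition and emission terms, and the Viterbi approximation keeps only the best state sequence Q.

```python
import numpy as np

# Toy 2-state HMM (hypothetical parameters): pi = initial probabilities,
# A[i, j] = P(q_t = j | q_{t-1} = i), B[q, x] = P(x | q) for discrete symbols.
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])
B = np.array([[0.6, 0.4],
              [0.1, 0.9]])

def viterbi(obs):
    """Best state sequence and its score P(X, Q* | M) under the Markov and
    conditional independence assumptions."""
    delta = pi * B[:, obs[0]]                  # P(q_1) P(x_1 | q_1)
    back = []
    for x in obs[1:]:
        scores = delta[:, None] * A            # transition term P(q_t | q_{t-1})
        back.append(scores.argmax(axis=0))     # best predecessor per state
        delta = scores.max(axis=0) * B[:, x]   # emission term P(x_t | q_t)
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(delta.max())

print(viterbi([0, 1, 1]))   # best state path and its Viterbi-approximated likelihood
```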
Beyond a single frame Acoustic context can be increased beyond the single frame by: • Temporal processing before frame analysis • Multiple frames used in observation • Both mechanisms used together
Spectral vs Temporal Processing. Figure: on the time-frequency plane, spectral processing (e.g., cepstral analysis) operates across frequency within a frame, while temporal processing (e.g., mean removal) operates along time within each frequency channel.
Dynamic Speech Features • temporal dynamics useful for ASR • local time derivatives of cepstra • “delta’’ features estimated over multiple frames (typically 5) • usually augments static features • can be viewed as a temporal filter
"Delta" impulse response. Figure: impulse response of the delta filter over frames -2 ... +2, an antisymmetric ramp with weights of roughly -0.2, -0.1, 0, 0.1, 0.2.
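A minimal sketch of delta computation using the common regression estimate over 2K+1 frames (K=2, i.e. 5 frames, matching the impulse response above); the edge-padding choice is an assumption:

```python
import numpy as np

def delta(features, K=2):
    """Delta features: d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2 * sum_k k^2).
    For K=2 the effective temporal filter is [-0.2, -0.1, 0, 0.1, 0.2]."""
    T = len(features)
    padded = np.pad(features, ((K, K), (0, 0)), mode='edge')   # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(features, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return d / denom

# Deltas usually augment the static features
cepstra = np.random.randn(100, 13)                 # 100 frames of 13 cepstra
augmented = np.hstack([cepstra, delta(cepstra)])   # shape (100, 26)
```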
Temporal Filtering • Filtering of short term spectral or cepstral components • Simple noncausal case: mean removal • Generalization: bandpass or highpass filter (e.g., RASTA) • Data-driven design: LDA -> filters
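A minimal sketch of the simplest case, noncausal mean removal (CMS), plus a generic FIR temporal filter applied to each cepstral trajectory; RASTA would use a specific bandpass design whose coefficients are not reproduced here:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Noncausal mean removal: subtract the per-utterance mean of each
    cepstral coefficient, i.e. a temporal filter that removes the DC term."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def temporal_filter(cepstra, h):
    """Apply an FIR filter h along time to each cepstral trajectory
    (generalization to highpass/bandpass or data-driven designs)."""
    return np.stack([np.convolve(cepstra[:, j], h, mode='same')
                     for j in range(cepstra.shape[1])], axis=1)

cepstra = np.random.randn(200, 13)
cms = cepstral_mean_subtraction(cepstra)
highpassed = temporal_filter(cepstra, np.array([-0.2, -0.1, 0.0, 0.1, 0.2]))
```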
Linear Discriminant Analysis (1D). Figure: two classes of points in the (x_1, x_2) plane projected onto a single discriminant direction.
Linear Discriminant Analysis (multi-D). Figure: inputs x_1 ... x_5 are linearly transformed (y = Xx) into outputs y_1 ... y_3; the transformation is chosen to maximize the ratio of between-class variance to within-class variance.
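A minimal sketch of the LDA objective on hypothetical data: the transform rows are generalized eigenvectors of the between-class and within-class scatter matrices, which maximizes the variance ratio named above. The small regularization term is my addition for numerical safety.

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, n_out):
    """Rows of the returned matrix maximize between-class / within-class variance."""
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                      # within-class scatter
    Sb = np.zeros((d, d))                      # between-class scatter
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb w = lambda Sw w; keep the top n_out directions
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    order = np.argsort(vals)[::-1][:n_out]
    return vecs[:, order].T                    # shape (n_out, d)

X = np.random.randn(500, 10)
labels = np.random.randint(0, 5, size=500)
W = lda_transform(X, labels, n_out=4)          # project with X @ W.T
```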
LDA for Temporal Filter Design. Figure: a single spectral variable observed over successive time frames (x_1 ... x_5) forms the LDA input; the resulting discriminant vectors act as temporal filters.
Multi-frame analysis • Incorporate multiple frames as a single observation • LDA the most common approach • Neural networks • Bayesian networks (graphical models, including Buried Markov Models)
LDA for Multiple Frame Transformation. Figure: all feature variables from several consecutive frames (x_1 ... x_5) are stacked into one vector and linearly transformed to a lower-dimensional output (y_1 ... y_3).
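A minimal sketch of multi-frame analysis (the context width, feature dimensions, and per-frame labels are illustrative assumptions): stack several consecutive frames into one observation and let an off-the-shelf LDA, here scikit-learn's, find the projection.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_context(features, context=2):
    """Concatenate each frame with its +/- `context` neighbours (edges repeated)."""
    T, D = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[k:k + T] for k in range(2 * context + 1)])

# Hypothetical data: 13-dim cepstra with a phone-class label per frame
cepstra = np.random.randn(1000, 13)
labels = np.random.randint(0, 40, size=1000)

X = stack_context(cepstra, context=2)          # 5 frames -> 65-dim observation
lda = LinearDiscriminantAnalysis(n_components=30).fit(X, labels)
reduced = lda.transform(X)                     # (1000, 30) projected features
```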
Multi-stream analysis • Multi-band systems • Multiple temporal properties • Multiple data-driven temporal filters
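One way to sketch multi-stream combination, assuming each stream yields per-frame phone posteriors (e.g. from separate multi-band classifiers); the log-linear weighting shown is a common choice, not the only one:

```python
import numpy as np

def combine_streams(posterior_streams, weights=None):
    """Weighted log-linear combination of per-frame posteriors from several
    streams (e.g. different frequency bands or different temporal filters)."""
    n = len(posterior_streams)
    weights = weights if weights is not None else [1.0 / n] * n
    log_combined = sum(w * np.log(p + 1e-10)
                       for w, p in zip(weights, posterior_streams))
    combined = np.exp(log_combined)
    return combined / combined.sum(axis=1, keepdims=True)   # renormalize per frame

# Two hypothetical streams, 100 frames x 40 phone classes each
s1 = np.random.dirichlet(np.ones(40), size=100)
s2 = np.random.dirichlet(np.ones(40), size=100)
fused = combine_streams([s1, s2], weights=[0.6, 0.4])
```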
Another context model: articulators • Natural representation of context • Production apparatus has mass, inertia • Difficult to accurately model • Can approximate with simple dynamics
Hidden Dynamic Models "We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of a right ought to be free and context independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor." John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson ... (See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
Hidden Dynamic Models. Figure: block diagram in which a segmentation drives a target switch that selects per-phone target values; these are smoothed by a filter and mapped by a neural network to the observed speech pattern.
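A rough sketch of the block diagram above, with all values invented for illustration: a segmentation switches between per-phone target values, a first-order filter smooths them into a continuous hidden trajectory (a stand-in for articulatory mass and inertia), and a neural network (not shown) would map that trajectory to the observed speech pattern.

```python
import numpy as np

# Hypothetical per-phone targets in a 2-dim hidden (articulatory-like) space
targets = {'aa': np.array([0.8, -0.2]),
           'iy': np.array([-0.5, 0.7]),
           'sil': np.array([0.0, 0.0])}

def hidden_trajectory(segmentation, alpha=0.9):
    """First-order smoothing toward the active target:
    h_t = alpha * h_{t-1} + (1 - alpha) * target(phone_t)."""
    h = np.zeros(2)
    traj = []
    for phone, n_frames in segmentation:       # e.g. ('aa', 12)
        for _ in range(n_frames):
            h = alpha * h + (1 - alpha) * targets[phone]
            traj.append(h.copy())
    return np.array(traj)

traj = hidden_trajectory([('sil', 5), ('aa', 12), ('iy', 10)])
# A neural network (not shown) would map traj to the observed speech pattern.
```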
Sources of Optimism • Comparatively new research lines • Many examples of improvements • Moore's Law: much more processing power available • Points toward joint development of the front end and the statistical components
Summary • Acoustic context is already incorporated in simple forms of temporal signal processing (CMS, deltas) • Generalized forms can help more • Study of the interaction between traditional ASR strata may help in design of this component