Incorporating Acoustic Context in Automatic Speech Recognition Nelson Morgan, ICSI and UC Berkeley
Talk Outline • What I mean by “Acoustic Context” • Dynamic Features • Temporal Filtering • Multi-frame analysis • Multi-stream analysis • Hidden Dynamic Models • Conclusions
1952 Bell Labs Digits • Possibly the first word (digit) recognizer • Approximated energy in formants over word • Insensitive to amplitude, timing variation • Clumsy technology
Digit Patterns. Figure: the spoken digit feeds two parallel paths, an HP filter (1 kHz) and an LP filter (800 Hz), each followed by a limiting amplifier and an axis-crossing counter; the resulting counts form digit patterns plotted on a plane with the low-band axis spanning roughly 200-800 Hz and the high-band axis extending to about 3 kHz.
Processing the Streams • Hard limiting and counting for each • Simple form of acoustic context: evaluating features for entire word • Replaced by framewise analysis
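A minimal sketch of the axis-crossing idea in modern terms (the filter orders, cutoffs, and scipy-based implementation are my assumptions, not the original circuit): split the signal into low and high bands, hard-limit each, and count axis crossings over the whole word.

```python
import numpy as np
from scipy.signal import butter, lfilter

def axis_crossings(x):
    """Count zero (axis) crossings of a hard-limited signal."""
    s = np.sign(x)
    s[s == 0] = 1                          # treat exact zeros as positive
    return int(np.sum(s[1:] != s[:-1]))

def digit_features(signal, fs=8000):
    """Two word-level features: axis crossings in a low band (<800 Hz)
    and a high band (>1 kHz), loosely following the 1952 design."""
    b_lo, a_lo = butter(2, 800, btype='low', fs=fs)
    b_hi, a_hi = butter(2, 1000, btype='high', fs=fs)
    low = lfilter(b_lo, a_lo, signal)
    high = lfilter(b_hi, a_hi, signal)
    return axis_crossings(low), axis_crossings(high)

# Example: a synthetic "vowel" with energy near 300 Hz and 2.3 kHz
fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2300 * t)
print(digit_features(x, fs))
```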
Framewise Analysis of Speech. Figure: the waveform is divided into successive frames (Frame 1, Frame 2, ...), each mapped to a feature vector (X1, X2, ...).
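As a rough sketch of framewise analysis (the 25 ms frame length, 10 ms hop, and toy features are illustrative assumptions), each frame is mapped to one feature vector:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25, hop_ms=10):
    """Slice a waveform into overlapping frames (one frame per row)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def frame_features(frames):
    """One toy feature vector per frame: log energy and zero-crossing rate."""
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.stack([energy, zcr], axis=1)     # shape (n_frames, 2)

fs = 16000
x = np.random.randn(fs)                        # one second of stand-in "speech"
frames = frame_signal(x, fs)                   # shape (n_frames, 400)
feats = frame_features(frames)                 # one feature vector per frame
```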
Acoustic Context • Evaluating costs based on observations over time intervals greater than the "typical" frame (20-30 ms) • Not the same as phonetic context (models for sound units in particular contexts)
Statistical ASR • i_best = argmax_i P(M_i | X) = argmax_i P(X | M_i) P(M_i) • P(X | M_i) ≈ P(X | Q_i) [Viterbi approx.], where Q_i is the best state sequence in M_i • P(X | Q_i) is approximated by a product of local likelihoods (Markov, conditional independence assumptions)
Markov model (states q_1, q_2): P(x_1, x_2 | q_1, q_2) = P(q_1) P(x_1 | q_1) P(q_2 | q_1) P(x_2 | q_2)
Markov model (graphical form). Figure: a chain of hidden states q_1 -> q_2 -> q_3 -> q_4, each state emitting one observation x_1 ... x_4.
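A minimal sketch of the factorization above, assuming a toy discrete-observation HMM with invented parameters: the likelihood decomposes into per-frame transition and emission terms, and the Viterbi approximation keeps only the best state sequence Q.

```python
import numpy as np

# Toy 2-state HMM (hypothetical parameters): pi = initial probabilities,
# A[i, j] = P(q_t = j | q_{t-1} = i), B[q, x] = P(x | q) for discrete symbols.
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])
B = np.array([[0.6, 0.4],
              [0.1, 0.9]])

def viterbi(obs):
    """Best state sequence and its score P(X, Q* | M) under the Markov and
    conditional independence assumptions."""
    delta = pi * B[:, obs[0]]                  # P(q_1) P(x_1 | q_1)
    back = []
    for x in obs[1:]:
        scores = delta[:, None] * A            # transition term P(q_t | q_{t-1})
        back.append(scores.argmax(axis=0))     # best predecessor per state
        delta = scores.max(axis=0) * B[:, x]   # emission term P(x_t | q_t)
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(delta.max())

print(viterbi([0, 1, 1]))   # best state path and its Viterbi-approximated likelihood
```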
Beyond a single frame Acoustic context can be increased beyond the single frame by: • Temporal processing before frame analysis • Multiple frames used in observation • Both mechanisms used together
Spectral vs Temporal Processing. Figure: on the time-frequency plane, spectral processing (e.g., cepstral analysis) operates across frequency within a frame, while temporal processing (e.g., mean removal) operates along time within each frequency channel.
Dynamic Speech Features • temporal dynamics useful for ASR • local time derivatives of cepstra • “delta’’ features estimated over multiple frames (typically 5) • usually augments static features • can be viewed as a temporal filter
"Delta" impulse response. Figure: impulse response of the delta filter over frames -2 ... +2, an antisymmetric ramp with weights of roughly -0.2, -0.1, 0, 0.1, 0.2.
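A minimal sketch of delta computation using the common regression estimate over 2K+1 frames (K=2, i.e. 5 frames, matching the impulse response above); the edge-padding choice is an assumption:

```python
import numpy as np

def delta(features, K=2):
    """Delta features: d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2 * sum_k k^2).
    For K=2 the effective temporal filter is [-0.2, -0.1, 0, 0.1, 0.2]."""
    T = len(features)
    padded = np.pad(features, ((K, K), (0, 0)), mode='edge')   # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(features, dtype=float)
    for k in range(1, K + 1):
        d += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return d / denom

# Deltas usually augment the static features
cepstra = np.random.randn(100, 13)                 # 100 frames of 13 cepstra
augmented = np.hstack([cepstra, delta(cepstra)])   # shape (100, 26)
```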
Temporal Filtering • Filtering of short term spectral or cepstral components • Simple noncausal case: mean removal • Generalization: bandpass or highpass filter (e.g., RASTA) • Data-driven design: LDA -> filters
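A minimal sketch of the simplest case, noncausal mean removal (CMS), plus a generic FIR temporal filter applied to each cepstral trajectory; RASTA would use a specific bandpass design whose coefficients are not reproduced here:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Noncausal mean removal: subtract the per-utterance mean of each
    cepstral coefficient, i.e. a temporal filter that removes the DC term."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def temporal_filter(cepstra, h):
    """Apply an FIR filter h along time to each cepstral trajectory
    (generalization to highpass/bandpass or data-driven designs)."""
    return np.stack([np.convolve(cepstra[:, j], h, mode='same')
                     for j in range(cepstra.shape[1])], axis=1)

cepstra = np.random.randn(200, 13)
cms = cepstral_mean_subtraction(cepstra)
highpassed = temporal_filter(cepstra, np.array([-0.2, -0.1, 0.0, 0.1, 0.2]))
```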
Linear Discriminant Analysis (1D). Figure: two classes of points in the (x_1, x_2) plane projected onto a single discriminant direction.
Linear Discriminant Analysis (multi-D). Figure: inputs x_1 ... x_5 are linearly transformed (y = Xx) into outputs y_1 ... y_3; the transformation is chosen to maximize the ratio of between-class variance to within-class variance.
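A minimal sketch of the LDA objective on hypothetical data: the transform rows are generalized eigenvectors of the between-class and within-class scatter matrices, which maximizes the variance ratio named above. The small regularization term is my addition for numerical safety.

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, n_out):
    """Rows of the returned matrix maximize between-class / within-class variance."""
    classes = np.unique(labels)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                      # within-class scatter
    Sb = np.zeros((d, d))                      # between-class scatter
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb w = lambda Sw w; keep the top n_out directions
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    order = np.argsort(vals)[::-1][:n_out]
    return vecs[:, order].T                    # shape (n_out, d)

X = np.random.randn(500, 10)
labels = np.random.randint(0, 5, size=500)
W = lda_transform(X, labels, n_out=4)          # project with X @ W.T
```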
LDA for Temporal Filter Design. Figure: a single spectral variable observed over successive time frames (x_1 ... x_5) forms the LDA input; the resulting discriminant vectors act as temporal filters.
Multi-frame analysis • Incorporate multiple frames as a single observation • LDA the most common approach • Neural networks • Bayesian networks (graphical models, including Buried Markov Models)
LDA for Multiple Frame Transformation. Figure: all feature variables from several consecutive frames (x_1 ... x_5) are stacked into one vector and linearly transformed to a lower-dimensional output (y_1 ... y_3).
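A minimal sketch of multi-frame analysis (the context width, feature dimensions, and per-frame labels are illustrative assumptions): stack several consecutive frames into one observation and let an off-the-shelf LDA, here scikit-learn's, find the projection.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_context(features, context=2):
    """Concatenate each frame with its +/- `context` neighbours (edges repeated)."""
    T, D = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[k:k + T] for k in range(2 * context + 1)])

# Hypothetical data: 13-dim cepstra with a phone-class label per frame
cepstra = np.random.randn(1000, 13)
labels = np.random.randint(0, 40, size=1000)

X = stack_context(cepstra, context=2)          # 5 frames -> 65-dim observation
lda = LinearDiscriminantAnalysis(n_components=30).fit(X, labels)
reduced = lda.transform(X)                     # (1000, 30) projected features
```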
Multi-stream analysis • Multi-band systems • Multiple temporal properties • Multiple data-driven temporal filters
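One way to sketch multi-stream combination, assuming each stream yields per-frame phone posteriors (e.g. from separate multi-band classifiers); the log-linear weighting shown is a common choice, not the only one:

```python
import numpy as np

def combine_streams(posterior_streams, weights=None):
    """Weighted log-linear combination of per-frame posteriors from several
    streams (e.g. different frequency bands or different temporal filters)."""
    n = len(posterior_streams)
    weights = weights if weights is not None else [1.0 / n] * n
    log_combined = sum(w * np.log(p + 1e-10)
                       for w, p in zip(weights, posterior_streams))
    combined = np.exp(log_combined)
    return combined / combined.sum(axis=1, keepdims=True)   # renormalize per frame

# Two hypothetical streams, 100 frames x 40 phone classes each
s1 = np.random.dirichlet(np.ones(40), size=100)
s2 = np.random.dirichlet(np.ones(40), size=100)
fused = combine_streams([s1, s2], weights=[0.6, 0.4])
```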
Another context model: articulators • Natural representation of context • Production apparatus has mass, inertia • Difficult to accurately model • Can approximate with simple dynamics
Hidden Dynamic Models "We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of a right ought to be free and context independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor." John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson ... (See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
Hidden Dynamic Models. Figure: block diagram in which a segmentation drives a target switch that selects per-phone target values; these are smoothed by a filter and mapped by a neural network to the observed speech pattern.
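A rough sketch of the block diagram above, with all values invented for illustration: a segmentation switches between per-phone target values, a first-order filter smooths them into a continuous hidden trajectory (a stand-in for articulatory mass and inertia), and a neural network (not shown) would map that trajectory to the observed speech pattern.

```python
import numpy as np

# Hypothetical per-phone targets in a 2-dim hidden (articulatory-like) space
targets = {'aa': np.array([0.8, -0.2]),
           'iy': np.array([-0.5, 0.7]),
           'sil': np.array([0.0, 0.0])}

def hidden_trajectory(segmentation, alpha=0.9):
    """First-order smoothing toward the active target:
    h_t = alpha * h_{t-1} + (1 - alpha) * target(phone_t)."""
    h = np.zeros(2)
    traj = []
    for phone, n_frames in segmentation:       # e.g. ('aa', 12)
        for _ in range(n_frames):
            h = alpha * h + (1 - alpha) * targets[phone]
            traj.append(h.copy())
    return np.array(traj)

traj = hidden_trajectory([('sil', 5), ('aa', 12), ('iy', 10)])
# A neural network (not shown) would map traj to the observed speech pattern.
```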
Sources of Optimism • Comparatively new research lines • Many examples of improvements • Moore's Law: much more processing power available • Points toward joint development of the front end and the statistical components
Summary • Acoustic context is already incorporated in simple forms of temporal signal processing (CMS, deltas) • Generalized forms can help more • Study of the interaction between traditional ASR strata may help in design of this component