IRCS/CCN Summer Workshop June 2003 Speech Recognition
Why is perception hard?
• Task: available signals → model of the world around
  • signals are mostly accidental, inadequate
  • sometimes disguised or falsified
  • always mixed-up and ambiguous
• Reasoning about the source of signals:
  • Integration of context: what do you expect?
  • “Sensor fusion”: integration of vision, sound, smell etc.
  • Source (and noise) separation: there’s more than one thing out there
  • Variable perspective, source variation etc.
    • depends on the type of signal
    • depends on the type of object
• Much harder than chess or calculus!
Bayesian probability estimation
• Thomas Bayes (1702-1761)
  • Minister of the Presbyterian Chapel at Tunbridge Wells
  • Amateur mathematician
  • Essay towards solving a problem in the doctrine of chances, published (posthumously) in 1764
• Crucial idea: background (prior) knowledge about the plausibility of different theories can be combined with knowledge about the relation of theories to evidence
  • in a mathematically well-defined way
  • even if all knowledge is uncertain
  • to reason about the most likely explanation of the available evidence
• Bayes’ theorem
  • “the most important equation in the history of mathematics” (?)
  • a simple consequence of basic definitions, or
  • a still-controversial recipe for the probability of alternative causes for a given event, or
  • the implicit foundation of human reasoning
  • a general framework for solving the problems of perception
Tutorial on Bayes’ Theorem
Fundamental theorem of speech recognition
P(W|S) ∝ P(S|W)P(W)
where W is “Word(s)” (i.e. message text)
      S is “Sound(s)” (i.e. speech signal)
• “Noisy channel model” of communications engineering, due to Shannon 1949
• New algorithms, especially relevant to speech recognition, due to L.E. Baum et al. ~ 1965-1970
• Applied to speech recognition by Jim Baker (CMU PhD 1975), Fred Jelinek (IBM speech group >>1975)
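The proportionality above can be made concrete with a small sketch. The candidate word strings and all probabilities below are hypothetical numbers chosen for illustration, not values from the workshop; the function name `posterior` is likewise an assumption.

```python
# Minimal sketch of P(W|S) ∝ P(S|W) P(W) over a closed set of candidates.
# All numbers are hypothetical and chosen only for illustration.

def posterior(likelihood, prior):
    """Return normalized P(W|S) for each candidate W."""
    unnorm = {w: likelihood[w] * prior[w] for w in prior}
    total = sum(unnorm.values())  # plays the role of P(S) over these candidates
    return {w: p / total for w, p in unnorm.items()}

prior = {"recognize speech": 0.7, "wreck a nice beach": 0.3}       # P(W)
likelihood = {"recognize speech": 0.2, "wreck a nice beach": 0.3}  # P(S|W)

post = posterior(likelihood, prior)
best = max(post, key=post.get)  # the W that maximizes P(S|W) P(W)
```

Here the acoustics alone slightly favor the second candidate, but the prior outweighs that, so the first candidate has the higher posterior. Normalizing is optional when only the best W is needed, since P(S) is the same for every candidate.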
Motivations for a Bayesian approach
• A consistent framework for integrating previous experience and current evidence
• A quantitative model for “abduction” = reasoning about the best explanation
• A general method for turning a generative model into an analytic one = “analysis by synthesis”
  • helpful where |categories| << |signals|
These motivations apply both in engineering practice and in the evolution of biological systems
Basic architecture of standard speech recognition technology
1. Bayes’ Rule: P(W|S) ∝ P(S|W)P(W)
2. Approximate P(S|W)P(W) as a Hidden Markov Model: a probabilistic function [to get P(S|W)] of a Markov chain [to get P(W)]
3. Use Baum/Welch (=EM) algorithm to “learn” HMM parameters
4. Use Viterbi decoding to find the most probable W given S in terms of the estimated HMM
HMM parameter estimation given labelled/aligned training data...
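When every observation is already labelled with its state, estimation reduces to counting. The sketch below assumes discrete observations and plain relative-frequency estimates with no smoothing; the function name `estimate_hmm` and the toy data are hypothetical, and real recognizers use continuous acoustic features instead.

```python
from collections import Counter, defaultdict

def estimate_hmm(sequences):
    """Relative-frequency estimates from aligned data.

    sequences: list of lists of (state, observation) pairs.
    Returns (transition, emission) as nested dicts of probabilities.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq in sequences:
        for i, (state, obs) in enumerate(seq):
            emit_counts[state][obs] += 1
            if i + 1 < len(seq):
                trans_counts[state][seq[i + 1][0]] += 1

    def normalize(counts):
        # Turn each row of counts into a probability distribution.
        return {s: {k: v / sum(c.values()) for k, v in c.items()}
                for s, c in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)

# Hypothetical aligned training data: (state, observation) pairs.
data = [[("a", "x"), ("a", "y"), ("b", "y")],
        [("a", "x"), ("b", "y"), ("b", "y")]]
trans, emit = estimate_hmm(data)
```

Unseen transitions and emissions get no entry at all here, which is one reason practical systems add smoothing or parameter tying.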
Viterbi decoding given HMM & observed signal...
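A minimal sketch of Viterbi decoding for a discrete-observation HMM follows. It works in the log domain and assumes every probability it touches is nonzero; the two-state model, its symbols, and all numbers are hypothetical illustrations, not parameters from the workshop.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_state_path) for an observation sequence."""
    # best[t][s]: log probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

# Hypothetical two-state model with three observation symbols.
states = ["s1", "s2"]
start_p = {"s1": 0.6, "s2": 0.4}
trans_p = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit_p = {"s1": {"a": 0.5, "b": 0.4, "c": 0.1},
          "s2": {"a": 0.1, "b": 0.3, "c": 0.6}}

logp, path = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
```

For this toy model the best path is s1, s1, s2 with probability 0.01512. The cost is proportional to the sequence length times the square of the number of states, because each time step only needs the best score per state from the previous step.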
Sketch of Baum-Welch (EM) algorithm for estimating HMM parameters given unaligned (or even unlabelled) training data
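One re-estimation step can be sketched as below, assuming a discrete-observation HMM, a single training sequence, and no numerical scaling (so it would underflow on long sequences). The function names and the toy parameters are hypothetical; this is an illustration of the forward-backward idea, not the implementation used in any particular recognizer.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, state i at time t)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | state i at time t)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One EM iteration; also returns the likelihood under the input parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # E-step: posterior state occupancies (gamma) and transitions (xi).
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M-step: re-estimate parameters from the expected counts.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood

# Hypothetical starting point: 2 states, 2 observation symbols (0 and 1).
obs = [0, 1, 1, 0, 0, 1, 1, 1]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.8, 0.2], [0.3, 0.7]]

likelihoods = []
for _ in range(5):
    pi, A, B, lik = baum_welch_step(obs, pi, A, B)
    likelihoods.append(lik)
```

The recorded likelihoods should never decrease from one iteration to the next, which is the defining property of EM. No state labels are used anywhere: the expected counts from the E-step stand in for the observed counts that aligned training data would provide.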
Other typical details: Complex elaborations of the basic ideas
• HMM states ← triphones ← words
  • each triphone → 3-5 states + connection pattern
  • phone sequence from pronouncing dictionary
  • clustering for estimation
• Acoustic features
  • RASTA-PLP etc.
  • Vocal tract length normalization, speaker clustering
• Output pdf for each state as mixture of Gaussians
• Language model as N-gram model over words
  • recency/topic effects
• Empirical weighting of language vs. acoustic models
• etc. etc.
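Two of these details, the N-gram language model and the empirical weighting of language vs. acoustic scores, can be sketched together. The sketch assumes a bigram model with add-one smoothing and a single multiplicative weight on the language-model log probability; the corpus, the acoustic scores, and the weight value are all hypothetical, and in practice the weight would be tuned on held-out data.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Return a function giving the add-one-smoothed bigram log P(W)."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        contexts.update(padded[:-1])                  # counts of each history word
        bigrams.update(zip(padded[:-1], padded[1:]))  # counts of each word pair
    vocab_size = len({w for s in sentences for w in s} | {"</s>"})

    def logprob(words):
        padded = ["<s>"] + words + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
                   for a, b in zip(padded[:-1], padded[1:]))

    return logprob

def combined_score(acoustic_logprob, lm_logprob, lm_weight):
    """log P(S|W) + lm_weight * log P(W); lm_weight is set empirically."""
    return acoustic_logprob + lm_weight * lm_logprob

# Hypothetical training text and acoustic log-likelihoods.
corpus = [["recognize", "speech"], ["recognize", "speech"],
          ["wreck", "a", "nice", "beach"]]
lm = train_bigram(corpus)
acoustic = {"recognize speech": -120.0, "wreck a nice beach": -118.0}

def best_hypothesis(lm_weight):
    return max(acoustic, key=lambda w: combined_score(acoustic[w], lm(w.split()), lm_weight))

with_lm = best_hypothesis(10.0)      # language model weighted in
acoustic_only = best_hypothesis(0.0)  # language model ignored
```

With the language model switched off, the acoustically better hypothesis wins; with it weighted in, the more plausible word sequence wins.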
Some limitations of the standard architecture
• Problems with Markovian assumptions
• Modeling trajectory effects
• Variable coordination of articulatory dimensions
• ....