Learn about Mel-Frequency Cepstral Coefficients (MFCC), the most widely used spectral representation in Automatic Speech Recognition (ASR). Understand the importance of frame division and windowing in capturing spectral information. Explore the computation of MFCC, Mel-scale, Mel Filter Bank, and log energy. Discover delta and double-delta features for capturing temporal information. Find out why MFCC is popular for ASR and how it improves Hidden Markov Model (HMM) modeling.
LSA 352: Speech Recognition and Synthesis. Dan Jurafsky. Lecture 6: Feature Extraction and Acoustic Modeling. IP Notice: Various slides were derived from Andrew Ng’s CS 229 notes, as well as lecture notes from Chen, Picheny et al., Yun-Hsuan Sung, and Bryan Pellom. I’ll try to give correct credit on each slide, but I’ll probably miss some.
MFCC • Mel-Frequency Cepstral Coefficients (MFCC) • The most widely used spectral representation in ASR
Windowing • Why divide the speech signal into successive overlapping frames? • Speech is not a stationary signal; we want information about a region small enough that its spectral content is a useful cue. • Frames • Frame size: typically 10-25 ms • Frame shift: the length of time between successive frames, typically 5-10 ms
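A minimal sketch of this framing step in NumPy (the helper name and the 16 kHz example are illustrative, not part of the original slides):

import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (frame size and shift given in ms)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

# Example: 1 s of a 16 kHz signal -> 98 frames of 400 samples each
frames = frame_signal(np.random.randn(16000), 16000)
print(frames.shape)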
Windowing Slide from Bryan Pellom
Discrete Fourier Transform: computing a spectrum • A 24 ms Hamming-windowed signal • And its spectrum as computed by the DFT (plus other smoothing)
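A sketch of the same step in NumPy: apply a Hamming window to a single frame and take the magnitude of its DFT (computed with the FFT). The 400-sample random frame is just a stand-in for a real frame at 16 kHz:

import numpy as np

def frame_spectrum(frame, n_fft=512):
    """Hamming-window one frame and return its DFT magnitude spectrum."""
    windowed = frame * np.hamming(len(frame))
    return np.abs(np.fft.rfft(windowed, n=n_fft))   # n_fft // 2 + 1 frequency bins

spectrum = frame_spectrum(np.random.randn(400))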
Mel-scale • Human hearing is not equally sensitive to all frequency bands • Less sensitive at higher frequencies, roughly above 1000 Hz • I.e., human perception of frequency is non-linear.
Mel-scale • A mel is a unit of pitch • Definition: pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels • The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz • A common definition: mel(f) = 2595 log10(1 + f/700), so that, e.g., 1000 Hz maps to roughly 1000 mels
Mel Filter Bank Processing • Mel filter bank • Filters uniformly spaced below 1 kHz • Logarithmically spaced above 1 kHz
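A sketch of constructing such a bank of triangular filters, with filter edges equally spaced on the mel scale (so they end up denser in Hz below 1 kHz); the filter count and FFT size are illustrative choices:

import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)   # same mel definition as above
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edges: equally spaced in mels, converted back to FFT bin indices
    edges_mel = np.linspace(hz2mel(0.0), hz2mel(sample_rate / 2.0), n_filters + 2)
    edges_bin = np.floor((n_fft + 1) * mel2hz(edges_mel) / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, centre, right = edges_bin[m - 1], edges_bin[m], edges_bin[m + 1]
        for k in range(left, centre):                        # rising slope
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):                       # falling slope
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

fbank = mel_filterbank()   # shape (26, 257) for a 512-point FFT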
Log energy computation • Compute the logarithm of the squared magnitude of the output of the Mel filter bank
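Continuing the sketch: square the magnitude spectrum, pass it through the filter bank, and take the log (the small floor is only a numerical safeguard, not part of the definition):

import numpy as np

def log_mel_energies(spectrum, fbank, floor=1e-10):
    """Log of the Mel filter-bank outputs applied to the squared magnitude spectrum."""
    power = spectrum ** 2                   # squared magnitude of the DFT
    return np.log(fbank @ power + floor)    # floor avoids log(0)

# e.g. log_mel_energies(frame_spectrum(frame), mel_filterbank()), reusing the sketches above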
Log energy computation • Why log energy? • The logarithm compresses the dynamic range of values • Human response to signal level is logarithmic • Humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes • It makes frequency estimates less sensitive to slight variations in input (e.g., power variation due to the speaker’s mouth moving closer to the microphone) • Phase information is not helpful in speech
Dynamic Cepstral Coefficients • The cepstral coefficients do not capture energy • So we add an energy feature • Also, we know that the speech signal is not constant (slope of formants, change from stop burst to release) • So we also want to add the changes in features (the slopes) • We call these delta features • We also add double-delta (acceleration) features
Delta and double-delta • Derivatives of the features, taken in order to capture temporal information
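A hedged sketch of delta features as a centred difference over time (HTK and most systems actually use a regression over several neighbouring frames, but the idea is the same); double-deltas are simply deltas of the deltas:

import numpy as np

def delta(features):
    """Frame-to-frame slope of each feature, approximated by a centred difference."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")   # repeat first/last frame
    return (padded[2:] - padded[:-2]) / 2.0

cepstra = np.random.randn(100, 13)          # stand-in for 13 cepstral/energy values per frame
d, dd = delta(cepstra), delta(delta(cepstra))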
Typical MFCC features • Window size: 25 ms • Window shift: 10 ms • Pre-emphasis coefficient: 0.97 • MFCC: • 12 MFCC (mel frequency cepstral coefficients) • 1 energy feature • 12 delta MFCC features • 12 double-delta MFCC features • 1 delta energy feature • 1 double-delta energy feature • Total: 39-dimensional features
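For reference, a sketch of a similar 39-dimensional front end using librosa (assuming librosa is installed; the file name is hypothetical, and here C0 plays the role of the energy term, matching the MFCC_0_D_A setup used in the HTK example later in these slides):

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)             # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)      # 25 ms window, 10 ms shift at 16 kHz
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d1, d2])                        # shape (39, n_frames)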
Why is MFCC so popular? • Efficient to compute • Incorporates a perceptual mel frequency scale • Separates the source and filter • The IDFT (DCT) decorrelates the features • This improves the diagonal-covariance assumption in HMM modeling • Alternative: PLP (Perceptual Linear Prediction)
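The decorrelation step is a DCT of the log filter-bank energies; a sketch with scipy (the 26 random values stand in for real log energies from the earlier sketches):

import numpy as np
from scipy.fftpack import dct

def mfcc_from_log_energies(log_energies, n_ceps=12):
    """DCT-II of the log filter-bank energies; keep C1..C12 (C0 is often used as an energy term)."""
    cepstrum = dct(log_energies, type=2, norm="ortho")
    return cepstrum[1:n_ceps + 1]

print(mfcc_from_log_energies(np.random.randn(26)))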
…and now what? • What we have = what we OBSERVE • A matrix of observations: O = [o1, o2, …, oT]
Multivariate Gaussians • Characterized by a vector of means and a covariance matrix • In practice: M mixtures of Gaussians per state
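As a sketch, the output distribution of one state is a weighted sum of multivariate Gaussians, b_j(o_t) = sum over m of c_jm N(o_t; mu_jm, Sigma_jm). A small illustration with scipy (the two-mixture, 33-dimensional shapes mirror the HTK prototype shown later):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_likelihood(o, weights, means, covs):
    """b_j(o) = sum_m c_m * N(o; mu_m, Sigma_m) for one HMM state."""
    return sum(c * multivariate_normal.pdf(o, mean=mu, cov=cov)
               for c, mu, cov in zip(weights, means, covs))

weights = [0.6, 0.4]                          # mixture weights, as in the prototype below
means = [np.zeros(33), np.zeros(33)]
covs = [np.eye(33), np.eye(33)]               # diagonal covariance in practice
print(gmm_likelihood(np.zeros(33), weights, means, covs))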
Gaussian Intuitions: Off-diagonal • As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y. Text and figures from Andrew Ng’s lecture notes for CS229
Diagonal covariance • The diagonal contains the variance of each dimension, σii² • So this means we consider the variance of each acoustic feature (dimension) separately
Gaussian Intuitions: Size of Σ • μ = [0 0] in all three panels • Σ = I, Σ = 0.6I, Σ = 2I • As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed. Text and figures from Andrew Ng’s lecture notes for CS229
Observation “probability” • How “well” an observation “fits” a given state • Example: Obs. 3 in State 2. Text and figures from Andrew Ng’s lecture notes for CS229
Gaussian PDFs • A Gaussian is a probability density function; probability is the area under the curve • To make it a probability, we constrain the area under the curve to be 1 • BUT… • We will be using “point estimates”: the value of the Gaussian at a point • Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx • BUT… this is OK, since the same factor is omitted from all Gaussians (we take the argmax, so the comparison is still correct)
Gaussian as Probability Density Function • Point estimate
Using a Gaussian as an acoustic likelihood estimator • EXAMPLE: • Let’s suppose our observation is a single real-valued feature (instead of a 33-dimensional vector) • Then, if we had learned a Gaussian over the distribution of values of this feature • We could compute the likelihood of any given observation ot as the value of that Gaussian at ot: bj(ot) = (1 / sqrt(2π σj²)) exp(−(ot − μj)² / (2σj²))
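A one-line sketch of that point estimate in NumPy:

import numpy as np

def gaussian_likelihood(o_t, mu, var):
    """Value of the Gaussian pdf at o_t -- the 'point estimate' used as b_j(o_t)."""
    return np.exp(-(o_t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(gaussian_likelihood(1.2, mu=1.0, var=0.25))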
So, how do we train an HMM λ? • First decide the topology: number of states, transitions, …
In our HTK example (HMM prototypes) • The file name MUST be equal to the linguistic unit: si, no and sil • We will use: • 12 states (14 in HTK: 12 + 2 non-emitting start/end states) • and two Gaussians (mixtures) per state • with dimension 33
In our HTK example (HMM prototypes) • So, for example, the file named si will be:

~o <VecSize> 33 <MFCC_0_D_A>
~h "si"
<BeginHMM>
<NumStates> 14
<State> 2 <NumMixes> 2
<Stream> 1
<Mixture> 1 0.6
<Mean> 33
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<Mixture> 2 0.4
<Mean> 33
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<Variance> 33
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3 <NumMixes> 2
<Stream> 1
......................................
In our HTK example (HMM prototypes)
......................................
<TransP> 14
0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
<EndHMM>
…but, how do we train an HMM λ for a particular linguistic unit? • Training = “estimating” the transition and observation probabilities (mixture weights, means and variances)
…but, how do we train an HMM? • We have some training data for each linguistic unit
…but, how do we train an HMM? • Example for a (single) Gaussian characterized by a mean and a variance
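With a hard assignment of frames to states, the estimates are simply the sample mean and variance of the frames assigned to each state; a sketch (the random frames are a stand-in for real training data):

import numpy as np

def estimate_gaussian(frames):
    """ML estimates of the mean and (diagonal) variance from frames assigned to one state."""
    return frames.mean(axis=0), frames.var(axis=0)

state_frames = np.random.randn(50, 33)     # 50 frames of 33-dimensional features
mu, var = estimate_gaussian(state_frames)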
Uniform segmentation is used by HInit to produce the initial (output) HMMs
…what is the “hidden” sequence of states? • GIVEN: • an HMM λ • and a sequence of states S
…what is the “hidden” sequence of states? • The joint likelihood P(O, S | λ) can be easily obtained
…what is the “hidden” sequence of states? • …and by summing over all possible sequences S we obtain P(O | λ)
…what is the “hidden” sequence of states? • …and by summing over all possible sequences S we obtain P(O | λ), computed efficiently in Baum-Welch (forward-backward)
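A small sketch of the forward recursion that performs this sum over all state sequences efficiently (toy sizes; the observation likelihoods are assumed to be precomputed):

import numpy as np

def forward(obs_lik, trans, init):
    """P(O | lambda): obs_lik[t, j] = b_j(o_t), trans[i, j] = a_ij, init[j] = pi_j."""
    T, N = obs_lik.shape
    alpha = np.zeros((T, N))
    alpha[0] = init * obs_lik[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]   # sum over predecessor states
    return alpha[-1].sum()

obs_lik = np.array([[0.7, 0.2, 0.1],      # 4 frames, 3 states
                    [0.4, 0.5, 0.1],
                    [0.1, 0.6, 0.3],
                    [0.1, 0.2, 0.7]])
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.6, 0.4],
                  [0.0, 0.0, 1.0]])
print(forward(obs_lik, trans, init=np.array([1.0, 0.0, 0.0])))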
…so, with the optimum sequence of states on the training data, the transition and observation probabilities (weights, means and variances) can be re-estimated … UNTIL CONVERGENCE
HMM training using HRest produces the re-estimated (output) HMMs
…training is usually done using Baum-Welch • Simplifying (too much?): the re-estimation formulas weight each observation by its probability of occupying a particular state (the occupation probability) • The sums are weighted by the occupation probability and then normalised (see the HTKBook)
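In sketch form, the mean and variance updates weight each observation by the state occupation probability gamma_t(j) and normalise, in the spirit of the HTKBook re-estimation formulas (gamma is assumed to come from a forward-backward pass):

import numpy as np

def reestimate_mean_var(obs, gamma_j):
    """mu_j = sum_t gamma_t(j) o_t / sum_t gamma_t(j); diagonal variance analogously."""
    w = gamma_j / gamma_j.sum()                  # normalised occupation probabilities
    mu = (w[:, None] * obs).sum(axis=0)
    var = (w[:, None] * (obs - mu) ** 2).sum(axis=0)
    return mu, var

# obs: (T, 33) feature frames; gamma_j: (T,) occupation probabilities for state j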
…why not train all the HMMs together? …like a global HMM … In our example, in each file we have:
HMM training using HERest (embedded re-estimation) produces the output HMMs