Exploiting Vector Taylor Series for Robust Speech Recognition (CMU ICASSP 1996)

Wei-Hau Howard Chen, Spoken Language Processing Laboratory, National Taiwan Normal University

Reference
[1] P. J. Moreno et al. (1996). "A Vector Taylor Series Approach for Environment-Independent Speech Recognition," Proc. ICASSP.
[2] F.-H. Liu (1994). "Environmental Adaptation for Robust Speech Recognition," CMU, July.
[3] L. Neumeyer and M. Weintraub (1994). "Probabilistic Optimum Filtering for Robust Speech Recognition," Proc. ICASSP.
[4] C. J. Leggetter and P. C. Woodland (1995). "Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression," Proc. ARPA Spoken Language Systems Technology Workshop, January.
[5] M. Gales and S. Young (1995). "A Fast and Flexible Implementation of Parallel Model Combination," Proc. ICASSP.

Outline
• Introduction
• A model of the environment
• Description of the VTS algorithm
• Statistics of clean and noisy speech
• Compensation of noisy speech
• The VTS algorithm
Reference: P. J. Moreno et al., "A Vector Taylor Series Approach for Environment-Independent Speech Recognition"

Abstract 1
• The paper introduces a new analytical approach to environment compensation for speech recognition.
• Previous attempts at solving the problem of noisy speech recognition analytically have either:
  • used an overly simplified mathematical description of the effect of noise on the statistics of speech, or
  • relied on the availability of large environment-specific adaptation sets.
• In this paper, the authors introduce the use of a Vector Taylor Series (VTS) expansion to characterize efficiently and accurately the effects on speech statistics of unknown additive noise and unknown linear filtering in a transmission channel.

Abstract 2
• The VTS approach is computationally efficient.
• It can be applied either to the incoming speech feature vectors or to the statistics representing these vectors.
• In the first case the speech is compensated and then recognized; in the second case the HMM statistics are modified using the VTS formulation.
• Both approaches use only the actual speech segment being recognized to compute the parameters required for environmental compensation.

Introduction 1
• As automatic speech recognition (ASR) systems become more accurate and more sophisticated, robustness to noise, channel, and other environmental effects becomes increasingly important.
• Many environment compensation algorithms take advantage of the availability of "stereo data", i.e. speech databases that are simultaneously recorded in high-quality and degraded environments [2][3].
• Other algorithms make use of non-simultaneously-recorded adaptation data from the degraded environment [4].

Introduction 2
• Still other algorithms [5] use knowledge of noise statistics and extensive computation to adapt the HMMs of clean speech to a new environment.
• Unfortunately, stereo data, a priori knowledge about the test environment, and/or the computational resources such algorithms require are frequently unavailable.
• From a practical point of view, algorithms that can compensate for environmental effects with almost no prior knowledge, and that require only a small segment of the speech signal to do so, are far more attractive than those that require environment-specific information.

Introduction 3
• Codebook-dependent cepstral normalization (CDCN) is an example of the class of model-based algorithms that has been applied with success to several databases.
• Nevertheless, the CDCN algorithm has some limitations:
  • It ignores the effects of the environment on the variance of the speech distribution.
  • It achieves only limited accuracy at low SNRs.

Introduction 4
• The VTS algorithms described in this paper address these problems. Specifically, they:
  • Require only the segment of noisy speech to be recognized in order to perform compensation.
  • Model the effect of the environment on all the statistics of the PDF of speech.
  • Provide a unified treatment of the noise and channel re-estimation problem.
  • Use a better (Gaussian) model for the PDF of the log-spectra of the noise.

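A model of the environment 1
• We assume a model of the environment in which speech is corrupted by unknown additive stationary noise and linearly filtered by an unknown channel:
  $$P_y(\omega) = |H(\omega)|^2 P_x(\omega) + P_n(\omega)$$
  where $P_y(\omega)$ represents the power spectrum of the degraded speech, $P_x(\omega)$ is the power spectrum of the clean speech, $H(\omega)$ is the transfer function of the linear filter, and $P_n(\omega)$ is the power spectrum of the additive noise.
• In the log-spectral domain, with $y$, $x$, and $n$ the log-spectra of the degraded speech, clean speech, and noise, this relation can be expressed as:
  $$y = x + q + \log\left(1 + e^{\,n - x - q}\right)$$
• or in more general terms:
  $$y = x + g(x, n, q)$$
  where $q$ represents the effect of the linear filter in the log-spectral domain.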
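To make the log-spectral relation concrete, the following minimal sketch assumes it takes the form $y = x + q + \log(1 + e^{n - x - q})$ with natural logarithms, and checks it bin by bin against the power-spectrum form. The function names and sample values are illustrative assumptions, not part of the original slides.

```python
import math

def log_spectrum_noisy(x, n, q):
    """Per-bin log-spectral model: y = x + q + log(1 + exp(n - x - q))."""
    return [xk + qk + math.log1p(math.exp(nk - xk - qk))
            for xk, nk, qk in zip(x, n, q)]

def power_spectrum_noisy(p_x, p_n, h2):
    """Per-bin power-spectral model: P_y = |H|^2 * P_x + P_n."""
    return [hk * pxk + pnk for pxk, pnk, hk in zip(p_x, p_n, h2)]

# Hypothetical three-bin log-spectra (natural-log units).
x = [2.0, 0.5, -1.0]   # clean speech
n = [0.0, 0.0, 0.0]    # additive noise
q = [0.3, -0.2, 0.1]   # channel, q = log |H|^2

y = log_spectrum_noisy(x, n, q)
p_y = power_spectrum_noisy([math.exp(v) for v in x],
                           [math.exp(v) for v in n],
                           [math.exp(v) for v in q])

# The two formulations describe the same degraded signal.
assert all(abs(math.exp(a) - b) < 1e-9 for a, b in zip(y, p_y))
```

At high SNR ($x + q \gg n$) the logarithmic term vanishes and $y \approx x + q$; at low SNR it dominates and $y \approx n$.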

A model of the environment 2
• Moreno first suggested the use of a Taylor series to approximate the nonlinearity in $y = x + g(x, n, q)$, the log-spectral relation between noisy speech $y$, clean speech $x$, noise $n$, and channel $q$, although he applies it in the log-spectral rather than the cepstral domain.
• Since then, many related techniques have appeared that build upon this core idea. Many of them apply the VTS algorithm in the cepstral domain.
• They differ mainly in how speech and noise are modeled, and in which assumptions are made when deriving the nonlinearity to be approximated.

A model of the environment 3
• We also assume that the PDF of the log-spectra of the speech signal $x$ can be well represented by a mixture of multivariate Gaussian distributions:
  $$p(x) = \sum_{k=0}^{K-1} P[k]\, \mathcal{N}(x;\, \mu_{x,k}, \Sigma_{x,k})$$
• Furthermore, we assume that the statistics of the noise $n$ can be well represented by a single Gaussian $\mathcal{N}(n;\, \mu_n, \Sigma_n)$.
• The problem of compensation is twofold.
  • First, the noise mean $\mu_n$, the noise covariance $\Sigma_n$, and the channel $q$ need to be determined.
  • Second, the distribution of the noisy speech $y$, given the PDF of $x$ and the parameters $\mu_n$, $\Sigma_n$, and $q$, has to be computed.

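A model of the environment 4
• Because of the nonlinearity of the function $g(x, n, q)$ that relates noisy speech to clean speech, noise, and channel, both problems are non-trivial.
• Only for very simple forms of $g(x, n, q)$ can the distribution of the noisy speech be computed analytically.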

Description of the VTS algorithm 1
• The key to the VTS algorithm is to approximate the generic vector function $g(x)$ with a Taylor series:
  $$g(x) \approx g(x_0) + g'(x_0)(x - x_0) + \tfrac{1}{2}\, g''(x_0)(x - x_0)(x - x_0) + \cdots$$
  where $g(x_0)$ is the vector function evaluated at a particular vector point $x_0$. Similarly, $g'(x_0)$ is the matrix derivative of the vector function at the vector point $x_0$.
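As an illustration of the expansion, the sketch below assumes the per-bin nonlinearity $g(x, n, q) = q + \log(1 + e^{n - x - q})$, for which the matrix derivatives are diagonal and can be written per bin. The closed-form derivatives and the helper names are assumptions made for this example; the code builds a first-order approximation around an expansion point $(x_0, n_0)$.

```python
import math

def g(x, n, q):
    """Assumed per-bin nonlinearity in y = x + g(x, n, q)."""
    return q + math.log1p(math.exp(n - x - q))

def dg_dx(x, n, q):
    """Partial derivative of g with respect to x (one diagonal Jacobian entry)."""
    return -1.0 / (1.0 + math.exp(x + q - n))

def dg_dn(x, n, q):
    """Partial derivative of g with respect to n (one diagonal Jacobian entry)."""
    return 1.0 / (1.0 + math.exp(x + q - n))

def g_first_order(x, n, x0, n0, q):
    """First-order Taylor approximation of g around the point (x0, n0)."""
    return (g(x0, n0, q)
            + dg_dx(x0, n0, q) * (x - x0)
            + dg_dn(x0, n0, q) * (n - n0))

# Close to the expansion point the truncated series tracks the exact function.
exact = g(1.1, 0.05, 0.2)
approx = g_first_order(1.1, 0.05, x0=1.0, n0=0.0, q=0.2)
assert abs(exact - approx) < 1e-2
```

Because the two derivatives sum to zero, a common shift of speech and noise leaves $g$ unchanged to first order.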

Description of the VTS algorithm 2
• The Taylor expansion is exact everywhere when the order of the Taylor series is infinite.
• When $x$ has a Gaussian distribution, the function $g(x)$ can be expanded around the mean of $x$, and the expansion needs to be good only within a relatively narrow region around that mean.
• We take advantage of this fact to truncate the Taylor series after just a few terms.

Model speech statistics using VTS
• The figure compares the mean of the simulated noisy input signal with the mean computed using Taylor series expansions of order 0 and 2.
• As we see, the zeroth-order expansion provides a reasonably good approximation.
• However, at lower SNRs the second-order Taylor series expansion provides an even better approximation of the actual distribution.
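A comparison of this kind can be reproduced with a small simulation. The sketch below is not the experiment behind the original figure: it assumes a single log-spectral bin, independent Gaussian speech and noise, no channel ($q = 0$), and $g(x, n) = \log(1 + e^{n - x})$. It compares a Monte Carlo estimate of the noisy-speech mean with the means predicted by Taylor expansions of order 0 and 2 around the means.

```python
import math
import random

def g(x, n):
    """Assumed nonlinear term of y = x + g(x, n), with the channel q set to 0."""
    return math.log1p(math.exp(n - x))

def taylor_means(mu_x, var_x, mu_n, var_n):
    """Mean of y predicted by order-0 and order-2 expansions around the means."""
    s = 1.0 / (1.0 + math.exp(mu_x - mu_n))  # dg/dn at the means (dg/dx = -s)
    order0 = mu_x + g(mu_x, mu_n)
    curvature = s * (1.0 - s)                # d2g/dx2 = d2g/dn2 at the means
    # First-order terms and the cross term vanish in expectation when
    # x and n are independent, leaving only the two variance terms.
    order2 = order0 + 0.5 * curvature * (var_x + var_n)
    return order0, order2

def simulated_mean(mu_x, var_x, mu_n, var_n, samples=20000, seed=0):
    """Monte Carlo estimate of E[y] under the same assumptions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = rng.gauss(mu_x, math.sqrt(var_x))
        n = rng.gauss(mu_n, math.sqrt(var_n))
        total += x + g(x, n)
    return total / samples

# Sweep the speech mean relative to a fixed noise mean (a proxy for SNR).
for mu_x in (6.0, 3.0, 0.0):
    o0, o2 = taylor_means(mu_x, 1.0, 0.0, 0.25)
    mc = simulated_mean(mu_x, 1.0, 0.0, 0.25)
    print(f"mu_x={mu_x:4.1f}  simulated={mc:.3f}  order0={o0:.3f}  order2={o2:.3f}")
```

The variances used here are arbitrary; the gap between the two orders grows as the speech mean approaches the noise mean, where the curvature of $g$ is largest.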

Statistics of clean and noisy speech 1
• The statistics of clean speech can be modeled as a mixture of Gaussian distributions. The parameters describing these statistics are estimated using standard EM methods.
• The goal of the VTS algorithm is to estimate the PDF of noisy speech given the PDF of clean speech, a segment of noisy speech, and the Taylor series expansion that relates noisy speech to clean speech. (feature-based compensation)
• Alternatively, if HMMs are used to describe the PDF of clean speech, we can use the Taylor series approach to compute the noisy HMMs and perform recognition on the noisy signal itself. (model-based compensation)

Statistics of clean and noisy speech 2
• Recall that what VTS approximates is the nonlinear term $g(x, n, q)$ in the formula $y = x + g(x, n, q)$, where $y$ is the noisy speech, $x$ the clean speech, $n$ the noise, and $q$ the channel.
• Zeroth-order VTS expansion (VTS-0):
  • The VTS-0 expansion of $y$ results in a Gaussian distribution for the noisy speech when $x$ is Gaussian.
  • The mean vector and covariance matrix that represent the noisy speech statistics are computed from this expansion.
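The VTS-0 formulas themselves are not legible in this transcript. As a hedged sketch of how they would follow from the truncation, assume $y = x + g(x, n, q)$ is expanded around the clean-speech mean $\mu_x$ and a noise estimate $n_0$ (the symbol names are assumptions), with only the constant term kept:

```latex
\begin{align*}
y &\approx x + g(\mu_x, n_0, q) \\
\mu_y &= E[y] \approx \mu_x + g(\mu_x, n_0, q) \\
\Sigma_y &= E\left[(y - \mu_y)(y - \mu_y)^{T}\right] \approx \Sigma_x
\end{align*}
```

Under this truncation the environment shifts the mean of each Gaussian but leaves its covariance unchanged, so no noise variance is needed.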

Statistics of clean and noisy speech 3
• First-order VTS expansion (VTS-1):
  • The resulting distribution of the noisy speech $y$ is also Gaussian when the clean speech $x$ is Gaussian.
  • The new mean vector and, in a similar fashion, the new covariance matrix are computed from the first-order expansion. The covariance additionally depends on $\Sigma_n$, the variance of the noise.
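The first-order statistics can be sketched numerically. The example assumes diagonal covariances, independent speech and noise, the per-bin nonlinearity $g(x, n, q) = q + \log(1 + e^{n - x - q})$, and expansion around the means $(\mu_x, \mu_n)$. Under those assumptions the linear terms vanish in the mean, and the per-bin variance becomes $(1 + \partial g/\partial x)^2 \sigma_x^2 + (\partial g/\partial n)^2 \sigma_n^2$. The function name `vts1_stats` is hypothetical.

```python
import math

def vts1_stats(mu_x, var_x, mu_n, var_n, q):
    """First-order (VTS-1) mean and variance of noisy speech, per bin.

    All arguments are lists with one entry per log-spectral bin.
    """
    mu_y, var_y = [], []
    for mx, vx, mn, vn, qk in zip(mu_x, var_x, mu_n, var_n, q):
        s = 1.0 / (1.0 + math.exp(mx + qk - mn))      # dg/dn; dg/dx = -s
        g0 = qk + math.log1p(math.exp(mn - mx - qk))  # g at the means
        mu_y.append(mx + g0)
        # dy/dx = 1 - s and dy/dn = s weight the two variances.
        var_y.append((1.0 - s) ** 2 * vx + s ** 2 * vn)
    return mu_y, var_y

# Hypothetical two-bin example: bin 0 is at high SNR, bin 1 at low SNR.
mu_y, var_y = vts1_stats(mu_x=[10.0, -10.0], var_x=[1.0, 1.0],
                         mu_n=[0.0, 0.0], var_n=[0.25, 0.25],
                         q=[0.0, 0.0])
print(mu_y, var_y)
```

At high SNR the noisy variance stays close to the clean-speech variance, while at low SNR it collapses towards the noise variance, an effect a zeroth-order expansion cannot capture.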

The VTS algorithm
• The process is as follows:
  1. Obtain initial estimates of the noise mean $\mu_n$, the noise covariance $\Sigma_n$, and the channel $q$.
  2. Expand the function $g(x, n, q)$ around the mean vector of each Gaussian in the distribution of the clean speech $x$ and the current estimates of $\mu_n$ and $q$.
  3. Estimate the parameters of the distribution of the noisy speech $y$.
  4. Perform a single iteration of the EM algorithm to re-estimate the values of $\mu_n$ and $q$. In the case of VTS-1, $\Sigma_n$ is also re-estimated.
  5. If the likelihood of the observed noisy data has not converged, return to Step 2.
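The overall loop can be illustrated with a deliberately simplified sketch. It is not the EM re-estimation from the paper: it works on a single log-spectral bin, fixes the channel at $q = 0$, uses a VTS-0 expansion, and replaces the EM step with a plain step-halving line search over the noise mean that maximizes the likelihood of the noisy data. All names and values are hypothetical.

```python
import math
import random

def g(x, n):
    """Assumed nonlinear term of y = x + g(x, n); the channel q is fixed at 0."""
    return math.log1p(math.exp(n - x))

def vts0_noisy_gmm(weights, means, variances, n0):
    """Steps 2-3: shift each clean mean by g(mean, n0); variances unchanged."""
    return [(w, m + g(m, n0), v) for w, m, v in zip(weights, means, variances)]

def log_likelihood(data, gmm):
    """Log-likelihood of scalar observations under a 1-D Gaussian mixture."""
    total = 0.0
    for y in data:
        p = sum(w * math.exp(-0.5 * (y - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in gmm)
        total += math.log(p + 1e-300)
    return total

def estimate_noise_mean(data, weights, means, variances,
                        n_init=0.0, step=1.0, tol=1e-3, max_iter=200):
    """Iterate expand / evaluate / update until the likelihood stops improving."""
    n0 = n_init                                  # Step 1: initial estimate
    best = log_likelihood(data, vts0_noisy_gmm(weights, means, variances, n0))
    for _ in range(max_iter):
        if step < tol:                           # Step 5: convergence check
            break
        trials = []
        for cand in (n0 - step, n0 + step):
            gmm = vts0_noisy_gmm(weights, means, variances, cand)
            trials.append((log_likelihood(data, gmm), cand))
        ll, cand = max(trials)                   # Step 4 stand-in: line search
        if ll > best:
            n0, best = cand, ll
        else:
            step *= 0.5
    return n0

# Synthetic noisy data: two-Gaussian clean speech plus a constant noise level.
rng = random.Random(1)
weights, means, variances = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
true_noise = 3.0
data = []
for i in range(400):
    x = rng.gauss(means[i % 2], 1.0)
    data.append(x + g(x, true_noise))

print(estimate_noise_mean(data, weights, means, variances))
```

Only the noisy segment and the clean-speech mixture are needed, which mirrors the claim that compensation requires no environment-specific adaptation data. A faithful implementation would re-estimate the channel as well and use the EM update in place of the line search.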

Compensation of noisy speech
• Once the parameters of the distribution of the noisy speech $y$ are computed, an MMSE estimate is used to calculate the clean speech given the observed noisy speech.
• The results obtained depend on which order of Taylor series approximation is used. The zeroth-order approximation produces
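The zeroth-order formula is cut off in this transcript. A common way to write such an estimate, assumed here rather than taken from the slide, is $\hat{x} = y - \sum_k P[k \mid y]\, g(\mu_{x,k}, n_0, q)$: the observed vector minus the posterior-weighted correction of each Gaussian. The sketch below implements that form for diagonal covariances; `mmse_clean_estimate` and its arguments are hypothetical names.

```python
import math

def g(x, n, q):
    """Assumed per-bin nonlinearity in y = x + g(x, n, q)."""
    return q + math.log1p(math.exp(n - x - q))

def mmse_clean_estimate(y, weights, means, variances, n0, q):
    """Zeroth-order MMSE estimate of clean speech from one noisy frame.

    y, n0, q: per-bin lists. means, variances: one per-bin list per Gaussian
    of the clean-speech mixture (diagonal covariances assumed).
    """
    # Correction term g(mu_k, n0, q) for every Gaussian and every bin.
    shifts = [[g(m, nk, qk) for m, nk, qk in zip(mean, n0, q)] for mean in means]

    # Log posterior of each Gaussian given y, using the VTS-0 noisy statistics
    # (mean shifted by the correction, variance unchanged).
    log_post = []
    for w, mean, var, shift in zip(weights, means, variances, shifts):
        lp = math.log(w)
        for yk, m, v, s in zip(y, mean, var, shift):
            lp -= 0.5 * (math.log(2.0 * math.pi * v) + (yk - m - s) ** 2 / v)
        log_post.append(lp)

    # Normalize in a numerically stable way.
    top = max(log_post)
    post = [math.exp(lp - top) for lp in log_post]
    norm = sum(post)
    post = [p / norm for p in post]

    # Subtract the posterior-weighted correction from the observation.
    return [yk - sum(p * shift[k] for p, shift in zip(post, shifts))
            for k, yk in enumerate(y)]

# Hypothetical one-bin frame with a two-Gaussian clean-speech model.
x_hat = mmse_clean_estimate(y=[10.0], weights=[0.5, 0.5],
                            means=[[0.0], [10.0]], variances=[[1.0], [1.0]],
                            n0=[0.0], q=[0.0])
print(x_hat)
```

For a frame well above the noise level the correction is negligible and the estimate stays close to the observation.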
