Discriminatively Trained Region Dependent Feature Transforms for Speech Recognition

Discriminatively Trained Region Dependent Feature Transforms for Speech Recognition Bing Zhang, Spyros Matsoukas, Richard Schwartz BBN Technologies ICASSP 2006 Presented by Shih-Hung Liu 2006/06/05

Outline • Introduction • MPE-HLDA and fMPE reviewed • Region dependent transform • Corpus • Experimental results • Conclusions

Introduction • Discriminative feature optimization has been a research focus in the last few years • Various works have been published, which share the idea of training a feature transform under discriminative criteria such as MPE, MMI, and MCE ,(MBRDT) • By using discriminative criteria, the feature optimization is better correlated with the reduction of recognition errors, hence offers better accuracy than standard feature transforms like LDA+MLLT or HLDA

Introduction • In MPE-HLDA, a global linear projection is optimized, selecting compact features from concatenated cepstral coefficients across several frames (long span features) • Being a linear projection, it offers moderate gains over LDA+MLLT but is quite limited

Introduction • Feature space MPE (fMPE) is a nonlinear transform also trained using MPE criterion • A global Gaussian mixture model (GMM) is trained, and the posterior vectors of those Gaussians are projected into a lower dimensional space, in order to correct some predefined features • The idea of using Gaussian posteriors in fMPE is not new. A similar algorithm called SPLICE has been used previously for noise compensation

Introduction • In this work, we combine the idea of piece-wise transforms, as in fMPE and SPLICE, together with the idea of using long span feature projections, as in MPE-HLDA, forming a kind of more general transformation, which is referred to as region dependent transform (RDT) • In RDT, either a linear or a nonlinear feature transform can be used in each region • As a special case of RDT, a linear projection of long span features is used for each region. It is referred to as region dependent linear transform (RDLT)

MPE-HLDA Reviewed • Both MPE-HLDA and fMPE aims at optimizing the objective function of MPE. • The feature transform in MPE-HLDA is a global linear projection: where is long span feature obtained by concatenating several frames of cepstral features from to

fMPE Reviewed • In fMPE, the feature transform is where is a projection matrix, and is an N-dimensional vector of Gaussian posterior probabilities, computed using a GMM trained in the same space of , is some low dimensional feature vector Typically is formed by linearly projecting

fMPE Reviewed • fMPE feature transform can be rewritten as where denotes the ith column of M, and is theith element of • fMPE can be viewed as weighted sum of some feature vectors, each obtained by shifting by a constant term • Since the posterior tells which Gaussian the frame is close to, fMPE can be viewed as a region dependent feature correction function. Here the division of regions in the acoustic space is performed by the GMM

Region Dependent Transform • It seems at a first glance that two very different function are used in MPE-HLDA and fMPE • We rewrite MPE-HLDA in a similar form as fMPE • From above, it is natural to come up with a generalized form of region dependent linear transform (RDLT) in which both linear projection and bias are specific to region (or equivalently to Gaussian )

Region Dependent Transform • Conceptually, can be further generalized by removing the linear constraint, which leads to the general region dependent transform as • In this paper, we will only focus on the regionally linear transform where is a vector-to-vector mapping function whose parameters depend on region

Region Dependent Transform • As variations of RDLT, one can think of using parameter tying for , since it is possible that fewer distinctive projections are really needed than biases • The tying could be implemented by clustering the Gaussians in the GMM into fewer groups, as we normally do for fast Gaussian computation • Within each group, parameters of the projections could be shared. The parameter tied RDLT (tRDLT) can be express as

Corpus • Training corpus • 2300hrs EARS R04 CTS • 370hrs Switchboard and Callhome data • 1970hrs Fisher data • Testing data • RT03 evaluation set (Eval03) • 3hrs Switchboard-II and 3hrs Fisher data • RT04 development set (Dev04) • 3hrs Fisher data

Experimental Results • Projections vs. Offsets MPE-HLDA fMPE without context expansion

Experimental Results

Conclusions • We have introduced the concept of region dependent transform by extending MPE-HLDA with the idea of dividing the acoustic space into regions and having a feature transform for each region • Showing that both fMPE and MPE-HLDA are special cases of the region dependent linear transform

Discriminatively Trained Region Dependent Feature Transforms for Speech Recognition

Discriminatively Trained Region Dependent Feature Transforms for Speech Recognition

Presentation Transcript

Distinctive Feature Detection For Automatic Speech Recognition

Speech Recognition

Tandem Connectionist Feature Extraction for Conversational Speech Recognition

Feature Generation: Linear Transforms

Using Speech Recognition for Speech Therapy

Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition

A Discriminatively Trained, Multiscale , Deformable Part Model

Speech recognition

Large-Margin Feature Adaptation for Automatic Speech Recognition

Contourlet Transforms For Feature Detection

Speech Recognition

Speech Recognition

Discriminative Feature Optimization for Speech Recognition

Linear Discriminant Feature Extraction for Speech Recognition

Articulatory Feature-Based Speech Recognition

Articulatory Feature-Based Speech Recognition

Object Detection with Discriminatively Trained Part Based Models

SPEECH RECOGNITION:

Single and Multi Channel Feature Enhancement for Distant Speech Recognition

Speech Recognition

Articulatory Feature-Based Speech Recognition

Articulatory Feature-Based Speech Recognition