Improved Speech Recognition using Acoustic and Lexical Correlates of Pitch Accent in a N-Best Rescoring Framework S. Ananthakrishnan and S. Narayanan Department of EE, University of Southern California ICASSP 2007 Reporter: Shih-Hung Liu 2007/05/14
Outline • Introduction • Data Corpus and Baseline ASR • Prosody model • Acoustic-prosodic model • De-lexicalized prosody sequence model • Lexical prosody model • Experimental results • Conclusions
Introduction • Most statistical speech recognition systems make use of segment-level features, derived mainly from spectral envelope characteristics of the signal, but ignore supra-segmental cues that carry additional information likely to be useful for speech recognition • These cues, which constitute the prosody of the utterance and occur at the syllable, word and utterance level, are closely related to the lexical and syntactic organization of the utterance • In this paper, we explore the use of acoustic and lexical correlates of a subset of these cues in order to improve recognition performance on a read-speech corpus
Data Corpus • The Boston University Radio News Corpus (BU-RNC) consists of about 3 hours of read speech with 6 speakers (3 female, 3 male). • We use this corpus because it contains prosodic annotations in the form of ToBI-style labels for pitch accents, phrase boundaries and lexical break indices
Baseline ASR • We used the University of Colorado SONIC continuous speech recognizer to develop the baseline ASR • We adapted context-dependent triphone acoustic models from the Wall Street Journal task with data from the training partitions of the BU-RNC using the tree-based MAPLR algorithm • We used PMVDR (Perceptual Minimum Variance Distortionless Response) features derived from the acoustic signal to train these models • A standard back-off trigram language model with Kneser-Ney smoothing was trained with a mixture of text from the WSJ, HUB-4 and BU datasets
Prosody model • We augment the standard ASR equation to include prosodic information as follows:
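A plausible reconstruction of the augmented MAP criterion, with A the spectral observations, Ap the acoustic-prosodic observations, W the word sequence and P the pitch-accent sequence (a sketch of the standard augmentation, not necessarily the paper's exact notation):

$$(\hat{W}, \hat{P}) = \arg\max_{W,\,P} \; p(A, A_p \mid W, P)\, p(W, P)$$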
Prosody model • Based on conditional independence assumptions, this decomposes into three component models: an acoustic-prosodic model, a de-lexicalized prosody sequence model, and a lexical prosody model
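In an N-best rescoring setting, these three component scores are typically combined log-linearly with the baseline acoustic and language model scores. A sketch of such a rescoring criterion follows; the interpolation weights λi are an assumption for illustration, not taken from the paper:

$$\text{score}(W, P) = \log p(A \mid W) + \lambda_1 \log p(W) + \lambda_2 \log p(A_p \mid P) + \lambda_3 \log p(P) + \lambda_4 \log p(P \mid W)$$

where p(Ap | P) is the acoustic-prosodic model, p(P) the de-lexicalized prosody sequence model, and p(P | W) the lexical prosody model.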
Acoustic-prosodic model • Acoustic-prosodic features that make up Ap include: • 1. F0: F0-range features (max-min, max-avg, avg-min) • 2. Energy: within-syllable energy range features (max-min, avg-min) • 3. Timing: syllable nucleus duration • These features were normalized to minimize effects of speaker- or nucleus-specific variation. • The model is trained as a feedforward neural network (MLP) with 8 input nodes, 25 hidden nodes and 2 output nodes with softmax activation, whose outputs are interpreted as posterior probabilities
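As an illustration, a minimal sketch of such a syllable-level accent classifier, assuming the eight normalized acoustic-prosodic features are already extracted per syllable (the synthetic data and the use of scikit-learn are assumptions, not details from the paper):

```python
# Illustrative sketch of the acoustic-prosodic accent classifier: an MLP with
# 8 inputs and 25 hidden units whose outputs are read as posteriors p(accent | Ap).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 8))                        # stand-in for normalized F0/energy/duration features
y_train = (X_train[:, 0] + X_train[:, 5] > 0).astype(int)   # synthetic accent labels (1 = accented)

# One hidden layer of 25 units, matching the 8-25-2 topology on the slide.
# (scikit-learn uses a single logistic output unit for binary problems,
# which is equivalent to a 2-node softmax layer.)
clf = MLPClassifier(hidden_layer_sizes=(25,), activation="logistic",
                    max_iter=500, random_state=0)
clf.fit(X_train, y_train)

# Posterior probability of a pitch accent for a new syllable's feature vector.
posterior = clf.predict_proba(rng.normal(size=(1, 8)))[0, 1]
print(f"p(accent | Ap) = {posterior:.3f}")
```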
De-lexicalized prosody sequence model • The term p(P) imposes constraints on the sequence of pitch accent events P • Since P has a binary vocabulary, this model can be estimated robustly from small amounts of training data • We modeled this component as a 4-gram back-off language model over the pitch accent labels obtained from the training data
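A minimal sketch of how such a sequence model over binary accent labels can be estimated and applied, using add-one smoothing for brevity instead of the back-off estimate used in the paper (purely illustrative):

```python
# Illustrative 4-gram model over binary pitch-accent labels (0 = unaccented, 1 = accented).
import math
from collections import Counter

def train_4gram(sequences):
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] * 3 + list(seq) + ["</s>"]
        for i in range(3, len(padded)):
            ctx = tuple(padded[i - 3:i])
            ngrams[ctx + (padded[i],)] += 1
            contexts[ctx] += 1
    return ngrams, contexts

def log_prob(seq, ngrams, contexts, vocab_size=3):  # vocabulary: 0, 1 and </s>
    padded = ["<s>"] * 3 + list(seq) + ["</s>"]
    lp = 0.0
    for i in range(3, len(padded)):
        ctx = tuple(padded[i - 3:i])
        lp += math.log((ngrams[ctx + (padded[i],)] + 1) / (contexts[ctx] + vocab_size))
    return lp

accent_sequences = [[1, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0]]   # toy training data
ngrams, contexts = train_4gram(accent_sequences)
print(log_prob([1, 0, 0, 1], ngrams, contexts))            # log p(P) under the toy model
```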
Lexical prosody model • Since we built prosody models at the syllable level, we first decomposed the word sequence W into the corresponding syllable sequence S using an automatic syllabifier • We have previously shown that canonical lexical stress labels exhibit high correlation with pitch accents; these stress labels provide an additional feature stream L • The lexical prosody model then becomes p(P|W) = p(P|S,L)
Conclusions • In this paper, we presented an N-best re-ranking scheme using a prosody model that is decoupled from the main ASR system. • The re-ranking method achieved a modest but significant relative WER reduction of 1.3% compared to the baseline recognition system.
Maximum Entropy Confidence Estimation for Speech Recognition C. White, JHU J. Droppo, A. Acero, J. Odell, Microsoft Research ICASSP 2007 Reporter: Shih-Hung Liu 2007/05/14
Outline • Introduction • Baseline System • Data set • Observation Selection • GMM Baseline • Maximum Entropy System • A Simple ME System • Improved Results with Binning • Quadratic Observation Vector • Incorporating Augmented Features • Experiments • Conclusions
Introduction • For many automatic speech recognition (ASR) applications, it is useful to predict the likelihood that the recognized string contains an error • The standard confidence estimation design consists of a classifier that predicts the probability of error using several observations taken from the recognition lattice emitted by the ASR engine • If a rich lattice is available, it can be renormalized to provide a good confidence estimate
Introduction • The first improvement provides significant gains in overall accuracy, as well as good generalization behavior. This is accomplished with the introduction of a maximum entropy classifier • The second improvement allows the system to provide good confidence estimates, even when a rich recognition lattice is not available • The solution presented here is to produce alternate features designed to contain information similar to what has been pruned from the lattice
Baseline System • Our goal was to build a system that generates good confidence estimates. This means that it should work transparently across a variety of recognition grammars. • It should be robust to duration, speaker, channel, and other irrelevant factors • We merged existing data to construct a new corpus. • It contains over 250,000 utterances pulled from source corpora covering different acoustic channels, additive noise, and accents
Observation Selection • [Table of candidate observations not reproduced] Lattice features are denoted with an *, and augmented-set features with a **; features used in the 'Unq' case are marked with a 'U', and those in the 'Alt' case with an 'A'
GMM Baseline • The baseline system consists of two GMMs, one that models correctly recognized utterances (c) and one that models incorrectly recognized utterances (i) • Both c and i models use a full covariance matrix and have been trained using the expectation maximization (EM) algorithm
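A minimal sketch of such a two-GMM confidence baseline, assuming the per-utterance observation vectors are already computed (the number of mixture components, class priors, and synthetic data are assumptions, not details from the paper):

```python
# Illustrative two-GMM confidence baseline: one full-covariance GMM for correctly
# recognized utterances (c) and one for misrecognized utterances (i), both fit with EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_correct = rng.normal(loc=0.5, size=(2000, 11))    # stand-in observation vectors
X_incorrect = rng.normal(loc=-0.5, size=(500, 11))

gmm_c = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(X_correct)
gmm_i = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(X_incorrect)

prior_c = len(X_correct) / (len(X_correct) + len(X_incorrect))  # class priors from counts
prior_i = 1.0 - prior_c

def confidence(x):
    """Posterior p(correct | x) from the two class-conditional GMM likelihoods."""
    log_c = gmm_c.score_samples(x[None, :])[0] + np.log(prior_c)
    log_i = gmm_i.score_samples(x[None, :])[0] + np.log(prior_i)
    m = max(log_c, log_i)
    return np.exp(log_c - m) / (np.exp(log_c - m) + np.exp(log_i - m))

print(confidence(rng.normal(size=11)))
```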
Maximum Entropy System • Our model is of the form p(y|x). Here, y is a discrete random variable representing the class ‘correct’ or ‘incorrect’, and x is a vector of discrete or continuous random variables
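For reference, the standard conditional maximum entropy (log-linear) form of such a model, with feature functions f_i(x, y) and learned weights λ_i, is:

$$p(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$$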
A Simple ME System (11c) • There are four feature functions created for each dimension of the observation vector • Because our trainer does not accept negative features, we create symmetric features based on whether the original observation was positive or negative • For each of these, another pair of symmetric features is created: one for the correct class and one for the incorrect class • After adding one indicator feature per class to model a truth-based prior, there are a total of 42 and 46 features for the 'Unq' and 'Alt' cases respectively
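A minimal sketch of how the symmetric positive/negative, per-class feature functions described above could be built from an observation vector (the dictionary representation is an assumption; the trainer's actual feature format is not given in the slides):

```python
# Illustrative construction of the symmetric, non-negative MaxEnt features:
# for each observation dimension, a positive-part and a negative-part feature,
# each tied to one of the two classes, plus one prior indicator per class.
def maxent_features(x, y):
    """x: observation vector (list of floats), y: 'correct' or 'incorrect'."""
    feats = {}
    for d, val in enumerate(x):
        feats[(d, "pos", y)] = max(val, 0.0)    # non-negative by construction
        feats[(d, "neg", y)] = max(-val, 0.0)
    feats[("prior", y)] = 1.0                   # class indicator (truth-based prior)
    return feats

# Only the features tied to the given class fire for one (x, y) pair (11*2 + 1 = 23);
# across both classes the full feature set has 11*4 + 2 = 46 entries, the 'Alt' count.
print(len(maxent_features([0.3, -1.2] + [0.0] * 9, "correct")))   # 23
```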
Improved Results with Binning (11b) • This system uses the base set of 10 and 11 observation dimensions. But, instead of using features that are linear functions of the observations, it creates a set of histogram-based binary features • As a result, they allow the model to take advantage of nonlinear relationships in the data • These features are created by sorting each of the observation dimensions by value and creating bins based on a uniform-occupancy partitioning. • With a maximum of 100 bins (chosen experimentally) and a minimum occupancy of 100, there were 2246 and 1984 binary MaxEnt feature functions for ‘Alt’ and ’Unq’ respectively
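A minimal sketch of the uniform-occupancy binning for a single observation dimension (the indicator encoding and use of numpy quantiles are assumptions; the bin limits follow the slide):

```python
# Illustrative uniform-occupancy binning of one observation dimension into binary
# indicator features: at most 100 bins, each with a minimum occupancy of 100 points.
import numpy as np

def uniform_occupancy_edges(values, max_bins=100, min_occupancy=100):
    """Interior bin edges such that each bin holds roughly the same number of points."""
    n_bins = min(max_bins, len(values) // min_occupancy)
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def bin_indicator(value, edges):
    """One-hot indicator over the bins: a nonlinear function of the raw observation."""
    onehot = np.zeros(len(edges) + 1)
    onehot[int(np.searchsorted(edges, value))] = 1.0
    return onehot

train_values = np.random.default_rng(0).normal(size=25000)
edges = uniform_occupancy_edges(train_values)        # 99 edges -> 100 bins
print(bin_indicator(0.7, edges).argmax(), len(edges) + 1)
```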
Quadratic Observation Vector (121b) • This system attempts to mimic the full-covariance aspect of the GMM system • Instead of the base set of 10 and 11 observation dimensions, it uses their outer products, giving 100 and 121 dimensions • After binning, with the maximum bin count and minimum occupancy set as above, there were 26,414 and 21,556 features in the two systems
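A minimal sketch of the quadratic expansion for the 11-dimensional 'Alt' observation vector (illustrative):

```python
# Illustrative quadratic expansion: the outer product of the base observation vector
# with itself yields 11 * 11 = 121 dimensions, which are then binned as before.
import numpy as np

x = np.random.default_rng(0).normal(size=11)   # base 'Alt' observation vector
quadratic = np.outer(x, x).ravel()             # 121-dimensional quadratic observation
print(quadratic.shape)                          # (121,)
```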
Incorporating Augmented Features (11+b) • This system augments the original feature set with additional lattice-based observations • Most of the lattices generated by our engine on this test have a very small depth, with only 1 or 2 alternates • This system has 14 observation dimensions in both cases, producing approximately 2,800 features after binning as above
Conclusions • This paper describes how a maximum entropy model can be used to generate confidence scores for a speech recognition engine on an array of grammars • Results on an evaluation set of 25,991 examples that span 280 grammars demonstrate that the methods of observation selection, feature generation, and model training in this paper provide a significant improvement over a standard baseline