Improved Speech Recognition using Acoustic and Lexical Correlates of Pitch Accent in a N-Best Rescoring Framework S. Ananthakrishnan and S. Narayanan Department of EE, University of Southern California ICASSP 2007 Reporter: Shih-Hung Liu 2007/05/14
Outline • Introduction • Data Corpus and Baseline ASR • Prosody model • Acoustic-prosodic model • De-lexicalized prosody sequence model • Lexical prosody model • Experimental results • Conclusions
Introduction • Most statistical speech recognition systems make use of segment-level features, derived mainly from spectral envelope characteristics of the signal, but ignore supra-segmental cues that carry additional information likely to be useful for speech recognition • These cues, which constitute the prosody of the utterance and occur at the syllable, word and utterance level, are closely related to the lexical and syntactic organization of the utterance • In this paper, we explore the use of acoustic and lexical correlates of a subset of these cues in order to improve recognition performance on a read-speech corpus
Data Corpus • The Boston University Radio News Corpus (BU-RNC) consists of about 3 hours of read speech with 6 speakers (3 female, 3 male). • We use this corpus because it contains prosodic annotations in the form of ToBI-style labels for pitch accents, phrase boundaries and lexical break indices
Baseline ASR • We used the University of Colorado SONIC continuous speech recognizer to develop the baseline ASR • We adapted context-dependent triphone acoustic models from the Wall Street Journal task with data from the training partitions of the BU-RNC using the tree-based MAPLR algorithm • We used PMVDR (Perceptual Minimum Variance Distortionless Response) features derived from the acoustic signal to train these models • A standard back-off trigram language model with Kneser-Ney smoothing was trained with a mixture of text from the WSJ, HUB-4 and BU datasets
Prosody model • We augment the standard ASR equation to include prosodic information as follows:
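A plausible reconstruction of the augmented MAP criterion, with A the spectral observations, Ap the acoustic-prosodic observations, W the word sequence and P the pitch-accent sequence (a sketch of the standard augmentation, not necessarily the paper's exact notation):

$$(\hat{W}, \hat{P}) = \arg\max_{W,\,P} \; p(A, A_p \mid W, P)\, p(W, P)$$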
Prosody model • Based on conditional independence assumptions, this decomposes into three component models: an acoustic-prosodic model, a de-lexicalized prosody sequence model, and a lexical prosody model
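In an N-best rescoring setting, these three component scores are typically combined log-linearly with the baseline acoustic and language model scores. A sketch of such a rescoring criterion follows; the interpolation weights λi are an assumption for illustration, not taken from the paper:

$$\text{score}(W, P) = \log p(A \mid W) + \lambda_1 \log p(W) + \lambda_2 \log p(A_p \mid P) + \lambda_3 \log p(P) + \lambda_4 \log p(P \mid W)$$

where p(Ap | P) is the acoustic-prosodic model, p(P) the de-lexicalized prosody sequence model, and p(P | W) the lexical prosody model.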
Acoustic-prosodic model • Acoustic-prosodic features that make up Ap include: • 1. F0: F0-range features (max-min, max-avg, avg-min) • 2. Energy: within-syllable energy range features (max-min, avg-min) • 3. Timing: syllable nucleus duration • These features were normalized to minimize effects of speaker- or nucleus-specific variation. • The model is trained as a feedforward neural network (MLP) with 8 input nodes, 25 hidden nodes and 2 output nodes with softmax activation, whose outputs are interpreted as posterior probabilities
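As an illustration, a minimal sketch of such a syllable-level accent classifier, assuming the eight normalized acoustic-prosodic features are already extracted per syllable (the synthetic data and the use of scikit-learn are assumptions, not details from the paper):

```python
# Illustrative sketch of the acoustic-prosodic accent classifier: an MLP with
# 8 inputs and 25 hidden units whose outputs are read as posteriors p(accent | Ap).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(5000, 8))                        # stand-in for normalized F0/energy/duration features
y_train = (X_train[:, 0] + X_train[:, 5] > 0).astype(int)   # synthetic accent labels (1 = accented)

# One hidden layer of 25 units, matching the 8-25-2 topology on the slide.
# (scikit-learn uses a single logistic output unit for binary problems,
# which is equivalent to a 2-node softmax layer.)
clf = MLPClassifier(hidden_layer_sizes=(25,), activation="logistic",
                    max_iter=500, random_state=0)
clf.fit(X_train, y_train)

# Posterior probability of a pitch accent for a new syllable's feature vector.
posterior = clf.predict_proba(rng.normal(size=(1, 8)))[0, 1]
print(f"p(accent | Ap) = {posterior:.3f}")
```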
De-lexicalized prosody sequence model • The term p(P) imposes constraints on the sequence of pitch accent events P • Since P has a binary vocabulary, this model can be estimated robustly from small amounts of training data • We modeled this component as a 4-gram back-off language model over the pitch accent labels obtained from the training data
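A minimal sketch of how such a sequence model over binary accent labels can be estimated and applied, using add-one smoothing for brevity instead of the back-off estimate used in the paper (purely illustrative):

```python
# Illustrative 4-gram model over binary pitch-accent labels (0 = unaccented, 1 = accented).
import math
from collections import Counter

def train_4gram(sequences):
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] * 3 + list(seq) + ["</s>"]
        for i in range(3, len(padded)):
            ctx = tuple(padded[i - 3:i])
            ngrams[ctx + (padded[i],)] += 1
            contexts[ctx] += 1
    return ngrams, contexts

def log_prob(seq, ngrams, contexts, vocab_size=3):  # vocabulary: 0, 1 and </s>
    padded = ["<s>"] * 3 + list(seq) + ["</s>"]
    lp = 0.0
    for i in range(3, len(padded)):
        ctx = tuple(padded[i - 3:i])
        lp += math.log((ngrams[ctx + (padded[i],)] + 1) / (contexts[ctx] + vocab_size))
    return lp

accent_sequences = [[1, 0, 0, 1, 0], [0, 1, 0, 0, 1, 0]]   # toy training data
ngrams, contexts = train_4gram(accent_sequences)
print(log_prob([1, 0, 0, 1], ngrams, contexts))            # log p(P) under the toy model
```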
Lexical prosody model • Since we built prosody models at the syllable level, we first decomposed the word sequence W into the corresponding syllable sequence S using an automatic syllabifier • We have previously shown that canonical lexical stress labels exhibit high correlation with pitch accents; these stress labels provide an additional feature stream L • The lexical prosody model then becomes p(P|W) = p(P|S,L)
Conclusions • In this paper, we presented an N-best re-ranking scheme using a prosody model that is decoupled from the main ASR system. • The re-ranking method achieved a modest but significant relative WER reduction of 1.3% compared to the baseline recognition system.
Maximum Entropy Confidence Estimation for Speech Recognition C. White, JHU J. Droppo, A. Acero, J. Odell, Microsoft Research ICASSP 2007 Reporter: Shih-Hung Liu 2007/05/14
Outline • Introduction • Baseline System • Data set • Observation Selection • GMM Baseline • Maximum Entropy System • A Simple ME System • Improved Results with Binning • Quadratic Observation Vector • Incorporating Augmented Features • Experiments • Conclusions
Introduction • For many automatic speech recognition (ASR) applications, it is useful to predict the likelihood that the recognized string contains an error • The standard confidence estimation design consists of a classifier that predicts the probability of error using several observations taken from the recognition lattice emitted by the ASR engine • If a rich lattice is available, it can be renormalized to provide a good confidence estimate
Introduction • The first improvement provides significant gains in overall accuracy, as well as good generalization behavior. This is accomplished with the introduction of a maximum entropy classifier • The second improvement allows the system to provide good confidence estimates, even when a rich recognition lattice is not available • The solution presented here is to produce alternate features designed to contain information similar to what has been pruned from the lattice
Baseline System • Our goal was to build a system that generates good confidence estimates. This means that it should work transparently across a variety of recognition grammars. • It should be robust to duration, speaker, channel, and other irrelevant factors • We merged existing data to construct a new corpus. • It contains over 250,000 utterances pulled from source corpora covering different acoustic channels, additive noise, and accents
Observation Selection • [Table of candidate observations not reproduced] Lattice features are denoted with an *, and augmented-set features with a **; features used in the 'Unq' case are marked with a 'U', and those in the 'Alt' case with an 'A'
GMM Baseline • The baseline system consists of two GMMs, one that models correctly recognized utterances (c) and one that models incorrectly recognized utterances (i) • Both c and i models use a full covariance matrix and have been trained using the expectation maximization (EM) algorithm
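A minimal sketch of such a two-GMM confidence baseline, assuming the per-utterance observation vectors are already computed (the number of mixture components, class priors, and synthetic data are assumptions, not details from the paper):

```python
# Illustrative two-GMM confidence baseline: one full-covariance GMM for correctly
# recognized utterances (c) and one for misrecognized utterances (i), both fit with EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_correct = rng.normal(loc=0.5, size=(2000, 11))    # stand-in observation vectors
X_incorrect = rng.normal(loc=-0.5, size=(500, 11))

gmm_c = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(X_correct)
gmm_i = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(X_incorrect)

prior_c = len(X_correct) / (len(X_correct) + len(X_incorrect))  # class priors from counts
prior_i = 1.0 - prior_c

def confidence(x):
    """Posterior p(correct | x) from the two class-conditional GMM likelihoods."""
    log_c = gmm_c.score_samples(x[None, :])[0] + np.log(prior_c)
    log_i = gmm_i.score_samples(x[None, :])[0] + np.log(prior_i)
    m = max(log_c, log_i)
    return np.exp(log_c - m) / (np.exp(log_c - m) + np.exp(log_i - m))

print(confidence(rng.normal(size=11)))
```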
Maximum Entropy System • Our model is of the form p(y|x). Here, y is a discrete random variable representing the class ‘correct’ or ‘incorrect’, and x is a vector of discrete or continuous random variables
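For reference, the standard conditional maximum entropy (log-linear) form of such a model, with feature functions f_i(x, y) and learned weights λ_i, is:

$$p(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$$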
A Simple ME System (11c) • There are four feature functions created for each dimension of the observation vector • Because our trainer does not accept negative features, we create symmetric features based on whether the original observation was positive or negative • For each of these, another pair of symmetric features is created: one for the correct class and one for the incorrect class • After adding one indicator feature per class to model a truth-based prior, there are a total of 42 and 46 features for the 'Unq' and 'Alt' cases respectively
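A minimal sketch of how the symmetric positive/negative, per-class feature functions described above could be built from an observation vector (the dictionary representation is an assumption; the trainer's actual feature format is not given in the slides):

```python
# Illustrative construction of the symmetric, non-negative MaxEnt features:
# for each observation dimension, a positive-part and a negative-part feature,
# each tied to one of the two classes, plus one prior indicator per class.
def maxent_features(x, y):
    """x: observation vector (list of floats), y: 'correct' or 'incorrect'."""
    feats = {}
    for d, val in enumerate(x):
        feats[(d, "pos", y)] = max(val, 0.0)    # non-negative by construction
        feats[(d, "neg", y)] = max(-val, 0.0)
    feats[("prior", y)] = 1.0                   # class indicator (truth-based prior)
    return feats

# Only the features tied to the given class fire for one (x, y) pair (11*2 + 1 = 23);
# across both classes the full feature set has 11*4 + 2 = 46 entries, the 'Alt' count.
print(len(maxent_features([0.3, -1.2] + [0.0] * 9, "correct")))   # 23
```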
Improved Results with Binning (11b) • This system uses the base set of 10 and 11 observation dimensions. But, instead of using features that are linear functions of the observations, it creates a set of histogram-based binary features • As a result, they allow the model to take advantage of nonlinear relationships in the data • These features are created by sorting each of the observation dimensions by value and creating bins based on a uniform-occupancy partitioning. • With a maximum of 100 bins (chosen experimentally) and a minimum occupancy of 100, there were 2246 and 1984 binary MaxEnt feature functions for ‘Alt’ and ’Unq’ respectively
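A minimal sketch of the uniform-occupancy binning for a single observation dimension (the indicator encoding and use of numpy quantiles are assumptions; the bin limits follow the slide):

```python
# Illustrative uniform-occupancy binning of one observation dimension into binary
# indicator features: at most 100 bins, each with a minimum occupancy of 100 points.
import numpy as np

def uniform_occupancy_edges(values, max_bins=100, min_occupancy=100):
    """Interior bin edges such that each bin holds roughly the same number of points."""
    n_bins = min(max_bins, len(values) // min_occupancy)
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def bin_indicator(value, edges):
    """One-hot indicator over the bins: a nonlinear function of the raw observation."""
    onehot = np.zeros(len(edges) + 1)
    onehot[int(np.searchsorted(edges, value))] = 1.0
    return onehot

train_values = np.random.default_rng(0).normal(size=25000)
edges = uniform_occupancy_edges(train_values)        # 99 edges -> 100 bins
print(bin_indicator(0.7, edges).argmax(), len(edges) + 1)
```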
Quadratic Observation Vector (121b) • This system attempts to mimic the full-covariance aspect of the GMM system • Instead of the base set of 10 and 11 observation dimensions, it uses their outer products, giving 100 and 121 dimensions • After binning, with the maximum bin count and minimum occupancy set as above, there were 26,414 and 21,556 features in the two systems
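A minimal sketch of the quadratic expansion for the 11-dimensional 'Alt' observation vector (illustrative):

```python
# Illustrative quadratic expansion: the outer product of the base observation vector
# with itself yields 11 * 11 = 121 dimensions, which are then binned as before.
import numpy as np

x = np.random.default_rng(0).normal(size=11)   # base 'Alt' observation vector
quadratic = np.outer(x, x).ravel()             # 121-dimensional quadratic observation
print(quadratic.shape)                          # (121,)
```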
Incorporating Augmented Features (11+b) • This system augments the original feature set with additional lattice-based observations • Most of the lattices generated by our engine on this test have a very small depth, with only 1 or 2 alternates • This system has 14 observation dimensions in both cases, producing approximately 2,800 features after binning as above
Conclusions • This paper describes how a maximum entropy model can be used to generate confidence scores for a speech recognition engine on an array of grammars • Results on an evaluation set of 25,991 examples that span 280 grammars demonstrate that the methods of observation selection, feature generation, and model training in this paper provide a significant improvement over a standard baseline