Nonparametric Bayesian Approaches for Acoustic Modeling in Speech Recognition Amir Harati Institute for Signal and Information Processing Temple University Philadelphia, Pennsylvania, USA
Abstract Recently, nonparametric Bayesian (NPB) methods have become a popular alternative to parametric Bayesian approaches. In NPB approaches, we do not fix the model complexity a priori; instead, we place a prior over the complexity (or model structure). In this proposal, our goal is to investigate the application of NPB modeling to acoustic modeling. Three important problems fundamental to the acoustic modeling component of a large-vocabulary, speaker-independent continuous speech recognition system are addressed: (1) automatic discovery of sub-word acoustic units; (2) statistical modeling of sub-word acoustic units; and (3) supervised training algorithms for nonparametric acoustic models. We propose an NPB algorithm based on an ergodic Hierarchical Dirichlet Process HMM (HDP-HMM) that automatically segments and clusters the speech signal. We apply this algorithm to the problems of automatic discovery of acoustic sub-word units and generation of a pronunciation lexicon. A new type of HDP-HMM is presented that preserves the useful left-to-right properties of a conventional HMM, yet still supports automated learning of the structure and complexity from data. We will introduce an NPB algorithm for training these models for continuous speech recognition that allows us to infer different HDP-HMM models and segment the training data simultaneously. Moreover, an NPB approach is introduced that replaces the phonetic decision tree used in state-of-the-art speech recognizers to tie triphone states.
Outline • Nonparametric Bayesian Models • Acoustic Modeling in Speech Recognition • Speech Segmentation • Automatic Discovery of Acoustic Units in Speech Recognition • Left-to-Right HDP-HMM Models • Nonparametric Bayesian Training • Summary of Contributions • Research Plan
Nonparametric Bayesian • Parametric vs. Nonparametric • Model Selection/Averaging: • Computational Cost • Discrete Optimization • Criteria • Nonparametric Bayesian Promises: • Inferring the model from the data • Immunity to over-fitting • Well-defined mathematical framework
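Dirichlet Distribution • Functional form: $\mathrm{Dir}(q \mid \alpha) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} q_i^{\alpha_i - 1}$, where $\alpha_0 = \sum_{i=1}^{k} \alpha_i$. • $q \in \mathbb{R}^k$: a probability mass function (pmf). • $\alpha$: a concentration parameter. • $\alpha$ can be interpreted as pseudo-observations. • The total number of pseudo-observations is $\alpha_0$. • The Dirichlet distribution is a conjugate prior for a multinomial distribution.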
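The pseudo-observation reading of the concentration parameter and the conjugacy with the multinomial can be illustrated with a short sketch. The function names and the toy counts below are illustrative assumptions, not part of the proposal; the update itself (add observed counts to the prior parameters) is the standard Dirichlet-multinomial result.

```python
def dirichlet_posterior(alpha, counts):
    """Return posterior Dirichlet parameters after observing multinomial counts.

    Conjugacy means the posterior is again a Dirichlet whose parameters are
    the prior pseudo-observations plus the real observation counts.
    """
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Return the mean pmf of a Dirichlet: alpha_i / alpha_0."""
    alpha0 = sum(alpha)  # total number of pseudo-observations
    return [a / alpha0 for a in alpha]


# Toy example (hypothetical numbers): a uniform prior over three outcomes.
prior = [1.0, 1.0, 1.0]
counts = [5, 0, 2]
posterior = dirichlet_posterior(prior, counts)
posterior_mean = dirichlet_mean(posterior)
```

With a small prior total such as the one above, the posterior mean is dominated by the observed counts; a larger prior total would pull it back toward the prior mean.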
Dirichlet Process (DP) • A Dirichlet process is a Dirichlet distribution split infinitely many times. • A draw from a DP is a discrete distribution with an infinite number of atoms. [Figure: successive splits of q into q1 and q2, then into q11, q12, q21 and q22]
Dirichlet Process (DP) • Stick-Breaking Construction: [Figure: a stick of length 1 broken into pieces π1, π2, π3, ...] • Chinese Restaurant Process (CRP):
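Both constructions named on this slide can be simulated in a few lines. The sketch below uses the standard forms, stated here as background rather than taken from the slide: stick-breaking draws v_k ~ Beta(1, α) and sets π_k = v_k ∏_{l<k} (1 − v_l), and the CRP seats a new customer at an existing table with probability proportional to its occupancy or at a new table with probability proportional to α. The truncation level and the function names are assumptions made for illustration.

```python
import random


def stick_breaking(alpha, num_atoms, rng):
    """Draw truncated stick-breaking weights and the leftover stick length."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for _ in range(num_atoms):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining


def chinese_restaurant_process(alpha, num_customers, rng):
    """Seat customers sequentially; return table assignments and table sizes."""
    assignments = []
    table_sizes = []
    for n in range(num_customers):
        # Existing table k has weight n_k; a new table has weight alpha.
        r = rng.uniform(0.0, n + alpha)
        cumulative = 0.0
        choice = len(table_sizes)  # default: open a new table
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                choice = k
                break
        if choice == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[choice] += 1
        assignments.append(choice)
    return assignments, table_sizes


rng = random.Random(0)
weights, leftover = stick_breaking(alpha=1.0, num_atoms=20, rng=rng)
seating, sizes = chinese_restaurant_process(alpha=1.0, num_customers=100, rng=rng)
```

The truncation is only a convenience for simulation: the leftover stick length is the mass that the infinitely many remaining atoms would share.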
Hierarchical Dirichlet Process (HDP) • Grouped Data Clustering: • Consider data organized into several groups (e.g. documents). • A DP can be used to define a mixture over each group. • However, each mixture would be independent of the others. • Sometimes we want to share components among mixtures (e.g. to share topics among documents). • HDP: [Figure: panels (a) and (b)]
Hierarchical Dirichlet Process (HDP) • Stick-Breaking Construction:
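A finite-truncation sketch of the HDP stick-breaking construction is given below. It assumes the usual two-level form from the HDP literature: global weights β are drawn by stick-breaking with parameter γ, and each group draws its own weights π_j from a Dirichlet with parameters α·β, so that all groups share the same atoms but weight them differently. The truncation level, parameter values and function names are illustrative assumptions.

```python
import random


def truncated_gem(gamma, num_atoms, rng):
    """Draw global weights by stick-breaking, truncated to num_atoms atoms.

    The last atom receives whatever is left of the stick so the weights sum to one.
    """
    weights = []
    remaining = 1.0
    for _ in range(num_atoms - 1):
        v = rng.betavariate(1.0, gamma)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


def group_weights(alpha, beta, rng):
    """Draw group-level weights from Dirichlet(alpha * beta) via gamma variates."""
    draws = [rng.gammavariate(alpha * b, 1.0) for b in beta]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(1)
beta = truncated_gem(gamma=1.0, num_atoms=5, rng=rng)          # shared across groups
pis = [group_weights(alpha=5.0, beta=beta, rng=rng) for _ in range(3)]  # one per group
```

Larger values of α make each group's weights stay closer to the shared β; smaller values let groups concentrate on different subsets of the shared atoms.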
Hierarchical Dirichlet Process (HDP) • Chinese Restaurant Franchise (CRF) • Each group corresponds to a restaurant. • There is a franchise-wide menu with an unbounded number of entries. • The number of dishes grows logarithmically with the number of tables and double-logarithmically (log(log(·))) with the number of data points. • Reinforcement effect: New customers tend to sit at tables with many other customers and choose dishes that have been chosen by many other tables.
Hierarchical Dirichlet Process-Based HMM (HDP-HMM) • Graphical Model: • Definition: • Inference algorithms are used to infer the values of the latent variables (z_t and s_t). • A variation of the forward-backward procedure is used for training. • z_t, s_t and x_t represent a state, mixture component and observation, respectively.
The Acoustic Modeling Problem in Speech Recognition • The goal of speech recognition is to map the acoustic data into word sequences: $\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W \frac{P(A \mid W)\, P(W)}{P(A)}$ • In this formulation, P(W|A) is the probability of a particular word sequence given the acoustic observations. • P(W) is the language model. • P(A) is the probability of the observed acoustic data and can usually be ignored because it does not depend on W. • P(A|W) is the acoustic model.
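The decision rule described on this slide picks the word sequence W that maximizes P(A|W) P(W), since P(A) is the same for every hypothesis. A toy sketch in the log domain follows; the hypotheses and their scores are made-up numbers used only to show the arithmetic, and the function name is an assumption.

```python
import math


def decode(acoustic_log_likelihood, language_log_prob):
    """Return the hypothesis maximizing log P(A|W) + log P(W).

    P(A) is omitted because it is constant across hypotheses.
    """
    best_words, best_score = None, -math.inf
    for words, log_p_a_given_w in acoustic_log_likelihood.items():
        score = log_p_a_given_w + language_log_prob[words]
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score


# Hypothetical scores for two competing word sequences.
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
language = {"recognize speech": -3.0, "wreck a nice beach": -7.0}
best, score = decode(acoustic, language)
```

In this toy case the second hypothesis fits the audio slightly better, but the language model makes the first one the overall winner.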
Speech Segmentation and Acoustic Unit Discovery • Problem Statement: • Segmentation is the most fundamental problem in speech recognition, i.e. given perfectly segmented speech, recognition performance is high. • Explicit, or standalone, applications of speech segmentation algorithms are usually limited to problems such as word spotting and speech/non-speech classification. • One important application of segmentation is automatic discovery of acoustic units. • Acoustic unit selection is a critical issue in many speech recognition applications where linguistic resources or training data are limited. • Though traditional context-dependent phone models perform well when there is ample data, automatic discovery of acoustic units offers the potential to provide good performance for resource-deficient languages with complex linguistic structures.
Speech Segmentation and Acoustic Unit Discovery • Relevant Work: • Most approaches to automatic discovery of acoustic units proceed in two steps: • segmentation • clustering • Segmentation is accomplished using a heuristic method that detects changes in energy and/or spectrum. Similar segments are then clustered using an agglomerative method such as a decision tree (for example, see Bacchiani & Ostendorf, 1999). • Recently, Lee & Glass (2012) proposed a nonparametric Bayesian approach for unsupervised segmentation of speech based on a Dirichlet process mixture (DPM) model. In order to obtain phoneme-like segments, they modeled each segment using a 3-state HMM. A Gibbs sampler was employed to estimate segment boundaries along with their parameters. • Another related problem is speaker diarization, where the goal is to partition an input audio stream into homogeneous segments according to speaker identity. Fox et al. (2011) used an HDP-HMM to solve this problem by modeling each speaker as a single state.
Speech Segmentation and Acoustic Unit Discovery • Proposed Approach: • We use an ergodic HDP-HMM for the segmentation. • Clustering will also be based on a nonparametric Bayesian approach (e.g. HDP). • Moreover, we will generate a lexicon based on the discovered units. • Experimental Setup: • Unit Classification Error: This will demonstrate how units modeled using our approach perform without considering errors that can be introduced in the lexicon generation step. • Word Error Rate (WER): This will assess the impact on performance for a system trained completely using our proposed units.
Speech Segmentation and Acoustic Unit Discovery • Segmentation of a Speech Utterance: • Result of Segmentation: • Recall: The percentage of phoneme boundaries that coincide with a declared segment boundary. • Precision: The percentage of declared boundaries that coincide with phoneme boundaries.
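Boundary recall and precision can be computed along the following lines. The slide does not say how close two boundaries must be to "coincide", so the tolerance window, the greedy one-to-one matching and the toy boundary times (in milliseconds) are all assumptions made for this sketch.

```python
def boundary_scores(detected, reference, tolerance):
    """Return (precision, recall) for detected vs. reference boundary times.

    A detected boundary counts as a hit if an as-yet-unmatched reference
    boundary lies within `tolerance` of it; each reference boundary can be
    matched at most once.
    """
    matched = set()
    hits = 0
    for d in detected:
        candidates = [
            (abs(d - r), i)
            for i, r in enumerate(reference)
            if i not in matched and abs(d - r) <= tolerance
        ]
        if candidates:
            _, nearest = min(candidates)  # closest unmatched reference boundary
            matched.add(nearest)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical boundary times in milliseconds.
detected_ms = [100, 310, 520, 800]
reference_ms = [100, 300, 500]
precision, recall = boundary_scores(detected_ms, reference_ms, tolerance=20)
```

Here every reference boundary is found, so recall is 1.0, while the extra boundary at 800 ms lowers precision to 0.75.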
Speech Segmentation and Acoustic Unit Discovery • Consistency of the Segmentation: • The quality of the segments can be measured using a similarity score. • This score, S, is an indicator of consistency: • s1 is the in-class similarity score and is defined as the average over the correlation between different instances of segments with identical labels; • s2 is the out-of-class dissimilarity score. • The quality of segmentation is higher when both numbers are closer to one. • For a meaningful comparison, the average length of segments produced by the two algorithms should be comparable.
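One possible way to compute the two scores is sketched below. Only the definition of s1 (average correlation between segments with identical labels) comes from the slide. Representing each segment by a fixed-length feature vector, using Pearson correlation, and taking s2 as one minus the average correlation between segments with different labels are assumptions made for illustration; the proposal may define these differently.

```python
import math
from itertools import combinations


def correlation(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def consistency_scores(segments, labels):
    """Return (s1, s2): in-class similarity and assumed out-of-class dissimilarity."""
    same, different = [], []
    for i, j in combinations(range(len(segments)), 2):
        c = correlation(segments[i], segments[j])
        (same if labels[i] == labels[j] else different).append(c)
    if not same or not different:
        raise ValueError("need at least one same-label and one cross-label pair")
    s1 = sum(same) / len(same)
    s2 = 1.0 - sum(different) / len(different)  # assumption, see lead-in
    return s1, s2


# Hypothetical segment vectors: two instances each of units "a" and "b".
segments = [[1, 2, 3], [2, 4, 6], [1, -2, 1], [2, -4, 2]]
labels = ["a", "a", "b", "b"]
s1, s2 = consistency_scores(segments, labels)
```

In this toy case same-label segments are perfectly correlated and cross-label segments are uncorrelated, so both scores come out at one.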
Left-to-Right HDP-HMM With HDP Emission • Problem Statement: • Modeling sub-word units (e.g. phonemes) is a crucial component of the acoustic modeling problem. • Left-to-right Hidden Markov Models (HMMs) with a mixture of Gaussians have been used successfully to model sub-word units. • All units usually use the same topology and the same number of mixtures; i.e. the complexity is fixed for all models. • The Expectation-Maximization (EM) algorithm or its variants are used for training. • Given more data, the model's structure remains the same and only the estimates of the parameters (e.g. means and covariances) improve. • Different models have different amounts of data but their complexity is the same. This means some models are over-trained and some are under-trained. • Because of the lack of hierarchical structure, extending the model is heuristic. For example, there is no consensus about sharing data in gender-specific modeling. Some prefer to train completely separate models for each group while others use different heuristic methods to share data.
Left-to-Right HDP-HMM With HDP Emission • Relevant Work: • Bourlard (1993) and others proposed to replace Gaussian mixture models (GMMs) with a neural network based on a multilayer perceptron (MLP). • It was shown that MLPs generate reasonable estimates of a posterior distribution of an output class conditioned on the input patterns. • This hybrid HMM-MLP system works slightly better than traditional HMM-GMMs, but the gain was not significant. • Other examples of this approach are reported in (Lefèvre, 2003) and (Shang, 2009), where nonparametric density estimators (e.g. kernel methods) have been used to replace GMMs. • Henter et al. (2012) introduced a new model named a Gaussian process dynamical model (GPDM) to completely replace HMMs in acoustic modeling. This model is used only in speech synthesis and as of now lacks a corresponding recognition algorithm. • The above efforts can be classified as nonparametric non-Bayesian approaches. • Each of these approaches was proposed to model the emission distributions using a nonparametric method, but none of them addressed the model topology problem.
Left-to-Right HDP-HMM With HDP Emission • Proposed Approach: • The proposed model will address both the topology and the emission distribution modeling problems. • It is based on the well-defined HDP-HMM model. • Three new features are introduced: • The HDP-HMM is an ergodic model. We extend the definition to a left-to-right topology. • The HDP-HMM uses a DPM to model emissions for each state. Our proposed model will use an HDP to model the emissions. In this manner we allow components of emission distributions to be shared within an HMM. This is particularly important for left-to-right models, where the number of discovered states is usually larger than for an ergodic HMM. As a result, we have fewer data points associated with each state. • Dummy "initial" and "final" states will be included in the final definition. Estimating transition probabilities from other states connected to the "final" dummy state will be formulated using ML and Bayesian frameworks.
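To make the difference between an ergodic and a left-to-right topology concrete, the sketch below takes an ergodic transition matrix, removes every transition to an earlier state, and renormalizes each row. This masking step is only an illustrative assumption about how a left-to-right constraint could be imposed; the proposal's own definition of the left-to-right HDP-HMM is not reproduced here, and the matrix values are made up.

```python
def left_to_right(transitions):
    """Zero out transitions to earlier states and renormalize each row."""
    constrained = []
    for i, row in enumerate(transitions):
        # Keep self-loops and forward transitions only (column index >= row index).
        kept = [p if j >= i else 0.0 for j, p in enumerate(row)]
        total = sum(kept)
        if total == 0.0:
            raise ValueError(f"state {i} has no forward or self transition")
        constrained.append([p / total for p in kept])
    return constrained


# Hypothetical 3-state ergodic transition matrix (rows sum to one).
ergodic = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
]
ltr = left_to_right(ergodic)
```

After masking, the matrix is upper triangular: a state can stay where it is or move forward, and the last state can only loop on itself until an exit to a final state is modeled.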
Left-to-Right HDP-HMM With HDP Emission • Proposed Approach (cont.): • A new inference algorithm based on a block sampler will be derived. • The new model is expected to be more accurate and to avoid some of the intrinsic problems of parametric HMMs (e.g. over-trained and under-trained models). • The hierarchical definition of the model within the Bayesian framework makes it relatively easy to solve problems such as sharing data among related models (e.g. models of the same phoneme for different clusters of speakers). • Experimental Setup: • Phoneme Classification (TIMIT Database) • Isolated Word Recognition (Alphadigit Database)
Left-to-Right HDP-HMM With HDP Emission • Definition: • Graphical Model
Left-to-Right HDP-HMM With HDP Emission • Automatically derived model structures (without the first and last dummy states), obtained using the left-to-right HDP-HMM model, for: • /aa/ with 175 examples • /sh/ with 100 examples • /aa/ with 2256 examples • /sh/ with 1317 examples • The data used in this illustration was extracted from the training portion of the TIMIT Corpus.
Nonparametric Bayesian Training • In order to use left-to-right HDP-HMMs in a typical speech recognition system, we need a new training algorithm. • The algorithm should satisfy the following requirements: • Typically we do not have phoneme-level transcriptions. The algorithm should be able to start from an utterance-level transcription and train different models together. • There is at least one stage in training where we need to tie states or models (training context-dependent models). We prefer to handle this using a nonparametric Bayesian approach. • Because of the introduction of the dummy states, training different models together is possible.
Nonparametric Bayesian Training • Algorithm: • Z_i: the model index for each data point X_i. • W_j: a sequence of X_i with the same Z_i. • M_j: model ID. • Initialize Z_i either randomly or by bootstrapping from a conventional system. • The result is several sub-sequences. Each sub-sequence will have a unique Z_i. Therefore a sequence of X_i will be converted into a sequence of sub-sequences, W_j. • For a given sequence of data, use the transcription to generate a list of models. • Regroup the sub-sequences, W_j, based on their corresponding Z_i and distribute each group to the corresponding HDP-HMM model (M_{Z_i}). • Train each HDP-HMM using the inference algorithm. Training each left-to-right HDP-HMM involves several sequences of data. Fortunately, since each left-to-right HDP-HMM has a start dummy state (the first state, which does not emit), using multiple sequences in the inference algorithm does not change the algorithm.
Nonparametric Bayesian Training • After all models are trained, re-estimate the Z_i for all X_i. This can be done using the Viterbi algorithm or in a Bayesian framework. • After several iterations and after convergence we can fix the topology of each model.
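The bookkeeping part of the algorithm (turning a labeled frame sequence into sub-sequences and regrouping them by model) can be sketched as follows. Only that data-handling step is shown; training and re-estimation of the model indices are not. The function names, the string model IDs and the toy observations are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby


def split_into_subsequences(observations, model_indices):
    """Split a sequence into runs of consecutive points with the same model index.

    Returns a list of (model_index, sub_sequence) pairs in temporal order.
    """
    if len(observations) != len(model_indices):
        raise ValueError("each observation needs exactly one model index")
    pairs = zip(model_indices, observations)
    return [
        (z, [x for _, x in run])
        for z, run in groupby(pairs, key=lambda pair: pair[0])
    ]


def regroup_by_model(subsequences):
    """Collect all sub-sequences assigned to the same model."""
    groups = defaultdict(list)
    for z, sub_sequence in subsequences:
        groups[z].append(sub_sequence)
    return dict(groups)


# Hypothetical data: seven scalar observations with per-point model indices.
X = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
Z = ["sil", "sil", "aa", "aa", "aa", "sil", "sh"]
W = split_into_subsequences(X, Z)
training_sets = regroup_by_model(W)
```

Each entry of `training_sets` is the list of sub-sequences that one model would be trained on in the current iteration; after re-estimating the model indices, the same two functions can be applied again.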
Nonparametric Bayesian Training • Tying States: • 1. Given the monophone models, train all existing triphones in the data set and also segment the data into different states. • 2. Group all corresponding states of all triphones with the same central phone. • 3. Each of these groups will contain all the data associated with the states inside the group. • 4. In each group, use a DPM to cluster the data. It is also possible to use an HDP across different groups. • 5. Merge small clusters into the closest cluster. • 6. Use back-off modeling (e.g., use monophones instead of triphones) for unseen triphones. • Experimental Setup: • Using several datasets for continuous speech recognition (e.g. TIMIT, WSJ and SWITCHBOARD). • Comparing the results with state-of-the-art systems.
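Step 5 of the tying procedure (merging small clusters into the closest cluster) can be sketched as below. The slide does not specify the size threshold or the distance measure, so the minimum cluster size, the use of Euclidean distance between cluster centroids, and the toy data are assumptions; the cluster labels are taken as given, for example from the DPM clustering in step 4.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def merge_small_clusters(points, labels, min_size):
    """Relabel points of clusters smaller than min_size to the nearest large cluster."""
    members = {}
    for point, label in zip(points, labels):
        members.setdefault(label, []).append(point)
    # Centroid of each cluster: per-dimension mean of its points.
    centroids = {
        label: [sum(dim) / len(pts) for dim in zip(*pts)]
        for label, pts in members.items()
    }
    large = [label for label, pts in members.items() if len(pts) >= min_size]
    if not large:
        raise ValueError("no cluster reaches min_size")
    remap = {}
    for label in members:
        if label in large:
            remap[label] = label
        else:
            remap[label] = min(
                large, key=lambda k: squared_distance(centroids[label], centroids[k])
            )
    return [remap[label] for label in labels]


# Hypothetical 2-D data: two well-populated clusters and one singleton.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (9, 9)]
labels = [0, 0, 0, 1, 1, 1, 2]
merged = merge_small_clusters(points, labels, min_size=2)
```

The singleton cluster near (9, 9) is absorbed by the cluster centered near (10.3, 10.3). All small clusters are reassigned in one pass against the original centroids, which is a simplification; an iterative scheme that recomputes centroids after each merge would be another option.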
Summary of Contributions • The major contributions of this research are: • Introducing left-to-right HDP-HMMs with HDP emissions and a corresponding inference algorithm. • Introducing an algorithm to train left-to-right HDP-HMMs in a continuous speech recognition system. • Studying the performance of a left-to-right HDP-HMM. • Using a nonparametric clustering approach for state tying. • Studying the application of nonparametric Bayesian models to automatic acoustic unit discovery. • The new models are expected to improve on state-of-the-art speech recognition systems. • Some of the preliminary experiments show promising results.
Publications • Published: • Harati, A., Picone, J., & Sobel, M. (2012). Applications of Dirichlet Process Mixtures to Speaker Adaptation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4321–4324). Kyoto, Japan. • Submitted: • Harati, A., Picone, J., & Sobel, M. (2013). Speech Segmentation Using Hierarchical Dirichlet Processes. Submitted to INTERSPEECH. Lyon, France. • Under Development: • A NIPS conference paper on the left-to-right HDP-HMM with HDP emissions. The experiments for this paper will be on phoneme classification. The theoretical contribution will be the left-to-right HDP-HMM. • A journal paper on left-to-right HDP-HMMs. This will be an extension of the above paper and will contain extensions related to dummy states. • A journal paper on the final results for automatic acoustic unit learning. This paper should contain a WER for a speech recognizer trained using the discovered units. • A journal paper (IEEE Transactions) on using the left-to-right HDP-HMM for continuous speech recognition with the training algorithm described in the last part of the proposal.
References
Bacchiani, M., & Ostendorf, M. (1999). Joint lexicon, acoustic unit inventory and model design. Speech Communication, 29(2-4), 99–114.
Bourlard, H., & Morgan, N. (1993). Connectionist Speech Recognition: A Hybrid Approach. Springer.
Dusan, S., & Rabiner, L. (2006). On the relation between maximum spectral transition positions and phone boundaries. Proceedings of INTERSPEECH (pp. 1317–1320). Pittsburgh, Pennsylvania, USA.
Fox, E., Sudderth, E., Jordan, M., & Willsky, A. (2011). A Sticky HDP-HMM with Application to Speaker Diarization. The Annals of Applied Statistics, 5(2A), 1020–1056.
Harati, A., Picone, J., & Sobel, M. (2012). Applications of Dirichlet Process Mixtures to Speaker Adaptation. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4321–4324). Kyoto, Japan.
Harati, A., Picone, J., & Sobel, M. (2013). Speech Segmentation Using Hierarchical Dirichlet Processes. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (p. TBD). Vancouver, Canada.
Henter, G. E., Frean, M. R., & Kleijn, W. B. (2012). Gaussian process dynamical models for nonparametric speech representation and synthesis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4505–4508). Kyoto, Japan.
Lee, C., & Glass, J. (2012). A Nonparametric Bayesian Approach to Acoustic Model Discovery. Proceedings of the Association for Computational Linguistics (pp. 40–49). Jeju, Republic of Korea.
Lefèvre, F. (2003). Non-parametric probability estimation for HMM-based automatic speech recognition. Computer Speech & Language, 17(2-3), 113–136.
Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257–286.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.
Shang, L. (2009). Nonparametric Discriminant HMM and Application to Facial Expression Recognition. IEEE Conference on Computer Vision and Pattern Recognition (pp. 2090–2096). Miami, FL, USA.
Shin, W., Lee, B.-S., Lee, Y.-K., & Lee, J.-S. (2000). Speech/non-speech classification using multiple features for robust endpoint detection. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1399–1402). Istanbul, Turkey.
Suchard, M. A., Wang, Q., Chan, C., Frelinger, J., West, M., & Cron, A. (2010). Understanding GPU Programming for Statistical Computation: Studies in Massively Parallel Massive Mixtures. Journal of Computational and Graphical Statistics, 19(2), 419–438.
Teh, Y., Jordan, M., Beal, M., & Blei, D. (2006). Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101(476), 1566–1581.