Let us pray to the Almighty to illuminate our intellect towards the righteous path.
An Empirical Approach for Optimization of Acoustic Models in Hindi Speech Recognition Systems
R. K. Aggarwal, Dept. of Computer Engineering, National Institute of Technology (NIT), Kurukshetra, Haryana, India.
Automatic Speech Recognition (ASR)
[Block diagram: input speech → Preprocessing → Feature Extraction → Training (Model Generation) and Testing (Pattern Classification) → Recognized Words]
• The goal of ASR is to convert a speech signal into its equivalent text message independent of the device, speaker or the environment.
• It is a pattern recognition problem in which features are extracted and a model is used for training and testing.
Statistical Approach to ASR
[Block diagram: speech sound → microphone → Pre-Processing → Feature Extraction (LPCC/MFCC parameters) at the front end; Recognizer with Acoustic Modeling and Language Modeling at the back end → recognized speech]
Statistical framework of ASR
• State-of-the-art speech recognition systems use mixture Gaussian output probability distributions in HMMs together with context-dependent phone models. To handle the large number of HMM state parameters, many similar states of the model are tied and the data corresponding to all these states are used to train one global state. HMMs with this type of sharing were proposed in the literature under the names semi-continuous and tied-mixture HMMs.
• The main components of a statistical ASR system are feature extraction, acoustic models (HMMs), a language model and a hypothesis search unit. The acoustic model typically consists of two parts: the first describes how a word sequence can be represented by sub-word units, and the second maps each sub-word unit to acoustic observations. In the language model, rules are introduced to follow the linguistic restrictions present in the language and to allow rejection of invalid phoneme sequences.
• The acoustic and language models resulting from the training procedure are used as knowledge sources during decoding.
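The decoding criterion implied by this framework is the standard Bayes (maximum a posteriori) decision rule; the formula below is the textbook statement of it rather than one taken from the original slides:

$$\hat{W} \;=\; \arg\max_{W} P(W \mid O) \;=\; \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}$$

where O is the sequence of observation (feature) vectors and W a candidate word sequence.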
Work Significance
Difficulty in designing ASR for Indian languages
• For the design and development of ASR systems for European languages, where large, standard databases (e.g. TIMIT, the Switchboard corpus) are available to model acoustic variability, high degrees of mixture tying have been applied, e.g.:
• 4000 to 8000 total tied states
• 8 to 128 Gaussian mixtures
The same convention cannot be followed for Indian languages, as the databases prepared by various research groups are relatively small and phonetically not very rich.
Solution
• In this paper we present a solution that finds the right degree of mixture tying by empirically observing the performance of a Hindi speech recognition system on a self-prepared small database.
Front-end Design
The front end for speech signals mainly covers:
• Preprocessing
• Receiving the speech sound from the speaker.
• Filtering the background noise to achieve the highest possible signal-to-noise ratio (SNR).
• Digitizing the analog speech signal.
• Feature Extraction (Parametric Transformation)
• Extracting the set of properties of an utterance that have an acoustic correlation to the speech signal.
• The Perceptual Linear Prediction (PLP) feature extraction technique, which is based on the working of the human auditory system, is used in the front end (see the sketch below).
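A minimal sketch of the preprocessing steps in Python; the frame length, shift and pre-emphasis coefficient are typical values assumed here, not ones stated in the paper:

```python
# Pre-emphasis followed by framing into overlapping windows, as typically
# done before PLP analysis. Parameter values are assumptions (25 ms frames,
# 10 ms shift at a 16 kHz sampling rate).
import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, alpha=0.97):
    # Pre-emphasis: boost high frequencies to flatten the spectral tilt.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Slice the signal into overlapping frames.
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to reduce spectral leakage at frame edges.
    return frames * np.hamming(frame_len)
```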
PLP
[Figure: block diagram of perceptual linear predictive (PLP) speech analysis]
PLP Feature Extraction
Critical band resolution
• Critical band analysis is the basis for almost all models of the auditory system. It approximates the ear's ability to discriminate between different frequencies. Experiments have shown that 25 critical bands exist over the frequency range of human hearing, which spans from 20 Hz to 20 kHz.
• The critical bands have a constant width of 100 Hz for center frequencies up to 500 Hz; the bandwidths increase as the center frequency increases further.
• Critical band analysis is a frequency-domain transformation, which can be implemented as a filter bank of bandpass filters. Bark scaling is used for the filter bank, since a linear frequency scale is inadequate for representing the auditory system.
• The human auditory system responds linearly to frequency at low frequencies but logarithmically at higher frequencies.
• One critical band corresponds to a 1.5 mm step along the basilar membrane, which contains about 1200 primary auditory nerve fibers.
PLP Feature Extraction
• To obtain the auditory spectrum, 17 critical-band filter outputs are used. Their center frequencies are equally spaced in the Bark domain, defined by

$$z(f) = 6 \ln\left(\frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^{2} + 1}\right)$$

where f is the frequency in Hz and z maps the range 0-5 kHz into the range 0-17 Bark (i.e. 0 ≤ z ≤ 17 Bark).
• Each band i is simulated by a spectral weighting Ψ(z − z_i), where the z_i are the center frequencies and Ψ is the piecewise critical-band masking curve (Hermansky, 1990):

$$\Psi(z) = \begin{cases} 0, & z < -1.3 \\ 10^{2.5(z+0.5)}, & -1.3 \le z \le -0.5 \\ 1, & -0.5 < z < 0.5 \\ 10^{-(z-0.5)}, & 0.5 \le z \le 2.5 \\ 0, & z > 2.5 \end{cases}$$

• Finally, the feature vector consists of 39 values: 12 cepstral coefficients plus one energy term, 13 delta cepstral coefficients and 13 delta-delta coefficients.
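As a small illustration, the Bark warping above and the equally spaced center frequencies can be computed as follows; the exact filter spacing is an assumption, since the slides only state that 17 filters span 0-17 Bark:

```python
# Hz <-> Bark warping used in PLP and the 17 critical-band filter centers.
import numpy as np

def hz_to_bark(f):
    # z(f) = 6 * ln(f/600 + sqrt((f/600)^2 + 1)) = 6 * asinh(f/600)
    return 6.0 * np.arcsinh(f / 600.0)

def bark_to_hz(z):
    # Algebraic inverse of the warping above (my own derivation).
    return 600.0 * np.sinh(z / 6.0)

# Center frequencies of the 17 filters, equally spaced on the Bark axis
# (assumed spacing: 1 Bark apart, from 1 to 17 Bark).
centers_bark = np.linspace(1.0, 17.0, 17)
centers_hz = bark_to_hz(centers_bark)
```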
RASTA (Relative Spectral)
Noise & Channel Compensation Technique
• The linguistic components of speech are governed by the rate of change of the vocal tract shape.
• The rate of change of the non-linguistic components (i.e. the noise) in speech often lies outside the typical rate of change of the vocal tract shape.
• The relative spectral (RASTA) technique takes advantage of this fact and suppresses spectral components that change more slowly or more quickly than the typical rate of change of speech.
• RASTA has often been combined with the PLP method and is implemented as an IIR filter; the same filter is used for all frequency bands (see the sketch below).
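A sketch of the RASTA band-pass filtering step, using the commonly cited coefficients from Hermansky & Morgan (1994); treat the coefficients and the scipy-based implementation as illustrative rather than as the authors' exact code:

```python
# RASTA filtering applied to the time trajectory of each log critical-band
# energy; the same band-pass filter is run over every frequency band.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum):
    """log_spectrum: array of shape (n_frames, n_bands)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR part (derivative-like)
    a = np.array([1.0, -0.98])                       # IIR pole: slow integration
    # Filter each band's trajectory independently along the time axis.
    return lfilter(b, a, log_spectrum, axis=0)
```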
Tools used for ASR
• HTK 3.4.1
• Developed at Cambridge University.
• Written in C.
• Supports the Linux platform; in a Windows environment it requires the interfacing software Cygwin.
• SPHINX 4
• Developed at Carnegie Mellon University (CMU).
• Written in Java.
• MATLAB
• Julius
HTK 3.4.1 is the most widely used ASR tool.
Univariate Gaussian/Normal Distribution
[Figure: bell-shaped normal density f(x) centered at the mean μ]
The univariate Gaussian is a continuous probability distribution used when only one observation is under consideration, e.g. the height of students. The normal density is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$

where μ and σ are its two parameters, viz. the mean and the standard deviation of the Gaussian distribution. The probability distribution F(x) is given by

$$F(x) = \int_{-\infty}^{x} f(t)\, dt$$
Multivariate Gaussian
• A multivariate Gaussian is used when more than one observation is under consideration:
• e.g. the height, weight and IQ level of a student;
• the 39-dimensional MFCC feature vector in the case of ASR.
Its density is

$$\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

where μ is the n-dimensional mean vector, Σ is the n×n covariance matrix, and |Σ| is the determinant of the covariance matrix Σ.
Multivariate Gaussian contd.
If the features are uncorrelated, the covariances among them are zero. In this situation only the diagonal elements of Σ need be considered, as they represent the variances (see the sketch below).
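A minimal sketch of evaluating the log-density of a diagonal-covariance Gaussian, the simplified per-state distribution described above; the function name and array shapes are my own choices:

```python
# Log-density of a Gaussian with diagonal covariance.
import numpy as np

def diag_gaussian_logpdf(x, mean, var):
    """x, mean, var: arrays of shape (n_dims,); var holds the diagonal of Sigma."""
    # log N(x; mu, diag(var)) = -0.5 * [ n*log(2*pi) + sum(log var)
    #                                    + sum((x - mu)^2 / var) ]
    n = x.shape[0]
    return -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(var))
                   + np.sum((x - mean) ** 2 / var))
```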
Mixture of 3 Gaussians
[Figures: illustration of a mixture of 3 Gaussians in two dimensions]
• Fig. 1: contours of constant density for each of the mixture components, with the 3 components denoted red, blue and green.
• Fig. 2: contours of the marginal probability density p(x) of the mixture distribution.
• Fig. 3: a surface plot of the distribution p(x).
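Building on the diagonal-Gaussian helper from the previous sketch, a mixture density can be evaluated with a log-sum-exp over the components; the weights and parameters here are placeholders:

```python
# Log-density of a Gaussian mixture model with diagonal covariances.
# Reuses diag_gaussian_logpdf from the earlier sketch.
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """weights: (K,); means, variances: (K, n_dims). Returns log p(x)."""
    component_logs = np.array([np.log(w) + diag_gaussian_logpdf(x, m, v)
                               for w, m, v in zip(weights, means, variances)])
    # Log-sum-exp over the K mixture components for numerical stability.
    peak = component_logs.max()
    return peak + np.log(np.sum(np.exp(component_logs - peak)))
```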
Review of ASR
Illustration of the speech recognition process: the raw speech waveform is first parameterized into discrete 39-dimensional feature vectors at the front end. In the statistical framework these feature vectors are called observation vectors at the back end. The word string corresponding to the observation vectors is then decoded by the recognizer.
Hidden Markov Model
Speech characteristics
• In speech there are two types of variability:
• Spectral variability
• Temporal variability
To model these variabilities, a doubly stochastic process is required, one process for each type.
HMM Structure
[Figure: left-to-right HMM with three emitting states, dummy Start and End nodes, transition probabilities a_ij, output probabilities b_j(o_t), and observation vectors o_1 ... o_6]
• An HMM is an extended Markov chain, or stochastic finite state machine.
• Temporal variability is covered by the normal working of the Markov chain.
• To cover the spectral variability, an addition is made to the chain: each state is characterized by a special type of pdf, i.e. a mixture of multivariate Gaussians.
Illustration
[Figure: a six-state Markov chain S1 ... S6 with transition probabilities on the arcs and initial probabilities π1 = 0.5, π2 = 0.0, π3 = 0.0, π4 = 0.5, π5 = 0.0, π6 = 0.0]
The left-to-right transition matrix has the banded form

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & 0 \\ 0 & a_{22} & a_{23} & a_{24} \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{pmatrix}$$
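To make the doubly stochastic process concrete, here is a toy forward-algorithm sketch that computes the total observation likelihood for a small left-to-right HMM; the numbers are illustrative, not those from the figure:

```python
# Forward algorithm: P(O | model) for an HMM with N states and T frames.
import numpy as np

def forward(pi, A, B):
    """pi: (N,) initial probs; A: (N, N) transitions;
    B: (T, N) state observation likelihoods b_j(o_t)."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[0]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]  # induction step
    return alpha[-1].sum()                    # likelihood, summed over end states

# Toy example: 3 emitting states, 4 observation frames.
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.random.default_rng(0).uniform(0.1, 1.0, size=(4, 3))
print(forward(pi, A, B))
```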
Unit Selection in Acoustic Models
• Whole-word model
• Successful for domain-specific problems requiring a small vocabulary.
• Syllable model
• HMMs are generated on the basis of the syllables normally used in a given language.
• Context-independent (CI) phone model
• These models are simple, but unable to capture the variation of a phone with respect to its context.
• Triphone model (context-dependent phone model)
• The preceding and succeeding phones are grouped with the middle phone to improve performance. For 48 phones, 48 × 48 × 48 = 110,592 triphone combinations can in principle be generated, which is very difficult to manage. To cope with this problem, tied-state clustering is performed in triphone models (see the sketch below).
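As a small illustration of how the triphone inventory arises, the sketch below expands a phone string into cross-word-style triphone labels using HTK's "l-c+r" notation; the Hindi-like phone sequence is made up:

```python
# Expand a phone sequence into triphone labels of the form left-center+right.
def to_triphones(phones):
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        label = p
        if left:
            label = f"{left}-{label}"
        if right:
            label = f"{label}+{right}"
        out.append(label)
    return out

print(to_triphones(["sil", "n", "a", "m", "a", "s", "t", "e", "sil"]))
# ['sil+n', 'sil-n+a', 'n-a+m', 'a-m+a', 'm-a+s', 'a-s+t', 's-t+e', 't-e+sil', 'e-sil']
```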
State Clustering in Triphone HMMs
Need for state tying
• A typical system might have approximately 2400 states with 4 mixture components per state, giving about 800k parameters in total, or approximately 4800 states with 8 mixture components per state, giving about 3000k parameters in total.
• The training data is not sufficient to generate an appropriate Gaussian mixture model for each state.
• To address this problem in context-dependent models, many similar states of the model are tied, and the data corresponding to all these states are used to train one global state. This yields a large amount of data for each state, so the parameters are well estimated.
• HMMs with this type of sharing were proposed in the literature under the names semi-continuous and tied-mixture HMMs.
What is state clustering?
State tying/clustering
• Acoustically similar states are tied to form state clusters.
• The clusters are known as senones or genones, names given by various research groups.
• State clusters are formed by building a cluster tree with a bottom-up approach (a toy sketch follows this list).
Tree-based clustering
• The leaf nodes in the tree correspond to individual HMM states.
• Acoustically similar states are clustered to form the next higher level.
• This iteration is performed until the desired number of clusters is reached.
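A toy sketch of the bottom-up clustering idea, merging the two closest states at each step; Euclidean distance between state mean vectors is used here as a stand-in for a proper likelihood-based distance measure:

```python
# Agglomerative (bottom-up) clustering of HMM states by their mean vectors.
import numpy as np

def cluster_states(means, n_clusters):
    clusters = [[i] for i in range(len(means))]   # one leaf per HMM state
    centroids = [m.copy() for m in means]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (brute force, fine for a toy).
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: np.linalg.norm(centroids[ab[0]] - centroids[ab[1]]))
        # Merge j into i; the merged centroid is the size-weighted average.
        wi, wj = len(clusters[i]), len(clusters[j])
        centroids[i] = (wi * centroids[i] + wj * centroids[j]) / (wi + wj)
        clusters[i] += clusters[j]
        del clusters[j], centroids[j]
    return clusters
```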
Experimental Setup
HMM state topology
• A whole-word model and a cross-word triphone HMM with a linear left-to-right topology were used to compute the score of a sequence of features against its phonetic transcription.
• In the triphone model, 3 states per phone were used, along with dummy (non-emitting) initial and final nodes, and state skipping was not permitted. For the whole-word model, seven states per word were used (see the sketch below).
Training & testing
The experiments were performed on a set of speech data consisting of six hundred Hindi words recorded by 10 male and 10 female speakers. Each model was trained using various utterances of each word. Testing was done on one hundred randomly chosen words spoken by different speakers, i.e. a total of one hundred test words.
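A sketch of the linear left-to-right topology described above, building a transition matrix with dummy entry/exit states and no skip transitions; the self-loop probability is a placeholder value, not one from the paper:

```python
# Transition matrix for a linear left-to-right HMM topology without skips.
import numpy as np

def linear_topology(n_emitting, self_loop=0.6):
    n = n_emitting + 2                     # add non-emitting entry/exit states
    A = np.zeros((n, n))
    A[0, 1] = 1.0                          # entry state always moves to state 1
    for s in range(1, n_emitting + 1):
        A[s, s] = self_loop                # stay in the same state, or ...
        A[s, s + 1] = 1.0 - self_loop      # ... advance to the next state only
    return A

print(linear_topology(3))                  # triphone case: 3 emitting states
```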
Experiment with Different Mixtures
Experiments were performed six times with different numbers of Gaussians, using the triphone model as the fundamental speech unit and the MLE technique for HMM parameter estimation. Maximum accuracy was observed with sixteen Gaussian mixture components. This is far fewer than in European-language ASR, where normally 64 Gaussian mixtures are used to achieve optimum results.
Experiments with Vocabulary Sizes
• Two models, the whole-word model and the sub-word triphone model, were investigated with various vocabulary sizes.
• The smaller the vocabulary, the lower the chance of confusion, and hence the better the expected accuracy.
• For small vocabularies of up to 200 words, the whole-word model gives maximum accuracy; beyond that, the triphone model must be used for better accuracy.
• Sixteen Gaussian mixtures were used in training the models to obtain the best results.
Experiments with Various Tied States
Sixteen Gaussian mixtures were used for each tied-state configuration. With the help of a decision tree, the mixtures of each state were tied for each base phone, and the training-data triphones were mapped into a smaller set of tied-state triphones. Each state position of each phone has a binary tree associated with it. Maximum accuracy was observed at around one thousand tied states.
Conclusion
• To avoid overfitting of the data and to minimize computational overhead, an appropriate degree of mixture tying is very important.
• Experimental results have shown that only 16 Gaussian mixtures and around one thousand tied states yield optimal performance for the databases available for Indian languages, whereas for European languages:
• the total number of tied states in a large-vocabulary speaker-independent system typically ranges between 5,000 and 10,000 states;
• a range of 32 to 128 mixtures is used.
• For a small vocabulary the whole-word model is enough, but as the vocabulary size increases the triphone model is required to achieve optimum results. The word recognition accuracy of whole-word models decreases more rapidly than that of sub-word models.
References
• A. E. Rosenberg, L. R. Rabiner, J. Wilpon, D. Kahn. 1983. Demisyllable-Based Isolated Word Recognition System. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-31(3): 713-726.
• A. Sharma, M. C. Shrotriya, O. Farooq, Z. A. Abbasi. 2008. Hybrid Wavelet Based LPC Features for Hindi Speech Recognition. International Journal of Information and Communication Technology, Inderscience, vol. 1, pp. 373-381.
• A. Sixtus and H. Ney. 2002. From Within-Word Model Search to Across-Word Model Search in Large Vocabulary Continuous Speech Recognition. Computer Speech and Language, 16(2): 245-271.
• C. Becchetti and K. P. Ricotti. 2004. Speech Recognition: Theory and C++ Implementation. John Wiley.
• D. Klatt. 1986. The Problem of Variability in Speech Recognition and in Models of Speech Perception. In J. S. Perkell and D. H. Klatt (eds.), Invariance and Variability in Speech Processes, 300-320. Lawrence Erlbaum Associates, Hillsdale, N.J.
References contd…
• Douglas O'Shaughnessy. 2003. Interacting with Computers by Voice: Automatic Speech Recognition and Synthesis. Proceedings of the IEEE, 91(9): 1272-1305.
• F. Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.
• H. Hermansky. 1990. Perceptual Linear Predictive (PLP) Analysis of Speech. Journal of the Acoustical Society of America, 87: 1738-1752.
• H. Hermansky and N. Morgan. 1994. RASTA Processing of Speech. IEEE Transactions on Speech and Audio Processing, 2(4): 578-589.
• H. Hermansky, S. Sharma. 1999. Temporal Patterns (TRAPs) in ASR of Noisy Speech. Proc. of IEEE Conference on Acoustics, Speech and Signal Processing.
• J. Koehler, N. Morgan, H. Hermansky, H. G. Hirsch and G. Tong. 1994. Integrating RASTA-PLP into Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1: 421-424.
References contd…
• J. Baker, P. Bamberg et al. 1992. Large Vocabulary Recognition of Wall Street Journal Sentences at Dragon Systems. Proc. DARPA Speech and Natural Language Workshop, 387-392.
• J. Picone. 1993. Signal Modeling Techniques in Speech Recognition. Proceedings of the IEEE, 81(9): 1215-1247.
• K.-F. Lee. 1989. Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic.
• L. E. Baum and J. A. Eagon. 1967. An Inequality with Applications to Statistical Estimation for Probabilistic Functions of Markov Processes and to a Model for Ecology. Bulletin of the American Mathematical Society, 73: 360-363.
• L. R. Bahl, P. F. Brown, P. V. de Souza and R. L. Mercer. 1986. Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition. Proceedings of IEEE ICASSP, 49-52.
References contd…
• L. R. Rabiner. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. of the IEEE, 77(2): 257-286.
• L. R. Rabiner and R. W. Schafer. 2007. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing, 1(1-2): 33-73.
• Li Deng, D. O'Shaughnessy. 2003. Speech Processing: A Dynamic and Optimization-Oriented Approach. Marcel Dekker Inc., New York-Basel.
• M. Gales and S. Young. 2007. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 1(3): 195-304.
• M. Hwang and X. Huang. 1992. Subphonetic Modeling with Markov States: Senone. In Proc. of IEEE ICASSP, 33-36.
• M. J. Hunt, M. Lennig, P. Mermelstein. 1980. Experiments in Syllable-Based Recognition of Continuous Speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 880-883.
References contd…
• M. Kumar, A. Verma, and N. Rajput. 2004. A Large Vocabulary Speech Recognition System for Hindi. Journal of IBM Research, vol. 48, pp. 703-715.
• M. Y. Hwang, X. Huang and F. Alleva. 1993. Predicting Unseen Triphones with Senones. Proc. IEEE ICASSP-93, II: 311-314.
• Nagendra Goel, Samuel Thomas, Mohit Agarwal et al. 2010. Approaches to Automatic Lexicon Learning with Limited Training Examples. Proc. of IEEE Conference on Acoustics, Speech and Signal Processing.
• Pablo Fetter, Alfred Kaltenmeier, Thomas Kuhn and Peter Regel-Brietzmann. 1996. Improved Modeling of OOV Words in Spontaneous Speech. Int. Conf. on Acoustics, Speech, and Signal Processing.
References contd…
• Rivarol Vergin, Douglas O'Shaughnessy, Azarshid Farhat. 1999. Generalized Mel Frequency Cepstral Coefficients for Large-Vocabulary Speaker-Independent Continuous-Speech Recognition. IEEE Transactions on Speech and Audio Processing, 7(5): 525-532.
• R. K. Aggarwal and M. Dave. 2008. Implementing a Speech Recognition System Interface for Indian Languages. Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, IIIT Hyderabad.
• R. K. Aggarwal and M. Dave. 2010. Effects of Mixtures in Statistical Modeling of Hindi Speech Recognition Systems. Proceedings of the 2nd International Conference on Intelligent Human Computer Interaction, Springer.
• R. Schwartz, Y. Chow, O. Kimball, S. Roucos, M. Krasner, J. Makhoul. 1985. Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech. IEEE International Conference on Acoustics, Speech and Signal Processing.
References contd…
• S. J. Young. 1992. The General Use of Tying in Phoneme-Based HMM Speech Recognizers. Int. Conf. on Acoustics, Speech, and Signal Processing, 569-572.
• S. J. Young, J. J. Odell and P. C. Woodland. 1994. Tree-Based State Tying for High Accuracy Acoustic Modeling. Proceedings of the Human Language Technology Workshop, 307-312.
• S. J. Young and P. C. Woodland. 1993. The Use of State Tying in Continuous Speech Recognition. Proc. ESCA Eurospeech, 3: 2203-2206, Berlin, Germany.
• S. Young, M. Gales, D. Povey et al. 2006. The HTK Book. Cambridge University Engineering Department.
• V. Digalakis and H. Murveit. 1994. Genones: Optimizing the Degree of Tying in a Large Vocabulary HMM-Based Speech Recognizer. Proc. of IEEE ICASSP, 537-540.