Application of Shifted Delta Cepstral Features for GMM Language Identification Masters Thesis Defense Jonathan J. Lareau, Rochester Institute of Technology, Department of Computer Science Tuesday, Oct. 24, 2006 12:00 pm Full report can be found at: www.jonlareau.com/JonathanLareau.pdf
What’s the use of LID? • Pre-Processor for Automatic Speech Recognition Algorithms • Routing speech signals for: • Telecommunications • Multimedia • Human-Computer interfaces • Security applications.
Overview • Study the use of different types of Shifted Delta (SD) feature vectors for telephone speech language identification (LID). • 6 types of feature vectors • Uniform GMM pattern recognition algorithm.
Additionally… • Heuristic speech enhancement pre-processor • No phonemically labeled training data • Original code written in MATLAB 7 • Also Uses: • NETLAB [Nabney, 2002] • RASTA-MAT [Ellis, 2005]
Methods for LID • Phonemic Recognition followed by Language Modeling (PRLM) [Zissman, 1993] • Predominant method for ASR/LID • Works well, but laborious + time consuming • Difficult to extend
Why use Gaussian Mixture Models? • Alternative to PRLM methods • Avoids the laborious phonemic labeling required by PRLM techniques • Comparatively easy to extend
Method - Feature Vectors • Pre-Processing • Pre-Emphasis • Cepstral Speech Enhancement • Base Feature Extraction • Post-Processing • Cepstral Mean Subtraction • Shifted Delta Operation • Silence Removal
Pre-Emphasis Filtering (Pre-Processor) • The speech spectrum has a natural attenuation of approximately 20 dB/decade [Picone, 1993] • The pre-emphasis filter, Hpre(z), flattens the speech spectrum: Hpre(z) = 1 + apre z^-1
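As an illustration, a pre-emphasis step of this form is a one-line filter in MATLAB (the language the thesis code was written in). The coefficient value below is an assumed typical choice, not necessarily the one used in the thesis:

```matlab
% Pre-emphasis sketch: Hpre(z) = 1 + a_pre*z^-1 applied to a speech signal.
% a_pre = -0.95 is an assumed, commonly used value (the thesis's value may differ).
a_pre = -0.95;
x = randn(8000, 1);                 % stand-in for one second of 8 kHz telephone speech
x_pre = filter([1 a_pre], 1, x);    % y[n] = x[n] + a_pre * x[n-1]
```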
Base Feature Vector Extraction • All based on Cepstral Coefficients • Linear Predictive (LP-CC) • All-pole filter models the formant envelope • Mel-Frequency (MF-CC) • Psycho-acoustic frequency scaling • Mimics the response of the human ear • Perceptual Linear Prediction (PLP-CC) • Psycho-acoustic scaling followed by Linear Prediction
Cepstral Mean Subtraction (Post-Processor) • A simple method to reduce channel effects. • Once the Cepstral feature vectors are calculated using MF, LP, or PLP, the mean feature vector over the entire utterance is subtracted from each frame.
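A minimal sketch of this step (not the thesis code), with C holding one cepstral feature vector per row:

```matlab
% Cepstral mean subtraction sketch: subtract the utterance-level mean vector
% from every frame of a T-by-D matrix of cepstral features.
C = randn(300, 12);                             % stand-in: 300 frames of 12 cepstra
C_cms = C - repmat(mean(C, 1), size(C, 1), 1);
```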
Shifted Delta Cepstra • Pseudo-prosodic feature vectors from acoustic feature vectors • Quick approximation to true prosodic modeling • ‘Stacks’ blocks of evenly spaced and differenced feature vectors.
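A minimal sketch of the shifted delta stacking operation. The N-d-P-k parameterization (N base cepstra, delta spread d, frame shift P, k stacked blocks) and the 7-1-3-7 values below are the common settings from the SDC literature, assumed here for illustration rather than taken from the thesis:

```matlab
% Shifted delta cepstra sketch. C is a T-by-N matrix of base cepstral vectors.
% For each frame t, stack k delta vectors taken at offsets 0, P, 2P, ...:
%   delta_i(t) = C(t + i*P + d, :) - C(t + i*P - d, :),   i = 0 .. k-1
d = 1; P = 3; k = 7;                      % assumed common SDC settings (7-1-3-7)
C = randn(300, 7);                        % stand-in: 300 frames of 7 base cepstra
T = size(C, 1);
lastValid = T - ((k - 1) * P + d);        % last frame with a complete stack
SDC = zeros(lastValid - d, k * size(C, 2));
for t = (d + 1):lastValid
    row = [];
    for i = 0:(k - 1)
        row = [row, C(t + i*P + d, :) - C(t + i*P - d, :)];
    end
    SDC(t - d, :) = row;
end
```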
Method - Classification Task • Each language induces a different distribution over the feature space. • Difficult because: • Shape alone doesn’t distinguish between languages • Density information over the feature space must also be captured
…so we use Gaussian Mixture Models
• The multivariate normal (M-V-N) PDF for component j is: p(x|j) = (2π)^(-D/2) |Σj|^(-1/2) exp( -(1/2) (x - μj)' Σj^(-1) (x - μj) )
• A Gaussian Mixture Model (GMM) is then: p(x) = Σ_{j=1..M} w(j) p(x|j)
• With conditions: Σ_{j=1..M} w(j) = 1 and w(j) ≥ 0
• w(j) is the mixture weight (prior) for component j
• M is the model order (number of mixture components)
• p(x|j) is the M-V-N PDF for the jth component, with mean μj, covariance Σj, and feature dimension D
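To make the formulas concrete, here is a small self-contained sketch (not the thesis code) that evaluates a diagonal-covariance GMM density p(x) = Σ w(j)·p(x|j) for a batch of feature vectors; the toy dimensions and parameters are placeholders:

```matlab
% Evaluate a diagonal-covariance GMM density for each row of X (T-by-D).
% mu is M-by-D (component means), v is M-by-D (per-dimension variances),
% w is 1-by-M (mixture weights summing to 1).
T = 100; D = 12; M = 3;
X  = randn(T, D);
mu = randn(M, D);
v  = ones(M, D);
w  = ones(1, M) / M;
p = zeros(T, 1);
for j = 1:M
    xm    = X - repmat(mu(j, :), T, 1);
    expo  = -0.5 * sum((xm .^ 2) ./ repmat(v(j, :), T, 1), 2);
    norml = (2 * pi) ^ (-D / 2) / sqrt(prod(v(j, :)));
    p = p + w(j) * norml * exp(expo);      % accumulate w(j) * p(x|j)
end
```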
Gaussian Mixture Models model distributions as well as shape
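For the classification back end, a per-language GMM can be trained and applied roughly as below. This is a hedged sketch that assumes the standard NETLAB functions gmm, gmminit, gmmem, and gmmprob [Nabney, 2002]; it is not the thesis's training script, and the mixture order, iteration counts, and stand-in data are illustrative only:

```matlab
% Sketch: train one GMM per language with NETLAB, then classify a test
% utterance by the highest average log-likelihood. Assumes NETLAB is on the path.
D = 12; M = 16;                                % feature dimension, mixture order (illustrative)
trainFeats = {randn(5000, D), randn(5000, D), randn(5000, D)};   % stand-in per-language features
models = cell(1, length(trainFeats));
for L = 1:length(trainFeats)
    mix  = gmm(D, M, 'diag');                  % diagonal-covariance GMM
    opts = zeros(1, 18); opts(14) = 10;        % k-means initialisation iterations
    mix  = gmminit(mix, trainFeats{L}, opts);
    opts(14) = 30;                             % EM iterations
    mix  = gmmem(mix, trainFeats{L}, opts);
    models{L} = mix;
end
testFeats = randn(800, D);                     % stand-in test-utterance features
avgLogLik = zeros(1, length(models));
for L = 1:length(models)
    avgLogLik(L) = mean(log(gmmprob(models{L}, testFeats) + realmin));
end
[bestScore, predictedLanguage] = max(avgLogLik);
```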
Results • OGI Multilanguage Telephone Speech Database [Muthusamy et al., 1992] • Mutually exclusive training and test sets. • SD-LP-CC performed best on the 3-Language and 10-Language tasks • SD coefficients increased accuracy by ~10% over standard LP and MF feature vectors • Results were consistent and repeatable
3-Language Task
Pre/Post Disabled:
LP-CC 59.40% | SD-LP-CC 71.95%
MF-CC 54.86% | SD-MF-CC 67.92%
PLP-CC 61.51% | SD-PLP-CC 63.31%
Pre/Post Enabled:
LP-CC 47.15% | SD-LP-CC 65.49%
MF-CC 52.50% | SD-MF-CC 63.26%
PLP-CC 47.64% | SD-PLP-CC 61.20%
Accuracy Vs. Amount of Training Data
(Figure: classification accuracy plotted against the amount of training data, with an approximate trend line.)
• Outliers are due to high mixture order, a low amount of training data, and the stochastic nature of the data selection and the NETLAB training algorithm.
• The cutoff amount of training data needed to assure accurate results increases with the mixture order.
Comparisons • Results agree with previous work on SDC • reported accuracies between 70%-75% [Deller et al., 2002] [Kohler, 2002] [Reynolds, 2002] • This thesis specifically addresses the effects of different derivations of SDC on LID
Comparisons, cont’d… • Also [Qu & Wang, 2003] • Used GMBM-UBBM • 70.128% accuracy • 128 Mixtures • Our algorithm achieved 71.13% with: • Reduced mixture order • Reduced training data • Without using bigram or universal background modeling
Conclusions • SDC features improved LID performance over standard features across all categories • SD-LP-CC performed best overall in both the 3-language and 10-language tasks.
Future Work • Gender Specific Modeling • Perform hill climbing • Add UBM • KL - Divergence
Bibliography
• Daniel P. W. Ellis. PLP and RASTA (and MFCC, and inversion) in Matlab. http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2005.
• Ian T. Nabney. NETLAB: Algorithms for Pattern Recognition. Advances in Pattern Recognition. Springer, 2002.
• H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
• Y. K. Muthusamy, R. A. Cole, and B. T. Oshika. The OGI multi-language telephone speech corpus. Proceedings of the 1992 International Conference on Spoken Language Processing (ICSLP 92), Alberta, 1992.
• Dan Qu and Bingxi Wang. Automatic language identification based on GMBM-UBBM. Proceedings of the 2003 International Conference on Natural Language Processing and Knowledge Engineering, 26-29 Oct. 2003, pp. 722-727.
• J. R. Deller Jr., P. A. Torres-Carrasquillo, and D. A. Reynolds. Language Identification Using Gaussian Mixture Model Tokenization. Proc. International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, IEEE, 2002, pp. 757-760.
• M. K. Kohler and M. A. Kennedy. Language identification using shifted delta cepstra. The 2002 45th Midwest Symposium on Circuits and Systems (MWSCAS-2002), 2002, vol. 3, pp. 69-72.
• D. A. Reynolds, M. A. Kohler, R. J. Greene, J. R. Deller Jr., E. Singer, and P. A. Torres-Carrasquillo. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. Proc. International Conference on Spoken Language Processing, Denver, CO, ISCA, pp. 33-36, 82-92, September 2002.
• M. Zissman. Automatic Language Identification using Gaussian Mixture and Hidden Markov Models. ICASSP, 1993.
• J. Picone. Signal Modeling Techniques in Speech Recognition. Proc. IEEE, vol. 81, pp. 1215-1247, Sept. 1993.
Supplemental Slides • Speech Production • Cepstral Coefficients • Feature Calculation • Linear Prediction • Psycho-Acoustic Scaling (Mel-Frequency) • Perceptual Linear Prediction • 3-Language Confusion Matrices
Source-Filter Model http://www.spectrum.uni-bielefeld.de/~thies/HTHS_WiSe2005-06/source-filter.jpg
Cepstral Coefficients • A compact way of representing the formant envelope of a speech signal. • Formant envelope information is encoded in the first few cepstral coefficients. • We use the first 12 coefficients, but omit the very first.
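As an illustration of the idea (not the thesis's exact feature code), the real cepstrum of one windowed frame can be computed directly, keeping the first 12 coefficients and skipping the very first:

```matlab
% Real-cepstrum sketch for one windowed speech frame (base MATLAB only).
n     = (0:255)';
w     = 0.54 - 0.46 * cos(2 * pi * n / 255);    % Hamming window, written out
frame = randn(256, 1) .* w;                     % stand-in windowed frame
spec  = abs(fft(frame, 512));                   % magnitude spectrum
ceps  = real(ifft(log(spec + eps)));            % real cepstrum
c     = ceps(2:13);                             % keep 12 coefficients, omit c0
```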
Linear Prediction • Models the formant envelope as an all-pole filter 1/A(z), where S(z) is the speech waveform and E(z) is the excitation signal: S(z) = E(z) / A(z)
Linear Prediction • The Linear Predictive coefficients can be found by solving the autocorrelation normal equations, where R = [R1, R2, …, RP+1] is the auto-correlation vector, a = [a1, a2, …, aP+1] is the Linear Predictive coefficient vector, P denotes the model order, [· · ·]−1 denotes the matrix inverse, and ∗ denotes the complex conjugate operation.
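A minimal base-MATLAB sketch of solving these normal equations directly (Levinson-Durbin would be the usual fast solver); the leading-1 prediction-error convention below is an assumption for illustration:

```matlab
% LP-coefficient sketch: estimate an order-P predictor from the autocorrelation
% sequence of one frame by solving the Toeplitz normal equations.
P = 12;
x = randn(256, 1);                                   % stand-in speech frame
R = zeros(P + 1, 1);
for k = 0:P
    R(k + 1) = sum(x(1:end - k) .* x(k + 1:end));    % autocorrelation lags 0..P
end
Rmat = toeplitz(R(1:P));                             % P-by-P autocorrelation matrix
a    = Rmat \ R(2:P + 1);                            % predictor coefficients a_1..a_P
A    = [1; -a];                                      % prediction-error filter A(z)
```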
Calculation of LP-CC’s • Create an N-point frequency spectrum by evaluating the all-pole model 1/A(z) at N equally spaced points on the unit circle • Then find the Cepstral Coefficients from the log magnitude of this spectrum
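An illustrative (assumed, not thesis-verified) recipe for that computation, starting from a stand-in prediction-error filter A(z):

```matlab
% LP-CC sketch: sample the all-pole model 1/A(z) at N points on the unit circle,
% then take cepstra from the log-magnitude spectrum.
A    = [1; -1.3; 0.8; -0.1];          % stand-in prediction-error filter coefficients
N    = 512;
H    = 1 ./ fft(A, N);                % N-point frequency response of 1/A(z)
ceps = real(ifft(log(abs(H) + eps))); % cepstra of the LP spectral envelope
lpcc = ceps(2:13);                    % keep 12 coefficients, omit c0
```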
Calculation of MF-CC’s • Compute the power spectrum of the input signal • Apply a filter bank of Mel-scaled filters • Sum the weighted energy in each channel
Cont’d • Then take the inverse discrete cosine transform of the log10 of the channel energies to obtain the MF-CCs.
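A compact toy sketch of this recipe (a hand-built filterbank with an assumed standard mel formula, not the RASTA-MAT implementation used in the thesis):

```matlab
% MFCC sketch: power spectrum -> triangular mel filterbank energies ->
% log10 -> cosine transform. Toy parameters, base MATLAB only.
fs = 8000; Nfft = 256; nFilt = 20; nCep = 12;
n     = (0:Nfft-1)';
frame = randn(Nfft, 1) .* (0.54 - 0.46 * cos(2 * pi * n / (Nfft - 1)));
Pspec = abs(fft(frame, Nfft)).^2;
Pspec = Pspec(1:Nfft/2 + 1);                  % non-negative frequencies only
mel   = @(f) 2595 * log10(1 + f / 700);       % Hz -> mel (assumed standard formula)
imel  = @(m) 700 * (10 .^ (m / 2595) - 1);    % mel -> Hz
edges = imel(linspace(mel(0), mel(fs / 2), nFilt + 2));   % filter edge frequencies
bins  = (0:Nfft/2) * fs / Nfft;               % FFT bin centre frequencies
E = zeros(nFilt, 1);
for m = 1:nFilt
    up   = (bins - edges(m))   / (edges(m+1) - edges(m));     % rising slope
    down = (edges(m+2) - bins) / (edges(m+2) - edges(m+1));   % falling slope
    Hm   = max(0, min(up, down));             % triangular weighting for channel m
    E(m) = Hm * Pspec;                        % summed channel energy
end
logE = log10(E + eps);
mfcc = zeros(nCep, 1);                        % cosine transform of log energies
for c = 1:nCep
    mfcc(c) = sum(logE' .* cos(pi * c / nFilt * ((1:nFilt) - 0.5)));
end
```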
Perceptual Linear Prediction • First use Perceptual Scaling, such as Mel, on the spectrum. • Then use Linear Prediction to derive the Cepstral Coefficients. • In studies by [Hermansky, 1990], PLP was shown to reduce speaker dependence.
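Since the thesis uses RASTA-MAT [Ellis, 2005] for feature extraction, the PLP cepstra would typically be obtained along these lines. The rastaplp function name and its argument order are assumed from the RASTA-MAT distribution, so treat this as a hedged usage sketch rather than the thesis's exact call:

```matlab
% Hedged usage sketch: PLP cepstra via RASTA-MAT's rastaplp (assumed signature
% rastaplp(samples, samplerate, dorasta, modelorder)). Requires RASTA-MAT on the path.
fs = 8000;
x  = randn(fs * 3, 1);                        % stand-in for 3 s of telephone-band speech
[plpcep, plpspec] = rastaplp(x, fs, 0, 12);   % dorasta = 0 (plain PLP), model order 12
plpcc = plpcep(2:end, :);                     % drop the 0th coefficient, keep 12 cepstra
```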