Application of Shifted Delta Cepstral Features for GMM Language Identification Masters Thesis Defense Jonathan J. Lareau, Rochester Institute of Technology, Department of Computer Science Tuesday, Oct. 24, 2006 12:00 pm Full report can be found at: www.jonlareau.com/JonathanLareau.pdf
What’s the use of LID? • Pre-Processor for Automatic Speech Recognition Algorithms • Routing speech signals for: • Telecommunications • Multimedia • Human-Computer interfaces • Security applications.
Overview • Study the use of different types of Shifted Delta (SD) feature vectors for telephone speech language identification (LID). • 6 types of feature vectors • Uniform GMM pattern recognition algorithm.
Additionally… • Heuristic speech enhancement pre-processor • No phonemically labeled training data • Original code written in MATLAB 7 • Also Uses: • NETLAB [Nabney, 2002] • RASTA-MAT [Ellis, 2005]
Methods for LID • Phonemic Recognition followed by Language Modeling (PRLM) [Zissman, 1993] • Predominant method for ASR/LID • Works well, but laborious + time consuming • Difficult to extend
Why use Gaussian Mixture Models? • Alternative to PRLM methods • Avoids the laborious phonemic labeling required by PRLM techniques • Comparatively easy to extend
Method - Feature Vectors • Pre-Processing • Pre-Emphasis • Cepstral Speech Enhancement • Base Feature Extraction • Post-Processing • Cepstral Mean Subtraction • Shifted Delta Operation • Silence Removal
Pre-Emphasis Filtering (Pre-Processor) • The speech spectrum has a natural attenuation of approximately 20 dB/decade [Picone, 1993] • The pre-emphasis filter, Hpre(z), flattens the speech spectrum: Hpre(z) = 1 + apre z^-1
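As an illustration, a pre-emphasis step of this form is a one-line filter in MATLAB (the language the thesis code was written in). The coefficient value below is an assumed typical choice, not necessarily the one used in the thesis:

```matlab
% Pre-emphasis sketch: Hpre(z) = 1 + a_pre*z^-1 applied to a speech signal.
% a_pre = -0.95 is an assumed, commonly used value (the thesis's value may differ).
a_pre = -0.95;
x = randn(8000, 1);                 % stand-in for one second of 8 kHz telephone speech
x_pre = filter([1 a_pre], 1, x);    % y[n] = x[n] + a_pre * x[n-1]
```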
Base Feature Vector Extraction • All based on Cepstral Coefficients • Linear Predictive (LP-CC) • All-pole filter models the formant envelope • Mel-Frequency (MF-CC) • Psycho-acoustic frequency scaling • Mimics the response of the human ear • Perceptual Linear Prediction (PLP-CC) • Psycho-acoustic scaling followed by Linear Prediction
Cepstral Mean Subtraction (Post-Processor) • A simple method to reduce channel effects. • Once the Cepstral feature vectors are calculated using MF, LP, or PLP, the mean feature vector over the entire utterance is subtracted from each frame.
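A minimal sketch of this step (not the thesis code), with C holding one cepstral feature vector per row:

```matlab
% Cepstral mean subtraction sketch: subtract the utterance-level mean vector
% from every frame of a T-by-D matrix of cepstral features.
C = randn(300, 12);                             % stand-in: 300 frames of 12 cepstra
C_cms = C - repmat(mean(C, 1), size(C, 1), 1);
```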
Shifted Delta Cepstra • Pseudo-prosodic feature vectors from acoustic feature vectors • Quick approximation to true prosodic modeling • ‘Stacks’ blocks of evenly spaced and differenced feature vectors.
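A minimal sketch of the shifted delta stacking operation. The N-d-P-k parameterization (N base cepstra, delta spread d, frame shift P, k stacked blocks) and the 7-1-3-7 values below are the common settings from the SDC literature, assumed here for illustration rather than taken from the thesis:

```matlab
% Shifted delta cepstra sketch. C is a T-by-N matrix of base cepstral vectors.
% For each frame t, stack k delta vectors taken at offsets 0, P, 2P, ...:
%   delta_i(t) = C(t + i*P + d, :) - C(t + i*P - d, :),   i = 0 .. k-1
d = 1; P = 3; k = 7;                      % assumed common SDC settings (7-1-3-7)
C = randn(300, 7);                        % stand-in: 300 frames of 7 base cepstra
T = size(C, 1);
lastValid = T - ((k - 1) * P + d);        % last frame with a complete stack
SDC = zeros(lastValid - d, k * size(C, 2));
for t = (d + 1):lastValid
    row = [];
    for i = 0:(k - 1)
        row = [row, C(t + i*P + d, :) - C(t + i*P - d, :)];
    end
    SDC(t - d, :) = row;
end
```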
Method - Classification Task • Each language induces a different distribution over the feature space. • Difficult because: • Shape alone doesn’t distinguish between languages • Density information over the feature space must also be captured
…so we use Gaussian Mixture Models
• The multivariate normal (M-V-N) PDF for component j is: p(x|j) = (2π)^(-D/2) |Σj|^(-1/2) exp( -(1/2) (x - μj)' Σj^(-1) (x - μj) )
• A Gaussian Mixture Model (GMM) is then: p(x) = Σ_{j=1..M} w(j) p(x|j)
• With conditions: Σ_{j=1..M} w(j) = 1 and w(j) ≥ 0
• w(j) is the mixture weight (prior) for component j
• M is the model order (number of mixture components)
• p(x|j) is the M-V-N PDF for the jth component, with mean μj, covariance Σj, and feature dimension D
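To make the formulas concrete, here is a small self-contained sketch (not the thesis code) that evaluates a diagonal-covariance GMM density p(x) = Σ w(j)·p(x|j) for a batch of feature vectors; the toy dimensions and parameters are placeholders:

```matlab
% Evaluate a diagonal-covariance GMM density for each row of X (T-by-D).
% mu is M-by-D (component means), v is M-by-D (per-dimension variances),
% w is 1-by-M (mixture weights summing to 1).
T = 100; D = 12; M = 3;
X  = randn(T, D);
mu = randn(M, D);
v  = ones(M, D);
w  = ones(1, M) / M;
p = zeros(T, 1);
for j = 1:M
    xm    = X - repmat(mu(j, :), T, 1);
    expo  = -0.5 * sum((xm .^ 2) ./ repmat(v(j, :), T, 1), 2);
    norml = (2 * pi) ^ (-D / 2) / sqrt(prod(v(j, :)));
    p = p + w(j) * norml * exp(expo);      % accumulate w(j) * p(x|j)
end
```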
Gaussian Mixture Models model distributions as well as shape
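For the classification back end, a per-language GMM can be trained and applied roughly as below. This is a hedged sketch that assumes the standard NETLAB functions gmm, gmminit, gmmem, and gmmprob [Nabney, 2002]; it is not the thesis's training script, and the mixture order, iteration counts, and stand-in data are illustrative only:

```matlab
% Sketch: train one GMM per language with NETLAB, then classify a test
% utterance by the highest average log-likelihood. Assumes NETLAB is on the path.
D = 12; M = 16;                                % feature dimension, mixture order (illustrative)
trainFeats = {randn(5000, D), randn(5000, D), randn(5000, D)};   % stand-in per-language features
models = cell(1, length(trainFeats));
for L = 1:length(trainFeats)
    mix  = gmm(D, M, 'diag');                  % diagonal-covariance GMM
    opts = zeros(1, 18); opts(14) = 10;        % k-means initialisation iterations
    mix  = gmminit(mix, trainFeats{L}, opts);
    opts(14) = 30;                             % EM iterations
    mix  = gmmem(mix, trainFeats{L}, opts);
    models{L} = mix;
end
testFeats = randn(800, D);                     % stand-in test-utterance features
avgLogLik = zeros(1, length(models));
for L = 1:length(models)
    avgLogLik(L) = mean(log(gmmprob(models{L}, testFeats) + realmin));
end
[bestScore, predictedLanguage] = max(avgLogLik);
```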
Results • OGI Multilanguage Telephone Speech Database [Muthusamy et al., 1992] • Mutually exclusive training and test sets. • SD-LP-CC performed best on the 3-Language and 10-Language tasks • SD coefficients increased accuracy by ~10% over standard LP and MF feature vectors • Results were consistent and repeatable
3-Language Task
Pre/Post Disabled:
LP-CC 59.40% | SD-LP-CC 71.95%
MF-CC 54.86% | SD-MF-CC 67.92%
PLP-CC 61.51% | SD-PLP-CC 63.31%
Pre/Post Enabled:
LP-CC 47.15% | SD-LP-CC 65.49%
MF-CC 52.50% | SD-MF-CC 63.26%
PLP-CC 47.64% | SD-PLP-CC 61.20%
Accuracy Vs. Amount of Training Data
(Figure: classification accuracy plotted against the amount of training data, with an approximate trend line.)
• Outliers are due to high mixture order, a low amount of training data, and the stochastic nature of the data selection and the NETLAB training algorithm.
• The cutoff amount of training data needed to assure accurate results increases with the mixture order.
Comparisons • Results agree with previous work on SDC • reported accuracies between 70%-75% [Deller et al., 2002] [Kohler, 2002] [Reynolds, 2002] • This thesis specifically addresses the effects of different derivations of SDC on LID
Comparisons, cont’d… • Also [Qu & Wang, 2003] • Used GMBM-UBBM • 70.128% accuracy • 128 Mixtures • Our algorithm achieved 71.13% with: • Reduced mixture order • Reduced training data • Without using bigram or universal background modeling
Conclusions • SDC features improved LID performance over standard features across all categories • SD-LP-CC performed best overall in both the 3-language and 10-language tasks.
Future Work • Gender Specific Modeling • Perform hill climbing • Add UBM • KL - Divergence
Bibliography
• Daniel P. W. Ellis. PLP and RASTA (and MFCC, and inversion) in Matlab. http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/, 2005.
• Ian T. Nabney. NETLAB: Algorithms for Pattern Recognition. Advances in Pattern Recognition. Springer, 2002.
• H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
• Y. K. Muthusamy, R. A. Cole, and B. T. Oshika. The OGI multi-language telephone speech corpus. Proceedings of the 1992 International Conference on Spoken Language Processing (ICSLP 92), Alberta, 1992.
• Dan Qu and Bingxi Wang. Automatic language identification based on GMBM-UBBM. Proceedings of the 2003 International Conference on Natural Language Processing and Knowledge Engineering, 26-29 Oct. 2003, pp. 722-727.
• J. R. Deller Jr., P. A. Torres-Carrasquillo, and D. A. Reynolds. Language Identification Using Gaussian Mixture Model Tokenization. Proc. International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, IEEE, 2002, pp. 757-760.
• M. K. Kohler and M. A. Kennedy. Language identification using shifted delta cepstra. The 2002 45th Midwest Symposium on Circuits and Systems (MWSCAS-2002), 2002, vol. 3, pp. 69-72.
• D. A. Reynolds, M. A. Kohler, R. J. Greene, J. R. Deller Jr., E. Singer, and P. A. Torres-Carrasquillo. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. Proc. International Conference on Spoken Language Processing, Denver, CO, ISCA, pp. 33-36, 82-92, September 2002.
• M. Zissman. Automatic Language Identification using Gaussian Mixture and Hidden Markov Models. ICASSP, 1993.
• J. Picone. Signal Modeling Techniques in Speech Recognition. Proc. IEEE, vol. 81, pp. 1215-1247, Sept. 1993.
Supplemental Slides • Speech Production • Cepstral Coefficients • Feature Calculation • Linear Prediction • Psycho-Acoustic Scaling (Mel-Frequency) • Perceptual Linear Prediction • 3-Language Confusion Matrices
Source-Filter Model http://www.spectrum.uni-bielefeld.de/~thies/HTHS_WiSe2005-06/source-filter.jpg
Cepstral Coefficients • A compact way of representing the formant envelope of a speech signal. • Formant envelope information is encoded in the first few cepstral coefficients. • We use the first 12 coefficients, but omit the very first.
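As an illustration of the idea (not the thesis's exact feature code), the real cepstrum of one windowed frame can be computed directly, keeping the first 12 coefficients and skipping the very first:

```matlab
% Real-cepstrum sketch for one windowed speech frame (base MATLAB only).
n     = (0:255)';
w     = 0.54 - 0.46 * cos(2 * pi * n / 255);    % Hamming window, written out
frame = randn(256, 1) .* w;                     % stand-in windowed frame
spec  = abs(fft(frame, 512));                   % magnitude spectrum
ceps  = real(ifft(log(spec + eps)));            % real cepstrum
c     = ceps(2:13);                             % keep 12 coefficients, omit c0
```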
Linear Prediction • Models the formant envelope as an all-pole filter 1/A(z), where S(z) is the speech waveform and E(z) is the excitation signal: S(z) = E(z) / A(z)
Linear Prediction • The Linear Predictive coefficients can be found by solving the autocorrelation normal equations, where R = [R1, R2, …, RP+1] is the auto-correlation vector, a = [a1, a2, …, aP+1] is the Linear Predictive coefficient vector, P denotes the model order, [· · ·]−1 denotes the matrix inverse, and ∗ denotes the complex conjugate operation.
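A minimal base-MATLAB sketch of solving these normal equations directly (Levinson-Durbin would be the usual fast solver); the leading-1 prediction-error convention below is an assumption for illustration:

```matlab
% LP-coefficient sketch: estimate an order-P predictor from the autocorrelation
% sequence of one frame by solving the Toeplitz normal equations.
P = 12;
x = randn(256, 1);                                   % stand-in speech frame
R = zeros(P + 1, 1);
for k = 0:P
    R(k + 1) = sum(x(1:end - k) .* x(k + 1:end));    % autocorrelation lags 0..P
end
Rmat = toeplitz(R(1:P));                             % P-by-P autocorrelation matrix
a    = Rmat \ R(2:P + 1);                            % predictor coefficients a_1..a_P
A    = [1; -a];                                      % prediction-error filter A(z)
```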
Calculation of LP-CC’s • Create an N-point frequency spectrum by evaluating the all-pole model 1/A(z) at N equally spaced points on the unit circle • Then find the Cepstral Coefficients from the log magnitude of this spectrum
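An illustrative (assumed, not thesis-verified) recipe for that computation, starting from a stand-in prediction-error filter A(z):

```matlab
% LP-CC sketch: sample the all-pole model 1/A(z) at N points on the unit circle,
% then take cepstra from the log-magnitude spectrum.
A    = [1; -1.3; 0.8; -0.1];          % stand-in prediction-error filter coefficients
N    = 512;
H    = 1 ./ fft(A, N);                % N-point frequency response of 1/A(z)
ceps = real(ifft(log(abs(H) + eps))); % cepstra of the LP spectral envelope
lpcc = ceps(2:13);                    % keep 12 coefficients, omit c0
```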
Calculation of MF-CC’s • Compute the power spectrum of the input signal • Apply a filter bank of Mel-scaled filters • Sum the weighted energy in each channel
Cont’d • Then take the inverse discrete cosine transform of the log10 of the channel energies to obtain the MF-CCs.
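A compact toy sketch of this recipe (a hand-built filterbank with an assumed standard mel formula, not the RASTA-MAT implementation used in the thesis):

```matlab
% MFCC sketch: power spectrum -> triangular mel filterbank energies ->
% log10 -> cosine transform. Toy parameters, base MATLAB only.
fs = 8000; Nfft = 256; nFilt = 20; nCep = 12;
n     = (0:Nfft-1)';
frame = randn(Nfft, 1) .* (0.54 - 0.46 * cos(2 * pi * n / (Nfft - 1)));
Pspec = abs(fft(frame, Nfft)).^2;
Pspec = Pspec(1:Nfft/2 + 1);                  % non-negative frequencies only
mel   = @(f) 2595 * log10(1 + f / 700);       % Hz -> mel (assumed standard formula)
imel  = @(m) 700 * (10 .^ (m / 2595) - 1);    % mel -> Hz
edges = imel(linspace(mel(0), mel(fs / 2), nFilt + 2));   % filter edge frequencies
bins  = (0:Nfft/2) * fs / Nfft;               % FFT bin centre frequencies
E = zeros(nFilt, 1);
for m = 1:nFilt
    up   = (bins - edges(m))   / (edges(m+1) - edges(m));     % rising slope
    down = (edges(m+2) - bins) / (edges(m+2) - edges(m+1));   % falling slope
    Hm   = max(0, min(up, down));             % triangular weighting for channel m
    E(m) = Hm * Pspec;                        % summed channel energy
end
logE = log10(E + eps);
mfcc = zeros(nCep, 1);                        % cosine transform of log energies
for c = 1:nCep
    mfcc(c) = sum(logE' .* cos(pi * c / nFilt * ((1:nFilt) - 0.5)));
end
```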
Perceptual Linear Prediction • First use Perceptual Scaling, such as Mel, on the spectrum. • Then use Linear Prediction to derive the Cepstral Coefficients. • In studies by [Hermansky, 1990], PLP was shown to reduce speaker dependence.
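Since the thesis uses RASTA-MAT [Ellis, 2005] for feature extraction, the PLP cepstra would typically be obtained along these lines. The rastaplp function name and its argument order are assumed from the RASTA-MAT distribution, so treat this as a hedged usage sketch rather than the thesis's exact call:

```matlab
% Hedged usage sketch: PLP cepstra via RASTA-MAT's rastaplp (assumed signature
% rastaplp(samples, samplerate, dorasta, modelorder)). Requires RASTA-MAT on the path.
fs = 8000;
x  = randn(fs * 3, 1);                        % stand-in for 3 s of telephone-band speech
[plpcep, plpspec] = rastaplp(x, fs, 0, 12);   % dorasta = 0 (plain PLP), model order 12
plpcc = plpcep(2:end, :);                     % drop the 0th coefficient, keep 12 cepstra
```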