170 likes | 378 Views
Language and Speaker Identification using Gaussian Mixture Model. Prepare by Jacky Chau The Chinese University of Hong Kong 18th September, 2002. Outline. Motivation of having Speaker Identification and Language Identification System Reasons of using Gaussian Mixture Model Experiment
E N D
Language and Speaker Identification using Gaussian Mixture Model Prepare by Jacky Chau The Chinese University of Hong Kong 18th September, 2002
Outline • Motivation of having Speaker Identification and Language Identification System • Reasons of using Gaussian Mixture Model • Experiment • Results • Conclusion
Motivation of having Language Identification • To detect multi-lingual speech for further speech recognition system such as IBM ViaVoice and Microsoft SAPI • Demo
Motivation of having Speaker Identification • Speaker Identification Systems using speech recognition technique: • Security Lock System • Speaker Tracking System to locating the specified person in the video clips (Demo) • Voice Mail System
Reasons of using Gaussian Mixture Model • Gaussian Mixture Model is a type of density model which includes a number of component functions • These component functions are combined to provide a multi-model density
Reasons of using Gaussian Mixture Model (cont’) • Because speakers and languages have their own statistical density
Reasons of using Gaussian Mixture Model (cont’) • Expectation-Maximization (EM) is a well established maximum likelihood algorithm for fitting a mixture model to a set of training data. • By EM, a set of GMMs is trained and used for recognition.
Steps of the Identification Systems • Pre-processing Stage: • 1. Collect the specific speakers or languages sound clips • 2. Train the GMM using the collected sound clips • Testing Stage: • 1. Calculate the scores of each GMM • 2. Select the maxium score as the result
Probability Score Calculation • After EM training, the GMM is include means and variances for each mixture component functions. • Using the equation shown below, the score is calculated:
Experiment Setup • In Common: • Testing and Training Data • Score: News Video Clips • Audio Data: 22kHz • Feature: 24 MFCC • No. of Mixture: 256 • Duration: >6 seconds
Language Identification Experiment Setup • For Language Identification System: • Three Languages are tested: 1. English, 2. Cantonese and 3. Putonghua • 14 sound clips are tested for each case, i.e. 42 sound clips in total
Result of Language Identification • Overall it can get 36 out of 42 correct. (~85% accurate) • Thus, the language is identified and the appropriated speech recognition engine is selected for “speech to text” process.
Speaker Identification Experiment Setup • For Speaker Identification System: • Testing speakers: five males and five females, totally 50 sound clips • Can be close-set or open-set experiment • For close-set experiment, we select the maximum score achieved as our result. • For open-set experiment, the result score is normalized with the score of silence model
Result of Speaker Identification • Close-set • i.e. select the best speaker within the group • correct: 49 out of 50 (~98%) • Open-set • correct: 45 out of 50 (~90%) • false alarm (accepted wrong speaker): 1.5% • false reject (rejected correct speaker): 6%
Conclusion • GMM is suitable for statistical analysis • Language Identification ==> 85% • Speaker Identification ==> ~90%