1 / 16

Language and Speaker Identification using Gaussian Mixture Model

Language and Speaker Identification using Gaussian Mixture Model. Prepare by Jacky Chau The Chinese University of Hong Kong 18th September, 2002. Outline. Motivation of having Speaker Identification and Language Identification System Reasons of using Gaussian Mixture Model Experiment

shamus
Download Presentation

Language and Speaker Identification using Gaussian Mixture Model

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Language and Speaker Identification using Gaussian Mixture Model Prepare by Jacky Chau The Chinese University of Hong Kong 18th September, 2002

  2. Outline • Motivation of having Speaker Identification and Language Identification System • Reasons of using Gaussian Mixture Model • Experiment • Results • Conclusion

  3. Motivation of having Language Identification • To detect multi-lingual speech for further speech recognition system such as IBM ViaVoice and Microsoft SAPI • Demo

  4. Motivation of having Speaker Identification • Speaker Identification Systems using speech recognition technique: • Security Lock System • Speaker Tracking System to locating the specified person in the video clips (Demo) • Voice Mail System

  5. Reasons of using Gaussian Mixture Model • Gaussian Mixture Model is a type of density model which includes a number of component functions • These component functions are combined to provide a multi-model density

  6. Reasons of using Gaussian Mixture Model (cont’) • Because speakers and languages have their own statistical density

  7. Reasons of using Gaussian Mixture Model (cont’) • Expectation-Maximization (EM) is a well established maximum likelihood algorithm for fitting a mixture model to a set of training data. • By EM, a set of GMMs is trained and used for recognition.

  8. Steps of the Identification Systems • Pre-processing Stage: • 1. Collect the specific speakers or languages sound clips • 2. Train the GMM using the collected sound clips • Testing Stage: • 1. Calculate the scores of each GMM • 2. Select the maxium score as the result

  9. System Diagram

  10. Probability Score Calculation • After EM training, the GMM is include means and variances for each mixture component functions. • Using the equation shown below, the score is calculated:

  11. Experiment Setup • In Common: • Testing and Training Data • Score: News Video Clips • Audio Data: 22kHz • Feature: 24 MFCC • No. of Mixture: 256 • Duration: >6 seconds

  12. Language Identification Experiment Setup • For Language Identification System: • Three Languages are tested: 1. English, 2. Cantonese and 3. Putonghua • 14 sound clips are tested for each case, i.e. 42 sound clips in total

  13. Result of Language Identification • Overall it can get 36 out of 42 correct. (~85% accurate) • Thus, the language is identified and the appropriated speech recognition engine is selected for “speech to text” process.

  14. Speaker Identification Experiment Setup • For Speaker Identification System: • Testing speakers: five males and five females, totally 50 sound clips • Can be close-set or open-set experiment • For close-set experiment, we select the maximum score achieved as our result. • For open-set experiment, the result score is normalized with the score of silence model

  15. Result of Speaker Identification • Close-set • i.e. select the best speaker within the group • correct: 49 out of 50 (~98%) • Open-set • correct: 45 out of 50 (~90%) • false alarm (accepted wrong speaker): 1.5% • false reject (rejected correct speaker): 6%

  16. Conclusion • GMM is suitable for statistical analysis • Language Identification ==> 85% • Speaker Identification ==> ~90%

More Related