
Automatic Speaker Recognition: Technologies, Evaluations and Possible Future



Presentation Transcript


  1. Automatic Speaker Recognition: Technologies, Evaluations and Possible Future
     Gérard CHOLLET, CNRS-LTCI, GET-ENST, chollet@tsi.enst.fr
     Automatic Speaker Recognition

  2. Outline
     • Why Speaker Recognition?
     • Taxonomy (i.e. tasks)
     • Applications (security, forensic, …)
     • Pros and Cons
     • Speaker Characteristics in the Speech Signal
     • How to perform Speaker Recognition?
     • Evaluation (NIST, …)
     • Voice Transformations and Forgery (occasional, dedicated)
     • Audio-visual Speaker Verification
     • Conclusions, Perspectives

  3. 1. Why Should a Computer Recognize Who Is Speaking?
     • Protection of individual property (home, bank account, personal data, messages, mobile phone, PDA, ...)
     • Restricted access (secured areas, databases)
     • Personalization (respond only to its master's voice)
     • Locating a particular person in an audio-visual document (information retrieval)
     • Who is speaking in a meeting?
     • Is a suspect the criminal? (forensic applications)

  4. 2. Taxonomy of the Automatic Speaker Recognition Tasks
     • Speaker verification (voice biometrics?)
        • Are you really who you claim to be?
     • Speaker identification (Speaker ID):
        • Is this speech segment coming from a known speaker?
        • How large is the set of speakers (population of the world)?
     • Speaker detection, segmentation, indexing, retrieval, tracking:
        • Looking for recordings of a particular speaker
     • Combining speech and speaker recognition
        • Adaptation to a new speaker, speaker typology
        • Personalization in dialogue systems

  5. 3. Applications
     • Access Control
        • Physical facilities, Computer networks, Websites
     • Transaction Authentication
        • Telephone banking, e-Commerce
     • Speech Data Management
        • Voice messaging, Search engines
     • Law Enforcement
        • Forensics, Home incarceration

  6. 4. Advantages and Disadvantages
     • Advantages
        • Best-suited modality over the telephone
        • Low cost (microphone, A/D), ubiquity
        • Possible integration on a smart (SIM) card
        • Natural bimodal fusion: speaking face
     • Disadvantages
        • Lack of discretion
        • Possibility of imitation and electronic imposture
        • Lack of robustness to noise, distortion, …
        • Temporal drift

  7. 5. Speaker Characteristics in the Speech Signal
     • Differences in
        • Vocal tract shapes and muscular control
        • Fundamental frequency (typical values)
           • 100 Hz (Male), 200 Hz (Female), 300 Hz (Child)
        • Glottal waveform
        • Phonotactics
        • Lexical usage
     • The difference between the voices of twins is a limiting case
     • Voices can also be imitated, disguised and electronically transformed
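
The typical fundamental-frequency values listed on this slide can be measured from a voiced frame. A minimal sketch, assuming a simple autocorrelation method (the slide does not say how F0 is estimated); the function name `estimate_f0`, the 60-400 Hz search range and the synthetic 100 Hz test tone are illustrative assumptions.

```python
import math

def estimate_f0(signal, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame by autocorrelation."""
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]          # remove any DC offset
    lag_min = int(sample_rate / f_max)      # shortest pitch period searched
    lag_max = int(sample_rate / f_min)      # longest pitch period searched
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, n - 1) + 1):
        # Autocorrelation at this lag, normalised by the overlap length
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / (n - lag)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag

# Synthetic 100 Hz tone standing in for a voiced frame (assumption)
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100.0 * t / sample_rate) for t in range(800)]
f0 = estimate_f0(frame, sample_rate)
```

On real speech the autocorrelation peak is less clean than on a pure tone, so practical pitch trackers add voicing decisions and smoothing across frames.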

  8. 5.1 Speaker Characteristics: different factors
     Supra-segmental factors (>30 ms)
        • speaking speed (timing and rhythm of speech units)
        • intonation patterns
        • dialect, accent, pronunciation habits
     Segmental factors (~30 ms)
        • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
        • vocal tract: characterized by its transfer function and represented by MFCCs (Mel Frequency Cepstral Coefficients)
     [Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes A and f]

  9. 5.2 Speaker Characteristics: Acoustic Features
     • Short-term spectral analysis
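
Short-term spectral analysis can be illustrated as follows: the signal is cut into overlapping frames, each frame is windowed, and a log power spectrum is computed per frame. This is a sketch only; the 30 ms frame length follows the segmental time scale mentioned earlier, while the 10 ms hop, the Hamming window, the function names and the 1 kHz test tone are assumptions.

```python
import cmath
import math

def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def log_power_spectrum(frame):
    """Hamming-window a frame and return its log power spectrum (bins 0..N/2)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # Plain DFT for clarity; a real system would use an FFT
        coeff = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                    for i, w in enumerate(windowed))
        # Small floor avoids log(0) on empty bins
        spectrum.append(math.log(abs(coeff) ** 2 + 1e-12))
    return spectrum

sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000.0 * t / sample_rate) for t in range(800)]
frames = frame_signal(signal, frame_len=240, hop=80)   # 30 ms frames, 10 ms hop
spectra = [log_power_spectrum(f) for f in frames]
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * sample_rate / 240
```

Cepstral features such as MFCCs are derived from spectra of this kind by applying a mel-scaled filterbank, a logarithm and a cosine transform.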

  10. 5.3 Speaker Characteristics: Intra- and Inter-Speaker Variability

  11. 6.1 How: history

  12. 6.2 How: current approaches

  13. 6.3 How: HMM Structure is Application Dependent

  14. 6.4 How: Gaussian Mixture Models (GMMs)
     • Parametric representation of the probability distribution of observations
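
The standard form of a GMM density is given below as a reference point. The notation is an assumption, not taken from the slide: x is a feature vector, M the number of components, and w_i, \mu_i, \Sigma_i the weight, mean and covariance of component i.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x ; \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0,
\qquad \lambda = \{ w_i, \mu_i, \Sigma_i \}_{i=1}^{M}
```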

  15. 6.5 How: GMM example, 8 Gaussians per mixture
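
To make the 8-Gaussian example concrete, the sketch below evaluates the log-likelihood of a feature vector under a GMM. It assumes diagonal covariances and uses made-up parameters (equal weights, unit variances, means along a line); none of these values come from the slide.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | GMM), computed with the log-sum-exp trick for stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical 8-component model over 2-dimensional features
M = 8
weights = [1.0 / M] * M
means = [[float(i), float(-i)] for i in range(M)]
variances = [[1.0, 1.0] for _ in range(M)]

near = gmm_log_likelihood([3.0, -3.0], weights, means, variances)   # close to a component mean
far = gmm_log_likelihood([30.0, 30.0], weights, means, variances)   # far from every component
```

In a verification system the score of an utterance would typically be the average of such per-frame log-likelihoods, compared between a client model and an impostor (background) model.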

  16. 6.6 How: Decision Theory for Speaker Verification
     • Two types of errors:
        • False rejection (a client is rejected)
        • False acceptance (an impostor is accepted)
     • Decision theory: given an observation O and a claimed identity
        • H0 hypothesis: it comes from an impostor
        • H1 hypothesis: it comes from our client
        • H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as a likelihood-ratio test
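
Applying Bayes' law to both posteriors turns the decision rule into a likelihood-ratio test. The derivation below uses the slide's H0 (impostor) and H1 (client); writing p(O | H) for the likelihoods is a notational assumption.

```latex
P(H_1 \mid O) > P(H_0 \mid O)
\;\Longleftrightarrow\;
\frac{p(O \mid H_1)\,P(H_1)}{p(O)} > \frac{p(O \mid H_0)\,P(H_0)}{p(O)}
\;\Longleftrightarrow\;
\frac{p(O \mid H_1)}{p(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}
```

The right-hand side depends only on the priors, so in practice it acts as a decision threshold that can be tuned to the application.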

  17. 6.8 How: Decision

  18. 6.9 How: Distribution of scores

  19. 6.10 How: Detection Error Tradeoff (DET) Curve
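
The points of a DET curve are obtained by sweeping the decision threshold over the scores and recording the false-rejection and false-acceptance rates at each step. A minimal sketch, assuming higher scores favour the client; the score lists are synthetic and the function names are illustrative.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false-rejection rate, false-acceptance rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    """
    fr = sum(s < threshold for s in client_scores) / len(client_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fr, fa

def det_points(client_scores, impostor_scores):
    """One (threshold, FR, FA) triple per distinct score value."""
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    return [(t,) + error_rates(client_scores, impostor_scores, t)
            for t in thresholds]

def equal_error_rate(points):
    """Approximate EER: the operating point where FR and FA are closest."""
    _, fr, fa = min(points, key=lambda p: abs(p[1] - p[2]))
    return (fr + fa) / 2

# Synthetic scores for illustration only
client = [2.1, 1.7, 0.9, 2.5, 1.2, 0.3, 1.9, 2.8]
impostor = [-1.0, 0.1, -0.4, 0.5, -1.8, 1.0, -0.2, -0.9]
points = det_points(client, impostor)
eer = equal_error_rate(points)
```

A DET plot draws these (FA, FR) pairs on normal-deviate axes, which spreads out the low-error region compared with a ROC curve.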

  20. 7. Evaluation
     • Decision cost (FA, FR, priors, costs, …)
     • Reference systems (open software)
        • Torch - a Machine Learning library (www.torch.ch)
        • ALIZE (www.lia.univ-avignon.fr/heberges/ALIZE/)
        • BECARS (www.tsi.enst.fr/~blouet/Becars/)
     • Evaluations (algorithms, field trials, ergonomics, …)
        • NIST Speaker detection campaigns
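
A decision cost combines the two error rates with their costs and the prior probability of a client trial. The sketch below assumes the usual weighted-sum form; the default values (cost 10 for a false rejection, cost 1 for a false acceptance, client prior 0.01) are illustrative settings in the style of the NIST campaigns, not values taken from the slide.

```python
def detection_cost(p_fr, p_fa, c_fr=10.0, c_fa=1.0, p_target=0.01):
    """Expected cost of a system operating at (p_fr, p_fa).

    p_fr:     false-rejection rate on client trials
    p_fa:     false-acceptance rate on impostor trials
    c_fr:     cost of a false rejection (assumed default)
    c_fa:     cost of a false acceptance (assumed default)
    p_target: prior probability of a client trial (assumed default)
    """
    return c_fr * p_fr * p_target + c_fa * p_fa * (1.0 - p_target)

# Hypothetical operating point: 10% false rejections, 2% false acceptances
cost = detection_cost(0.10, 0.02)
```

With a small client prior, false acceptances dominate the cost even though each one is cheaper, which pushes the optimal threshold towards stricter acceptance.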

  21. 7.1 Evaluations: National Institute of Standards & Technology (NIST)
     • Annual evaluation since 1995
     • Common paradigm for comparing technologies

  22. 7.2 Evaluations: NIST 2004

  23. 8. Voice Transformations and Forgery (occasional, dedicated)
     • Isolated individuals with few resources or “professional impostors” with a dedicated budget can threaten the security of speaker recognition systems
     • Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now available
     • Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures
     • Prevention by predicting many different forgery scenarios

  24. Speaking Faces: Motivations
     A person speaking in front of a camera offers two modalities for identity verification (speech and face).
     The sequence of face images and the synchronisation of speech and lip movements could be exploited.
     Imposture is much more difficult than with single modalities.
     Many PCs, PDAs and mobile phones are equipped with a camera.
     Audio-Visual Identity Verification will offer non-intrusive security for e-commerce, e-banking, …

  25. 9.1 Speaking faces: Audio-Visual Approach

  26. A talking face model
     Using Hidden Markov Models (HMMs)
     Each state of the model generates a sequence of feature vectors

  27. 10. Conclusions and Perspectives
     • Deliberate forgery is a challenge for speech-only systems
     • Verification of identity based on features extracted from talking faces should be developed
     • Common databases and evaluation protocols are necessary
     • Free access to reference systems and databases will facilitate future developments
     • Apply this paradigm to more than audio-visual modalities => see BioSecure-NoE
