
Speech course of 9 March 2005. Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet


Presentation Transcript


  1. Speech course of 9 March 2005, instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet. Speaker Recognition • Introduction, history, application domains • Identity cues in speech • Speaker verification • Decision theory • Text-dependent / text-independent • Voice imposture • Audio-visual identity verification • Evaluations • Conclusions

  2. Why should a computer recognize who is speaking ? • Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA,...) • Limited access (secured areas, data bases) • Personalization (only respond to its master’s voice) • Locate a particular person in an audio-visual document (information retrieval) • Who is speaking in a meeting ? • Is a suspect the criminal ? (forensic applications)

  3. Tasks in Automatic Speaker Recognition • Speaker verification (Voice Biometrics) • Are you really who you claim to be ? • Identification (Speaker ID) : • Is this speech segment coming from a known speaker ? • How large is the set of speakers (population of the world) ? • Speaker detection, segmentation, indexing, retrieval, tracking : • Looking for recordings of a particular speaker • Combining Speech and Speaker Recognition • Adaptation to a new speaker, speaker typology • Personalization in dialogue systems

  4. Applications • Access Control • Physical facilities, Computer networks, Websites • Transaction Authentication • Telephone banking, e-Commerce • Speech data Management • Voice messaging, Search engines • Law Enforcement • Forensics, Home incarceration

  5. Voice Biometric • Advantages • Often the only modality over the telephone • Low cost (microphone, A/D), ubiquity • Possible integration on a smart (SIM) card • Natural bimodal fusion: speaking face • Disadvantages • Lack of discretion • Possibility of imitation and electronic imposture • Lack of robustness to noise, distortion,… • Temporal drift

  6. Speaker Identity in Speech • Differences in • Vocal tract shapes and muscular control • Fundamental frequency (typical values) • 100 Hz (Male), 200 Hz (Female), 300 Hz (Child) • Glottal waveform • Phonotactics • Lexical usage • The differences between voices of twins are a limiting case • Voices can also be imitated or disguised

  7. Speaker Identity • Suprasegmental factors • speaking speed (timing and rhythm of speech units) • intonation patterns • dialect, accent, pronunciation habits • Segmental factors (~30 ms) • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness) • vocal tract: characterized by its transfer function and represented by MFCCs (Mel Frequency Cepstral Coefficients) • [Figure: spectral envelopes of /i:/ for Speaker A and Speaker B, amplitude vs. frequency]
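As a rough illustration of the vocal-tract features mentioned above, the sketch below extracts MFCCs and their first-order deltas with the librosa library (the library choice and the file name client.wav are assumptions of this example, not part of the course material):

```python
# Minimal MFCC + delta extraction sketch (assumes librosa is installed;
# "client.wav" is only a placeholder file name).
import librosa
import numpy as np

y, sr = librosa.load("client.wav", sr=16000)        # mono waveform at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 cepstral coefficients per frame
delta = librosa.feature.delta(mfcc)                 # first-order time derivatives
features = np.vstack([mfcc, delta]).T               # one 26-dimensional vector per frame
print(features.shape)
```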

  8. What are the sources of difficulty ? • Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions,…) • Recording conditions (filtering, noise,…) • Channel mismatch between enrolment and testing • Temporal drift • Intentional imposture • Voice disguise

  9. Acoustic features • Short term spectral analysis
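A minimal sketch of short-term spectral analysis in Python/NumPy, assuming typical 25 ms frames with a 10 ms shift (the frame sizes are common defaults, not values given on the slide):

```python
# Short-term spectral analysis sketch: fixed-length frames, Hamming window,
# power spectrum per frame.
import numpy as np

def short_term_spectra(signal, sr, frame_ms=25, shift_ms=10):
    frame_len = int(sr * frame_ms / 1000)
    shift = int(sr * shift_ms / 1000)
    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)) ** 2)  # power spectrum of the frame
    return np.array(spectra)                             # shape (n_frames, n_bins)

# Example on synthetic audio: 1 s of noise at 16 kHz
sr = 16000
x = np.random.randn(sr)
print(short_term_spectra(x, sr).shape)
```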

  10. Intra- and Inter-speaker variability

  11. Speaker Verification • Typology of approaches (EAGLES Handbook) • Text dependent • Public password • Private password • Customized password • Text prompted • Text independent • Incremental enrolment • Evaluation

  12. History of Speaker Recognition

  13. Current approaches

  14. Dynamic Time Warping (DTW) • The test utterance (“Bonjour”, test speaker Y) is aligned along the best path against the reference templates of “Bonjour” from speakers 1, 2, …, X, …, n • DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
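A minimal DTW sketch in Python/NumPy to illustrate the template-matching idea; the feature arrays and the length normalisation are illustrative assumptions, not a reconstruction of the historical systems cited above:

```python
# Dynamic Time Warping sketch: align a test utterance against each client
# template of the same password and keep the smallest warped distance.
# Inputs are frame-level feature arrays of shape (n_frames, dim).
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])    # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m] / (n + m)                        # length-normalised path cost

# Hypothetical usage: templates = {speaker_name: feature_array};
# the claimed speaker is accepted if dtw_distance(test, template) is small enough.
```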

  15. Vector Quantization (VQ) • The test utterance (“Bonjour”, test speaker Y) is quantized against the codebooks (dictionaries) of speakers 1, 2, …, X, …, n; the codebook giving the best quantization identifies the speaker • SOONG, ROSENBERG 1987
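A sketch of the VQ approach using scikit-learn's KMeans for the per-speaker codebooks (an assumption of this example, not the original 1987 implementation):

```python
# Vector Quantization sketch: one k-means codebook per speaker; a test
# utterance is scored by its average quantization distortion per codebook.
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(features, size=64):
    return KMeans(n_clusters=size, n_init=10, random_state=0).fit(features)

def vq_distortion(codebook, features):
    centers = codebook.cluster_centers_
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).mean()        # average nearest-codeword distance

# The claimed speaker is accepted if vq_distortion(client_codebook, test_features)
# is below a threshold tuned on development data.
```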

  16. Hidden Markov Models (HMM) • The test utterance (“Bonjour”, test speaker Y) is scored along the best path against the “Bonjour” HMMs of speakers 1, 2, …, X, …, n • ROSENBERG 1990, TSENG 1992

  17. Ergodic HMM • The test utterance (“Bonjour”, test speaker Y) is scored along the best path against the ergodic HMMs of speakers 1, 2, …, X, …, n • PORITZ 1982, SAVIC 1990
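A hedged sketch of per-speaker Gaussian HMMs using the hmmlearn package (assumed to be available); the number of states and iterations are arbitrary illustrative choices, not values from the cited systems:

```python
# Per-speaker HMM sketch: train one Gaussian HMM per speaker on that
# speaker's frames, then pick the model with the highest log-likelihood
# on the test frames.
from hmmlearn.hmm import GaussianHMM

def train_speaker_hmm(features, n_states=5):
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(features)                    # Baum-Welch (EM) training
    return model

def best_speaker(models, test_features):
    # models: dict {speaker_name: trained GaussianHMM}
    scores = {name: m.score(test_features) for name, m in models.items()}
    return max(scores, key=scores.get)     # highest log-likelihood wins
```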

  18. Gaussian Mixture Models (GMM) REYNOLDS 1995

  19. HMM structure depends on the application

  20. Some issues in Text-dependent Speaker Verification Systems: the CAVE and PICASSO projects • Sequences of digits • Speaker-independent HMM of each digit • Adaptation of these HMMs to the client voice (during enrolment and incremental enrolment) • EER of less than 1 % can be achieved • Customized password • The client chooses his password using some feedback from the system • Deliberate imposture

  21. Gaussian Mixture Model • Parametric representation of the probability distribution of observations: p(x | λ) = Σ_{i=1..M} w_i N(x; μ_i, Σ_i), with mixture weights w_i ≥ 0 and Σ_i w_i = 1

  22. Gaussian Mixture Models • 8 Gaussians per mixture
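A minimal sketch of fitting such an 8-Gaussian, diagonal-covariance speaker model with scikit-learn (the library choice and hyper-parameters are assumptions of this example, not the system described in the slides):

```python
# GMM speaker model sketch: an 8-component diagonal-covariance mixture
# fitted by EM on the client's training frames.
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=8):
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          max_iter=100, random_state=0)
    gmm.fit(features)                      # EM estimation of w_i, mu_i, Sigma_i
    return gmm

# gmm.score_samples(test_features) then returns per-frame log p(x | lambda).
```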

  23. GMM speaker modeling • World data → front-end → GMM modeling → world GMM model • Target speaker data → front-end → GMM model adaptation of the world model → target GMM model

  24. Baseline GMM method • Test speech → front-end → scored against the hypothesized target GMM model and the world GMM model → LLR (log-likelihood ratio) score
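A sketch of the baseline LLR scoring, assuming the target and world models are fitted GaussianMixture objects as in the previous sketch. Note that the slides describe adapting the world model to the target speaker, which scikit-learn does not provide out of the box, so here the target model is simply trained separately:

```python
# GMM-UBM style scoring sketch: average log-likelihood ratio between the
# hypothesised target model and the world (background) model.
import numpy as np

def llr_score(target_gmm, world_gmm, test_features):
    ll_target = target_gmm.score_samples(test_features)   # log p(x_t | target)
    ll_world = world_gmm.score_samples(test_features)     # log p(x_t | world)
    return np.mean(ll_target - ll_world)

# Accept the claimed identity if llr_score(...) exceeds a decision threshold.
```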

  25. Decision theory for identity verification • Two types of errors: • False rejection (a client is rejected) • False acceptance (an impostor is accepted) • Decision theory: given an observation O and a claimed identity • H0 hypothesis: it comes from an impostor • H1 hypothesis: it comes from our client • H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes’ law) as P(O|H1) / P(O|H0) > P(H0) / P(H1)
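A small illustration of the resulting likelihood-ratio test, with the prior on the client hypothesis as a free parameter (the prior values are illustrative only, not from the slides):

```python
# Likelihood-ratio decision sketch: accept H1 (the client) when the
# log-likelihood ratio exceeds the threshold log P(H0)/P(H1).
import math

def accept(log_lr, p_client=0.5):
    threshold = math.log((1.0 - p_client) / p_client)   # log P(H0) / P(H1)
    return log_lr > threshold

# Example: a score of +0.7 is rejected when clients are rare (P(H1)=0.1,
# threshold ~2.2) but accepted with equal priors (P(H1)=0.5, threshold 0).
print(accept(0.7, p_client=0.1), accept(0.7, p_client=0.5))
```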

  26. Signal detection theory

  27. Decision

  28. Distribution of scores

  29. Detection Error Tradeoff (DET) Curve
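A sketch of how the DET operating points (false rejection vs. false acceptance) and the equal error rate can be computed from client and impostor scores; the score arrays are hypothetical inputs, not data from the course:

```python
# DET-style error trade-off sketch: sweep a threshold over the scores and
# compute false-rejection and false-acceptance rates; the EER is where the
# two curves cross. Inputs are 1-D NumPy arrays of scores.
import numpy as np

def error_rates(client_scores, impostor_scores, n_points=200):
    lo = min(client_scores.min(), impostor_scores.min())
    hi = max(client_scores.max(), impostor_scores.max())
    thresholds = np.linspace(lo, hi, n_points)
    frr = np.array([(client_scores < t).mean() for t in thresholds])    # false rejections
    far = np.array([(impostor_scores >= t).mean() for t in thresholds]) # false acceptances
    return thresholds, frr, far

def eer(client_scores, impostor_scores):
    _, frr, far = error_rates(client_scores, impostor_scores)
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0
```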

  30. Evaluation • Decision cost (FA, FR, priors, costs,…) • Receiver Operating Characteristic curve • Reference systems (open software) • Evaluations (algorithms, field trials, ergonomics,…)
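For the decision cost mentioned above, a sketch in the spirit of the NIST detection cost function; the cost and prior values below are common illustrative defaults, not values taken from the slides:

```python
# Detection cost sketch: combine miss and false-alarm rates with the target
# prior and the error costs.
def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=10.0, c_fa=1.0):
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# Example: 5 % misses and 2 % false alarms give a cost of about 0.025.
print(detection_cost(p_miss=0.05, p_fa=0.02))
```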

  31. NIST Speaker Verification Evaluations • A reference standard to compare algorithms and stimulate new developments • Distribution (via LDC) of development and test databases with: • Increasing difficulty (from landline to mobile) • Several hundred speakers (2 min of training data per client) • Several thousand test accesses (5 to 50 s per access) • Participation of 15-20 labs every year (MIT, IBM, Nuance, Queensland Univ, ELISA consortium,…) • Annual workshop, special issues in journals, …

  32. National Institute of Standards & Technology (NIST) Speaker Verification Evaluations • Annual evaluation since 1995 • Common paradigm for comparing technologies

  33. Speaker Verification (text independent) • The ELISA consortium • ENST, LIA, IRISA, ... • http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html • BECARS : Balamand-ENST CEDRE Automatic Recognition of Speakers • NIST evaluations • http://www.nist.gov/speech/tests/spk/index.htm

  34. NIST evaluations : Results

  35. Evaluations: NIST 2004

  36. Combining Speech Recognition and Speaker Verification. • Speaker independent phone HMMs • Selection of segments or segment classes which are speaker specific • Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

  37. ALISP: Automatic Language Independent Speech Processing • Data-driven speech segmentation

  38. Searching in client and world speech dictionaries for speaker verification purposes

  39. Fusion
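The slides do not specify the fusion rule; as one common option, here is a hedged sketch of a weighted sum of z-normalised scores from two systems or modalities (the 0.7/0.3 weighting is purely illustrative and would be tuned on development data):

```python
# Score-level fusion sketch: normalise each score stream, then combine
# them with a fixed weight.
import numpy as np

def z_norm(scores):
    return (scores - scores.mean()) / scores.std()

def fuse(scores_a, scores_b, w=0.7):
    return w * z_norm(scores_a) + (1.0 - w) * z_norm(scores_b)
```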

  40. Fusion results

  41. Voice Transformations and Forgery (occasional, dedicated) • Isolated individuals with few resources or “professional impostors” with a dedicated budget can threaten the security of speaker recognition systems • Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now readily available • Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures • Prevention by anticipating many different forgery scenarios

  42. Voice Forgery using ALISP • A modification of a source speaker’s (the impostor’s) speech to imitate a target speaker (the client) • The forged utterance may contain the same words as the client’s or not

  43. Conversion system: ALISP encoder • Speech → MFCC + delta analysis → HMM recognition (using the HMM models) → symbol index • HNM analysis: harmonic envelope, noise envelope, prosody (energy + pitch) • Choice of the best representative unit from the database of HNM representatives → representative index and DTW path

  44. Conversion system: ALISP decoder • Symbol index, representative index, DTW path and prosody (pitch, energy, timing) → concatenation of the HNM parameters of each representative → HNM synthesis → speech signal

  45. Preliminary results: DET curves • FA before forgery: 16 ± 2.0 % (1700 files) • FA after forgery: 26 ± 2.0 % (1700 files)

  46. Preliminary results True distributions

  47. Multimodal Identity Verification • M2VTS (face and speech) • front view and profile • pseudo-3D with coherent light • BIOMET (face, speech, fingerprint, signature, hand shape) • data collection • reuse of the M2VTS and DAVID databases • experiments on the fusion of modalities

  48. Speaking Faces: Motivations • In many situations a video sequence is acquired • Fusion of face and speech increases robustness • Forgery is more difficult

  49. Talking Face Recognition (hybrid verification)

  50. Lip features • Tracking lip movements
