Speech course, 9 March 2005. Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet
Speaker Recognition
• Introduction, history, application domains
• Identity cues in speech
• Speaker verification
• Decision theory
• Text dependent / text independent
• Voice imposture
• Audio-visual identity verification
• Evaluations
• Conclusions
Why should a computer recognize who is speaking?
• Protection of individual property (home, bank account, personal data, messages, mobile phone, PDA, ...)
• Restricted access (secured areas, databases)
• Personalization (respond only to its master's voice)
• Locating a particular person in an audio-visual document (information retrieval)
• Who is speaking in a meeting?
• Is a suspect the criminal? (forensic applications)
Tasks in Automatic Speaker Recognition
• Speaker verification (voice biometrics):
  • Are you really who you claim to be?
• Identification (speaker ID):
  • Is this speech segment coming from a known speaker?
  • How large is the set of speakers (population of the world)?
• Speaker detection, segmentation, indexing, retrieval, tracking:
  • Looking for recordings of a particular speaker
• Combining speech and speaker recognition:
  • Adaptation to a new speaker, speaker typology
  • Personalization in dialogue systems
Applications
• Access control
  • Physical facilities, computer networks, websites
• Transaction authentication
  • Telephone banking, e-commerce
• Speech data management
  • Voice messaging, search engines
• Law enforcement
  • Forensics, home incarceration
Voice Biometrics
• Advantages
  • Often the only modality available over the telephone
  • Low cost (microphone, A/D converter), ubiquity
  • Possible integration on a smart (SIM) card
  • Natural bimodal fusion: speaking face
• Disadvantages
  • Lack of discretion
  • Possibility of imitation and electronic imposture
  • Lack of robustness to noise, distortion, ...
  • Temporal drift
Speaker Identity in Speech
• Differences in
  • Vocal tract shapes and muscular control
  • Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
  • Glottal waveform
  • Phonotactics
  • Lexical usage
• The difference between the voices of twins is a limiting case
• Voices can also be imitated or disguised
Speaker Identity
• Suprasegmental factors
  • speaking speed (timing and rhythm of speech units)
  • intonation patterns
  • dialect, accent, pronunciation habits
• Segmental factors (~30 ms)
  • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  • vocal tract: characterized by its transfer function and represented by MFCCs (Mel-Frequency Cepstral Coefficients)
[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes labeled A and f]
What are the sources of difficulty?
• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, ...)
• Recording conditions (filtering, noise, ...)
• Channel mismatch between enrolment and testing
• Temporal drift
• Intentional imposture
• Voice disguise
Acoustic features • Short term spectral analysis
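As a rough illustration of short-term spectral analysis, the sketch below cuts a signal into overlapping frames, applies a Hamming window, and computes the magnitude spectrum of each frame with a plain DFT. The frame length, hop size, window choice, and function names are assumptions made for this toy example; the slide does not specify them, and a real front-end would use an FFT and typically go on to cepstral features.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def magnitude_spectrum(frame):
    """Magnitude of the DFT of one windowed frame (bins 0 .. N/2)."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hamming(n))]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

# Toy input (assumed): a 1 kHz sine sampled at 8 kHz.
fs = 8000
frame_len, hop = 32, 16          # very short frames, only to keep the example fast
signal = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(256)]

spectra = [magnitude_spectrum(f) for f in frames(signal, frame_len, hop)]
peak_bin = max(range(len(spectra[0])), key=spectra[0].__getitem__)
peak_hz = peak_bin * fs / frame_len   # frequency of the strongest bin
```

Each entry of `spectra` is one short-term spectrum; the sequence of such spectra over time is what the later modeling stages consume.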
Speaker Verification
• Typology of approaches (EAGLES Handbook)
  • Text dependent
    • Public password
    • Private password
    • Customized password
  • Text prompted
  • Text independent
• Incremental enrolment
• Evaluation
Dynamic Time Warping (DTW)
DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
[Figure: the test utterance “Bonjour” from speaker Y is aligned (best path) against the reference “Bonjour” of speaker 1, speaker 2, ..., speaker X, ..., speaker n]
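The best-path alignment can be sketched as follows. This is a minimal textbook DTW under stated assumptions (Euclidean local distance, steps limited to insertion, deletion, and match, no slope constraints, one-dimensional toy features), not the exact variant used in the cited systems; the template values and speaker names are invented for illustration.

```python
import math

def dtw_distance(a, b):
    """Accumulated cost of the best alignment path between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])      # frame-to-frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # test frame repeated
                                     cost[i][j - 1],      # reference frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Hypothetical reference templates of the same password, one per speaker.
templates = {
    "speaker_1": [(0.0,), (1.0,), (2.0,), (1.0,)],
    "speaker_2": [(3.0,), (3.0,), (0.0,), (0.0,)],
}
# Test utterance: a slower rendition of speaker_1's template.
test = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (1.0,)]

distances = {name: dtw_distance(test, ref) for name, ref in templates.items()}
closest = min(distances, key=distances.get)
```

For verification rather than identification, the distance to the claimed speaker's template would be compared with a threshold instead of taking the minimum over all speakers.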
Vector Quantization (VQ)
SOONG, ROSENBERG 1987
[Figure: the test utterance “Bonjour” from speaker Y is quantized (best quantization) with the codebook of speaker 1, speaker 2, ..., speaker X, ..., speaker n]
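A minimal sketch of codebook-based scoring, assuming each speaker is represented by a small set of codewords and the test frames are scored by their average distance to the nearest codeword. The codebooks, frame values, and names are invented for illustration; codebook training (for example with k-means) is not shown.

```python
import math

def avg_distortion(frames, codebook):
    """Average distance from each frame to its nearest codeword."""
    return sum(min(math.dist(f, c) for c in codebook) for f in frames) / len(frames)

# Hypothetical two-dimensional codebooks, one per enrolled speaker.
codebooks = {
    "speaker_1": [(0.0, 0.0), (1.0, 1.0)],
    "speaker_2": [(5.0, 5.0), (6.0, 4.0)],
}
test_frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8)]

scores = {name: avg_distortion(test_frames, cb) for name, cb in codebooks.items()}
best = min(scores, key=scores.get)   # codebook with the lowest distortion
```

Because the frames are scored independently, this approach ignores temporal order, which is why it also suits text-independent use.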
Hidden Markov Models (HMM)
ROSENBERG 1990, TSENG 1992
[Figure: the test utterance “Bonjour” from speaker Y is aligned (best path) against the “Bonjour” model of speaker 1, speaker 2, ..., speaker X, ..., speaker n]
Ergodic HMM
PORITZ 1982, SAVIC 1990
[Figure: the test utterance “Bonjour” from speaker Y is scored (best path) against the HMM of speaker 1, speaker 2, ..., speaker X, ..., speaker n]
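The best-path score through an HMM is usually computed with the Viterbi algorithm. The sketch below assumes a discrete-observation HMM in which every state can reach every other state (ergodic) and all probabilities are non-zero; the model parameters and names are invented for illustration and do not come from the cited papers.

```python
import math

def viterbi_log_score(obs, start, trans, emit):
    """Log-probability of the single best state path for a discrete HMM."""
    states = range(len(start))
    delta = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states]
    for o in obs[1:]:
        delta = [max(delta[p] + math.log(trans[p][s]) for p in states)
                 + math.log(emit[s][o])
                 for s in states]
    return max(delta)

# Two hypothetical speaker models sharing the same topology (2 states, 2 symbols).
start = [0.5, 0.5]
trans = [[0.7, 0.3], [0.3, 0.7]]
emit_a = [[0.9, 0.1], [0.2, 0.8]]   # speaker A: states prefer different symbols
emit_b = [[0.5, 0.5], [0.5, 0.5]]   # speaker B: uninformative emissions

obs = [0, 0, 0, 1, 1]               # quantized test utterance
score_a = viterbi_log_score(obs, start, trans, emit_a)
score_b = viterbi_log_score(obs, start, trans, emit_b)
```

The speaker whose model gives the higher best-path score is the better match; working in the log domain avoids numerical underflow on long utterances.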
Gaussian Mixture Models (GMM) REYNOLDS 1995
Some issues in Text-dependent Speaker Verification Systems: The CAVE and PICASSO projects
• Sequences of digits
  • Speaker-independent HMM of each digit
  • Adaptation of these HMMs to the client's voice (during enrolment and incremental enrolment)
  • An EER (equal error rate) of less than 1 % can be achieved
• Customized password
  • The client chooses his password using some feedback from the system
• Deliberate imposture
Gaussian Mixture Model • Parametric representation of the probability distribution of observations:
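The formula itself did not survive in this transcript. The standard form of a Gaussian mixture density is given here as a reference; the symbols (M components, weights w_i, means \mu_i, covariances \Sigma_i, feature dimension D) are the usual textbook notation and are assumed rather than taken from the slide.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0,

\mathcal{N}(x;\, \mu_i, \Sigma_i) =
\frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right)
```

The model \lambda = \{ w_i, \mu_i, \Sigma_i \}_{i=1}^{M} is the parametric representation of a speaker; diagonal covariance matrices are a common simplification.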
Gaussian Mixture Models
[Figure: example with 8 Gaussians per mixture]
GMM speaker modeling
[Figure: world data → front-end → GMM modeling → world GMM model; target speaker data → front-end → GMM model adaptation → target GMM model]
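The slide does not say which adaptation scheme is used. A common choice is MAP adaptation of the means only, and the sketch below assumes that scheme, with one-dimensional features and a relevance factor of 16; all parameter values and names are illustrative.

```python
import math

def gauss(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def adapt_means(data, weights, means, variances, relevance=16.0):
    """MAP-adapt the world-model means toward the target speaker's data."""
    count = [0.0] * len(means)       # soft count of frames per component
    first = [0.0] * len(means)       # posterior-weighted sum of frames
    for x in data:
        lik = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        for i, l in enumerate(lik):
            post = l / total         # posterior probability of component i
            count[i] += post
            first[i] += post * x
    adapted = []
    for i, m in enumerate(means):
        alpha = count[i] / (count[i] + relevance)   # data-dependent mixing weight
        data_mean = first[i] / count[i] if count[i] > 0 else m
        adapted.append(alpha * data_mean + (1 - alpha) * m)
    return adapted

# Hypothetical world model with two components, and 16 target-speaker frames.
world_weights, world_means, world_vars = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
target_data = [3.0] * 16
target_means = adapt_means(target_data, world_weights, world_means, world_vars)
```

Components that see little target data stay close to the world model, so a short enrolment does not produce poorly estimated parameters.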
Baseline GMM method
[Figure: test speech → front-end → scored against the hypothesized target GMM model and the world GMM model → LLR (log-likelihood ratio) score]
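A minimal sketch of the LLR score, assuming one-dimensional features, frame independence, and averaging over frames. The two models, the test values, and the zero threshold are invented for illustration.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one frame under a one-dimensional GMM."""
    return math.log(sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)))

def llr_score(frames, target, world):
    """Average per-frame log-likelihood ratio: target model versus world model."""
    return sum(gmm_loglik(x, *target) - gmm_loglik(x, *world)
               for x in frames) / len(frames)

# Hypothetical models given as (weights, means, variances).
world = ([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
target = ([0.5, 0.5], [-2.0, 2.5], [1.0, 1.0])

client_score = llr_score([2.4, 2.6, 2.5, 2.7], target, world)
impostor_score = llr_score([1.5, 1.4, 1.6], target, world)

threshold = 0.0                     # assumed; tuned on development data in practice
accept_client = client_score > threshold
accept_impostor = impostor_score > threshold
```

Dividing by the world-model likelihood normalizes out effects shared by all speakers, such as the channel or the phonetic content of the utterance.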
Decision theory for identity verification
• Two types of errors:
  • False rejection (a client is rejected)
  • False acceptance (an impostor is accepted)
• Decision theory: given an observation O and a claimed identity
  • H0 hypothesis: O comes from an impostor
  • H1 hypothesis: O comes from our client
• H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as P(O|H1) / P(O|H0) > P(H0) / P(H1)
Evaluation
• Decision cost (FA, FR, priors, costs, ...)
• Receiver Operating Characteristic (ROC) curve
• Reference systems (open software)
• Evaluations (algorithms, field trials, ergonomics, ...)
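The quantities named above can be sketched on a toy score set. The code assumes that higher scores favor the client, sweeps the threshold over the observed scores to get the operating points of the ROC curve, approximates the equal error rate, and evaluates a weighted decision cost. The scores, the cost values, and the target prior are illustrative assumptions.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold."""
    fr = sum(s < threshold for s in client_scores) / len(client_scores)
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr

def roc_points(client_scores, impostor_scores):
    """(threshold, FA, FR) for every observed score used as a threshold."""
    thresholds = sorted(set(client_scores + impostor_scores))
    return [(t, *error_rates(client_scores, impostor_scores, t)) for t in thresholds]

def approx_eer(client_scores, impostor_scores):
    """Equal error rate, approximated at the point where FA and FR are closest."""
    _, fa, fr = min(roc_points(client_scores, impostor_scores),
                    key=lambda p: abs(p[1] - p[2]))
    return (fa + fr) / 2

def detection_cost(fa, fr, p_target=0.01, c_fa=1.0, c_fr=10.0):
    """Expected cost of errors given a target prior and per-error costs."""
    return c_fr * fr * p_target + c_fa * fa * (1 - p_target)

# Toy scores (assumed).
client = [2.0, 1.5, 0.8, 0.2]
impostor = [-1.0, -0.5, 0.3, 1.0]

fa, fr = error_rates(client, impostor, threshold=0.5)
eer = approx_eer(client, impostor)
cost = detection_cost(fa, fr)
```

With a low target prior the cost is dominated by false acceptances, so the cost-minimizing threshold generally differs from the equal-error threshold.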
NIST Speaker Verification Evaluations
• A reference standard to compare algorithms and stimulate new developments
• Distribution (via LDC) of development and test databases with:
  • Increasing difficulty (from land line to mobile)
  • Several hundred speakers (2 min of training data per client)
  • Several thousand test accesses (5 to 50 s per access)
• Participation of 15-20 labs every year (MIT, IBM, Nuance, Queensland Univ., ELISA consortium, ...)
• Annual workshop, special issues in journals, ...
National Institute of Standards & Technology (NIST) Speaker Verification Evaluations
• Annual evaluation since 1995
• Common paradigm for comparing technologies
Speaker Verification (text independent)
• The ELISA consortium
  • ENST, LIA, IRISA, ...
  • http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
• BECARS: Balamand-ENST CEDRE Automatic Recognition of Speakers
• NIST evaluations
  • http://www.nist.gov/speech/tests/spk/index.htm
Combining Speech Recognition and Speaker Verification
• Speaker-independent phone HMMs
• Selection of segments or segment classes which are speaker specific
• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)
ALISP: Automatic Language Independent Speech Processing
• Data-driven speech segmentation
Searching in client and world speech dictionaries for speaker verification purposes
Voice Transformations and Forgery (occasional, dedicated)
• Isolated individuals with few resources, or “professional impostors” with a dedicated budget, can threaten the security of speaker recognition systems
• Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are now available
• Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures
• Prevention by predicting many different forgery scenarios
Voice Forgery using ALISP
• A modification of a source speaker's speech to imitate a target speaker
[Figure: impostor speech → transformation → imitation of the client; the impostor and the client may or may not say the same words]
Conversion system: ALISP encoder
[Figure: block diagram. Input: speech. Blocks: MFCC analysis (MFCC + delta), HMM recognition using HMM models, HNM analysis (harmonic envelope, noise envelope), database of HNM representatives, choice of the best representative unit, prosody extraction (energy + pitch). Outputs: symbol index, representative index, DTW path, prosody]
Conversion system: ALISP decoder
[Figure: block diagram. Inputs: symbol index, representative index, DTW path, prosody (pitch, energy, timing). Blocks: concatenation of HNM parameters for each representative, HNM synthesis. Output: speech signal]
Preliminary results: DET curves
• FA (false acceptance) before forgery: 16 ± 2.0 % (1700 files)
• FA (false acceptance) after forgery: 26 ± 2.0 % (1700 files)
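The slide does not state how the ± 2.0 % margin was obtained. One common choice, assumed here, is the half-width of a 95 % confidence interval under the normal approximation to the binomial, treating the 1700 files as independent trials. Under that assumption the margins come out close to the reported figure.

```python
import math

def half_width_95(rate, n_trials):
    """Half-width of a 95 % normal-approximation interval for an error rate."""
    return 1.96 * math.sqrt(rate * (1 - rate) / n_trials)

before = half_width_95(0.16, 1700)   # about 0.017, i.e. roughly 1.7 %
after = half_width_95(0.26, 1700)    # about 0.021, i.e. roughly 2.1 %
```

Under the same assumption, the two intervals (about 14 to 18 % and 24 to 28 %) do not overlap, which would make the increase in false acceptances after forgery unlikely to be a sampling artifact.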
Preliminary results True distributions
Multimodal Identity Verification
• M2VTS (face and speech)
  • front view and profile
  • pseudo-3D with coherent light
• BIOMET (face, speech, fingerprint, signature, hand shape)
  • data collection
  • reuse of the M2VTS and DAVID databases
  • experiments on the fusion of modalities
Speaking Faces: Motivations
• In many situations a video sequence is acquired
• Fusion of face and speech increases robustness
• Forgery is more difficult
Lip features • Tracking lip movements