Speaker Recognition • G. CHOLLET, G. GRAVIER, J. KHARROUBI, D. PETROVSKA-DELACRETAZ • (chollet, kharroub, petrovsk)@tsi.enst.fr, ggravier@infres.enst.fr • ENST/CNRS-LTCI, 46 rue Barrault, 75634 PARIS cedex 13 • http://www.tsi.enst.fr/~chollet
Our affiliations • ENST: Ecole Nationale Supérieure des Télécommunications, http://www.enst.fr • CNRS: Centre National de la Recherche Scientifique, http://www.cnrs.fr • LTCI: Laboratoire de Traitement et Communication de l’Information, http://www.enst.fr/ura/ura.html
What is ENST? Ecole Nationale Supérieure des Télécommunications • Classed among the ‘Grandes Ecoles d'Ingénieurs’ • 250 state-certified engineers each year • Part of the ‘Groupement des Ecoles de Télécommunications’
Modalities for Identity Verification • A device you own (key, smart card, …) • A code you remember (password, …) • Both can be lost or stolen • Physiological characteristics: • Face, iris, fingerprint, hand shape, … • Need special equipment • Behavioral characteristics: • Speech, signature, keystroke, … • Speech is the preferred modality over the telephone (but a ‘voice print’ is much more variable than a fingerprint)
Outline • Where is the information about speaker identity in the speech signal? • How well can humans recognize a speaker? • Applications of Speaker Recognition • Prior knowledge of what the speaker said • Combining Speech Recognition and Speaker Verification • Some research activities at ENST: • Speaker verification: • The CAVE-PICASSO projects (text dependent) • The ELISA consortium, NIST evaluations (text independent) • The EUREKA project no. 2340, MAJORDOME • Multimodal Identity Verification: • The M2VTS and BIOMET projects • Perspectives
Speaker Identity in Speech • Differences in • Vocal tract shapes and muscular control • Fundamental frequency (typical values) • 100 Hz (Male), 200 Hz (Female), 300 Hz (Child) • Glottal waveform • Phonotactics • Lexical usage • The difference between the voices of twins is a limiting case • Voices can also be imitated or disguised
Speaker Identity • suprasegmental factors • speaking speed (timing and rhythm of speech units) • intonation patterns • dialect, accent, pronunciation habits • segmental factors (~30 ms) • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness) • vocal tract: formant frequencies and bandwidths [Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes A, f]
Inter-speaker Variability • Example utterance: “We were away a year ago.”
Intra-speaker Variability • Example utterance: “We were away a year ago.”
Glottal Waveform Modeling • Fitting a glottal pulse model to the excitation waveform allows perceptually relevant modifications to voice quality [Figure: original residual (blue) and synthetic residual (red); axes A, t]
Applications of Speaker Recognition • Identification from an open set (unrealistic) • Identification from a closed set (who is speaking in a videoconference?) • Verification of claimed identity (risk of deliberate imposture) • Human performance in speaker recognition is far from perfect (it depends heavily on familiarity with the subject)
Speaker Verification • Typology of approaches (EAGLES Handbook) • Text dependent • Public password • Private password • Customized password • Text prompted • Text independent • Incremental enrolment • Evaluation
What are the sources of difficulty ? • Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions,…) • Recording conditions (filtering, noise,…) • Temporal drift • Intentional imposture • Voice disguise
Text-dependent Speaker Verification • Uses Automatic Speech Recognition techniques (DTW, HMM, …) • Client model adaptation from speaker independent HMM (‘World’ model) • Synchronous alignment of client and world models for the computation of a score.
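As an illustration of the alignment step behind DTW-based text-dependent verification, here is a minimal sketch of dynamic time warping between two sequences of feature vectors. The function name `dtw_distance`, the Euclidean local distance and the unconstrained step pattern are assumptions made for this example; the slides do not specify them.

```python
import math

def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            # Allowed moves: advance in a, advance in b, or advance in both
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

A test utterance that is a time-stretched copy of the reference gets a cost of zero, which is the property that makes DTW tolerant to differences in speaking rate.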
Score normalisation • World model • Cohort normalisation • Discriminant techniques
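The first two normalisations listed above can be sketched as follows, assuming the inputs are log-likelihoods of the test utterance under each model. The function names are hypothetical, and averaging the cohort likelihoods is only one possible choice (taking the maximum over the cohort is another); the slides do not say which variant was used.

```python
import math

def log_mean_exp(log_values):
    """Log of the mean of exp(log_values), computed stably."""
    top = max(log_values)
    return top + math.log(sum(math.exp(v - top) for v in log_values) / len(log_values))

def world_normalised(client_ll, world_ll):
    """Log-likelihood ratio of the client model against the world model."""
    return client_ll - world_ll

def cohort_normalised(client_ll, cohort_lls):
    """Client log-likelihood minus the log of the mean cohort likelihood."""
    return client_ll - log_mean_exp(cohort_lls)
```

Either function turns a raw log-likelihood into a score that is easier to compare against a single speaker-independent threshold.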
CAVE – PICASSO http://www.picasso.ptt-telecom.nl/project/
Incremental enrolment of customised password • The client chooses a password using some feedback from the system. • The system attempts a phonetic transcription of the password. • Incremental enrolment is achieved on further repetitions of that password. • Speaker-independent phone HMMs are adapted with the client enrolment data. • Likelihood-ratio scoring with synchronous alignment is performed on access trials.
Deliberate imposture • The impostor has some recordings of the target client's voice. He can record the same sentences and align these speech signals with the recordings of the client. • A transformation (Multiple Linear Regression) is computed from these aligned data. • The impostor has heard the target client's password. • He records that password and applies the transformation to this recording. • The PICASSO reference system, which has less than 1 % EER, is defeated by this procedure (EER rises above 30 %).
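For readers unfamiliar with the EER (equal error rate) figures quoted above, here is a minimal sketch of how such a figure can be estimated from two lists of trial scores. The function name and the convention that a trial is accepted when its score is at or above the threshold are assumptions for this example, not the PICASSO evaluation protocol.

```python
def equal_error_rate(client_scores, impostor_scores):
    """Return (eer, threshold) where false acceptance and false rejection are closest."""
    best = None
    # Every observed score is a candidate decision threshold
    for t in sorted(set(client_scores) | set(impostor_scores)):
        # False rejection: genuine client trials scoring below the threshold
        frr = sum(s < t for s in client_scores) / len(client_scores)
        # False acceptance: impostor trials scoring at or above the threshold
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]
```

With well-separated score distributions the estimate is 0; when impostor scores move into the client range, as with the transformed recordings described above, the estimate rises.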
Speaker Verification (text independent) • The ELISA consortium • ENST, LIA, IRISA, ... • http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html • NIST evaluations • http://www.nist.gov/speech/tests/spk/index.htm • Ergodic HMM • Gaussian Mixture Model
Gaussian Mixture Model • Parametric representation of the probability distribution of observations:
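The formula itself did not survive in the slide text. The standard form of a Gaussian mixture density is given here for reference; the symbols (M components, weights w_i, means \mu_i, covariances \Sigma_i, dimension D) are the usual notation and are assumed rather than taken from the slide.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0,

\mathcal{N}(x;\, \mu_i, \Sigma_i) =
\frac{1}{(2\pi)^{D/2}\, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right)
```

Here \lambda = \{ w_i, \mu_i, \Sigma_i \}_{i=1}^{M} denotes the full set of model parameters.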
Gaussian Mixture Models [Figure: 8 Gaussians per mixture]
National Institute of Standards & Technology (NIST) Speaker Verification Evaluations • Annual evaluation since 1995 • Common paradigm for comparing technologies
GMM speaker modeling [Diagram: world data → front-end → GMM modeling → world GMM model; target speaker → front-end → GMM model adaptation → target GMM model]
Baseline GMM method [Diagram: test speech → front-end → scored against the hypothesised target GMM model and the world GMM model → LLR score]
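The LLR score of the baseline GMM method can be sketched as the average, over test frames, of the difference between the log-likelihood under the hypothesised target model and under the world model. Diagonal covariances, the `(weight, mean, variance)` tuple layout and the per-frame averaging are assumptions for this example; the slide does not give these details.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of frame x under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + diag_gauss_logpdf(x, mu, var) for w, mu, var in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def llr_score(frames, target_gmm, world_gmm):
    """Average per-frame log-likelihood ratio: target model against world model."""
    total = sum(gmm_loglik(x, target_gmm) - gmm_loglik(x, world_gmm) for x in frames)
    return total / len(frames)
```

A positive score favours the claimed identity and a negative score favours the world model; the accept or reject decision then compares the score with a threshold.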
Support Vector Machines and Speaker Verification • A hybrid GMM-SVM system is proposed • The SVM scoring model is trained on development data to separate true-target accesses from impostor accesses, using a new feature representation based on GMMs [Diagram: GMM modeling → SVM scoring]
SVM principles • Separating hyperplanes H, with the optimal hyperplane Ho [Diagram: X in the input space is mapped to y(X) in the feature space, where Ho determines Class(X)]
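For reference, the optimal hyperplane of a hard-margin SVM is usually defined as the separating hyperplane with the largest margin. The notation below is assumed for this note: \psi is the feature map (presumably what the slide text shows as y(X)), c_i \in \{-1, +1\} are the class labels of the N training examples X_i, and w, b are the hyperplane parameters. The slide does not state whether a hard or soft margin was used.

```latex
(w_o, b_o) = \arg\min_{w,\, b} \; \tfrac{1}{2} \lVert w \rVert^{2}
\quad \text{subject to} \quad
c_i \left( w \cdot \psi(X_i) + b \right) \ge 1, \qquad i = 1, \dots, N,

\mathrm{Class}(X) = \operatorname{sign}\!\left( w_o \cdot \psi(X) + b_o \right)
```

Minimising \lVert w \rVert maximises the margin 2 / \lVert w \rVert between the two classes.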
Combining Speech Recognition and Speaker Verification. • Speaker independent phone HMMs • Selection of segments or segment classes which are speaker specific • Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)
Selection of nasals in words in -ing • being, everything, getting, anything, thing, something, things, going
«MAJORDOME» Unified Messaging System • Eureka project no. 2340 • Partners: Vecsys, EDF, Software602, KTH, Mensatec, UPC, Airtel • D. Bahu-Leyser, G. Chollet, K. Hallouli, J. Kharroubi, L. Likforman, D. Mostefa, D. Petrovska, M. Sigelle, P. Vaillant
Majordome’s Functionalities • Speaker verification • Dialogue • Routing • Updating the agenda • Automatic summary [Diagram: MAJORDOME handling voice, fax and e-mail]
Voice technology in Majordome • Server side background tasks: continuous speech recognition applied to voice messages upon reception • Detection of sender’s name and subject • User interaction: • Speaker identification and verification • Speech recognition (receiving user commands through voice interaction) • Text-to-speech synthesis (reading text summaries, E-mails or faxes)
BIOMET • An extension of the M2VTS and DAVID projects to include such modalities as signature, fingerprint and hand shape. • Initial support (two years) is provided by GET (Groupement des Ecoles de Télécommunications). • Emphasis will be on fusion of scores obtained from two or more modalities.
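Score fusion across modalities can be illustrated with a weighted sum of normalised scores. This is only one simple fusion rule, chosen here for illustration; the modality names, score ranges, weights and function names are hypothetical and do not describe the BIOMET system.

```python
def min_max_normalise(score, low, high):
    """Map a raw score from [low, high] onto [0, 1]."""
    return (score - low) / (high - low)

def fuse_scores(scores, ranges, weights):
    """Weighted average of per-modality scores after min-max normalisation."""
    total_weight = sum(weights[m] for m in scores)
    fused = sum(
        weights[m] * min_max_normalise(scores[m], *ranges[m]) for m in scores
    )
    return fused / total_weight
```

The score ranges would normally be estimated on development data for each modality, and the weights tuned so that the more reliable expert contributes more to the fused decision.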
Conclusions and Perspectives • Evaluation trials (as conducted by NIST) help improve technology. • A strategy combining speech recognition and segmental scoring seems to be a promising approach for speaker verification. • Whenever possible, text-independent speaker verification should be confirmed by text-dependent verification. • Whenever possible, fusion of multiple experts (preferably multimodal) should be performed.