AUTOMATIC SPEAKER RECOGNITION AND CHARACTERIZATION BY MEANS OF ROBUST VOCAL SOURCE FEATURES. Ph.D. Candidate: Enrico Marchetto Supervisor: Ph.D. F. Avanzini. XXIII Cycle. Ph.D. School of Information Engineering Department of Information Engineering University of Padova.
AUTOMATIC SPEAKER RECOGNITION AND CHARACTERIZATION BY MEANS OF ROBUST VOCAL SOURCE FEATURES
Ph.D. Candidate: Enrico Marchetto
Supervisor: Ph.D. F. Avanzini
XXIII Cycle
Ph.D. School of Information Engineering
Department of Information Engineering
University of Padova
Padova - April 19th, 2011
Summary
• Introduction
• Background technologies
• Innovative features
• A complete Speaker Recognition System
• Conclusions
Dept. of Information Engineering - University of Padova
Introduction
My PhD Activities
• Start-up research project funded by the privately held company RT Radio Trevisan Elettronica – Trieste, Italy
• Ph.D. Research
• Visiting student at KTH – Stockholm, Nov. 2009 – Apr. 2010
• Sound and Music Com. Summer School, Porto, July 17-26 2009
• Presentations, Seminars and Posters
• Seminar: “Speaker recognition for security and intelligence” - KTH-TKK, Oct. 12, 2009
• Presentation: “Physical modelling of the glottis and acoustic-to-articulatory inversion”, Coll. Inform. Musicale - Torino, Oct. 5, 2010
• “An automatic speaker recognition system for intelligence applications”, 17th European Signal Processing Conference - Glasgow, Aug. 24–28, 2009
My PhD Activities
• “A spectral subtraction rule for real time DSP implementation of noise reduction in speech signals”, Digital Audio Effects – Como, Sept. 3, 2009
• Teaching activities
• Co-supervisor of two master's theses and one bachelor's thesis
• Lecturer at First SaMPL Summer School: “Introduzione alla Voce Umana - Fisiologia, sintesi e restauro” – Padova, Sept. 24, 2010
• Projects and technology transfer
• Among the winners of Start Cup Veneto 2010
• Among the founders of Bloop srl, a private spin-off
• Forum della Ricerca e dell’Innovazione - Padova, May 7-29, 2009
• Involved in grant applications
• Collaboration with 3D Everywhere, a Department spin-off
Speaker Recognition
• Automatic Speaker Verification or Detection:
• Given a claimed identity and a voice, the system has to tell whether the voice and the identity match
• Automatic Speaker Identification
• An identity has to be assigned to a given voice
• Speaker Skimming and Diarization
• Various applications:
• Human-computer interaction
• Automatic information structuring
• Access control
• Security and intelligence
Recognition evaluations
• Recognition systems are evaluated in standard conditions
• Speaker Recognition Evaluation (SRE) by NIST
• It is the reference evaluation among researchers
• Mobile and landline telephone audio, many handsets
• Quite challenging
• Some standard databases available
• SRE, Switchboard, TIMIT, …
• Different levels of recognition difficulty
• Steady improvements in the last 15 years
• A wide and active community is working on the topic
Background technologies
Audio Features
• Numeric characterization of voice/audio
• Many types exist, both for recognition and coding
• Mel-Frequency Cepstral Coefficients (MFCC)
• Cepstrum with enhanced frequency resolution
• Psychoacoustically motivated
• High performance on clean audio
MFCCs are usually augmented with their Delta features to retain dynamics
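
The Delta augmentation mentioned on this slide can be sketched as follows. This is a minimal illustration that assumes the common regression formula over ±2 frames with edge frames repeated; the slides do not state which Delta variant was used, and the function names `delta` and `add_deltas` are hypothetical.

```python
def delta(frames, width=2):
    """Regression-based delta of a list of feature vectors (one per frame).

    Assumption: d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2),
    with out-of-range frames clamped to the first/last frame.
    """
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        vec = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out


def add_deltas(frames, width=2):
    """Append the delta coefficients to each static feature vector."""
    return [static + dyn for static, dyn in zip(frames, delta(frames, width))]
```

Each output vector is twice as long as the input one: the static coefficients followed by their local slope over time.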
Statistical modelling
• Many tools are used in Speaker Recognition
• Most used: Gaussian Mixture Models (GMMs)
• An MFCC vector is extracted from each input audio frame
• Each vector is modelled as a point in a high-dimensional space
• The evolution in time is lost
• GMMs model the features as a whole probability distribution
• Good for avoiding explicit speech modelling
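
As a rough illustration of how a GMM scores a sequence of feature vectors, the sketch below evaluates the average per-frame log-likelihood. It assumes diagonal covariance matrices, a common choice that the slides do not confirm, and all function and parameter names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of one feature vector under the mixture (log-sum-exp)."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))


def average_log_likelihood(frames, weights, means, variances):
    """Frames are treated as independent, so the time ordering plays no role."""
    total = sum(gmm_log_likelihood(x, weights, means, variances) for x in frames)
    return total / len(frames)
```

Because the frames are simply summed, shuffling them leaves the score unchanged, which is the loss of time evolution noted on the slide.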
Likelihood ratio
• The recognition is built on a hypothesis test
• The likelihood of the hypothesized speaker (H0) model is compared with the likelihood from a background model
• A score is obtained as the log-likelihood ratio, i.e. the difference between the two log-likelihoods
• H0 is accepted if the score is above a threshold θ
• Trade-off between Miss and False Positive errors
• Ad-hoc DET diagrams show the performance
A Universal Background Model is needed: it models the impostors' voices
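
In the usual GMM-UBM formulation the score is the logarithm of the likelihood ratio, which is a difference of log-likelihoods. The notation below is an assumption made for illustration: X is the sequence of feature vectors, and λ_hyp and λ_UBM are the hypothesized-speaker and background models.

```latex
\Lambda(X) = \log p(X \mid \lambda_{\mathrm{hyp}}) - \log p(X \mid \lambda_{\mathrm{UBM}}),
\qquad
\text{accept } H_0 \text{ if } \Lambda(X) > \theta
```

Raising θ reduces False Positives at the cost of more Misses, and lowering it does the opposite.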
Detection Error Trade-off
• These diagrams show the trade-off between False Positive and Miss errors
• Scoring using a database key file
• The scores are divided into two classes: true speakers (the hypothesized speaker is correct) and impostors
• The classes are modelled as two normal distributions
• The DET curve is lower when the means of the two distributions are further apart → better class discrimination
• Threshold θ is selected according to some optimality criterion: a Detection Error Cost Function can be defined
• The main diagonal identifies the Equal Error Rate point
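
A minimal sketch of how the Miss and False Positive rates, and from them the Equal Error Rate, can be obtained from the two score classes. It works directly on the empirical scores rather than on fitted normal distributions, and the function names are hypothetical.

```python
def error_rates(target_scores, impostor_scores, theta):
    """Miss and False Positive rates when H0 is accepted for score > theta."""
    miss = sum(s <= theta for s in target_scores) / len(target_scores)
    false_pos = sum(s > theta for s in impostor_scores) / len(impostor_scores)
    return miss, false_pos


def equal_error_rate(target_scores, impostor_scores):
    """Scan the observed scores as thresholds; keep the one where the two
    error rates are closest, and report their mean as the EER."""
    best = None
    for theta in sorted(set(target_scores) | set(impostor_scores)):
        miss, false_pos = error_rates(target_scores, impostor_scores, theta)
        gap = abs(miss - false_pos)
        if best is None or gap < best[0]:
            best = (gap, (miss + false_pos) / 2.0, theta)
    return best[1], best[2]
```

Calling `error_rates` over a sweep of thresholds gives the points of the DET curve; the EER is the point where that curve crosses the main diagonal.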
Innovative features
The glottal signal
• Source of voiced sounds
• Not only vowels
• The flow is information-rich
• Speaker gender, age, mood and identity
• High intra- and inter-speaker variability
• Produced by the vocal folds
• Air is pushed against the closed glottis
• Pressure rises and the folds open, releasing an air jet
• The elastic force of the folds abruptly closes the glottis
Inverse vocal tract filtering
• Speech as glottal flow convolved with the vocal tract response
• Usually termed source-filter model
• Vowels identified by resonances / formants of the tract
• Several Inverse Filtering techniques tested
• Iterative Adaptive Inverse Filtering (IAIF)
• Roughly estimates the glottal pulse
• Estimates the vocal tract response
• Refines the glottal pulse estimate
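
The core operation behind inverse filtering can be sketched with a single linear-prediction pass: estimate an all-pole model of the signal, then apply its inverse as an FIR filter. This is not IAIF itself, which iterates such passes and adds further processing steps; it only illustrates the basic step under the source-filter assumption. Function names are hypothetical.

```python
def autocorr(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def lpc(x, order):
    """Levinson-Durbin recursion.

    Returns a[0..order] with a[0] = 1, the coefficients of the inverse
    filter A(z) = 1 + sum_k a[k] z^-k.
    """
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error
    return a


def inverse_filter(x, a):
    """Apply A(z) to x; the output is the prediction residual."""
    return [
        sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
        for n in range(len(x))
    ]
```

Applied to a voiced speech frame with a suitable model order, the residual is a crude approximation of the source excitation with the vocal tract resonances removed.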
Innovative glottal features /1
• Source features, obtained from the glottal flow
• Several IAIF runs to obtain many glottal pulses
• Estimated flows are quite noisy, not always reliable
• Unlikely pulses are removed
• To make the estimates more robust:
• K-Means clustering is applied (similar flows are grouped)
• Only cluster centroids (means) are kept
• Centroids sorted by increasing pitch and stored
Feature comparisons by N-to-N distance measure
No statistical model train/test involved so far
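
The clustering step described above can be sketched as follows. The sketch assumes that every glottal pulse has been resampled to a fixed-length vector, that a pitch value is available for each pulse, and that the pitch of a centroid is the mean pitch of its members; none of these details are given in the slides. Plain Euclidean distance and the function names are also assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vec(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(pulses, k, iterations=20, seed=0):
    """Plain K-Means; initial centroids are k pulses drawn at random."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(pulses, k)]
    labels = [0] * len(pulses)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in pulses
        ]
        for c in range(k):
            members = [p for p, l in zip(pulses, labels) if l == c]
            if members:                     # an empty cluster keeps its centroid
                centroids[c] = mean_vec(members)
    return centroids, labels


def prototypes_sorted_by_pitch(pulses, pitches, k):
    """Keep only the centroids, ordered by the mean pitch of their members."""
    centroids, labels = kmeans(pulses, k)
    entries = []
    for c in range(k):
        member_pitches = [f0 for f0, l in zip(pitches, labels) if l == c]
        if member_pitches:
            mean_pitch = sum(member_pitches) / len(member_pitches)
            entries.append((mean_pitch, centroids[c]))
    entries.sort(key=lambda e: e[0])
    return entries
```

Averaging within a cluster smooths out the noise of individual inverse-filtering estimates, which is the robustness argument made on the slide.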
Innovative glottal features /2
• Example of source features: sorted glottal flows
• Examples of glottal flows
• Three flows taken from the left figure
• Each row represents a Prototypical Flow
• 32 flows per Speaker
• Bayesian Information Criterion
Feature performance /1
• Source features perform better than MFCCs on noisy audio
• On clean audio MFCCs outperform the novel features
• Babble/white noise → 6/10% EER improvement
• Less sensitive to noisy conditions
• EER loss 15% less than MFCCs when noise is added
• Fusion scores sometimes better → Source features describe information not present in MFCCs
Results from downsampled TIMIT database with additive noise
Feature performance /2
Submitted to Interspeech 2011 Conference
• The glottal features exhibit improved performance
• The improvement is more noticeable with Babble noise: the DET curve is substantially closer to the origin
• The glottal features perform best with Babble noise
• MFCCs perform best on White noise
• Resulting DET diagram: downsampled TIMIT, SNR 10 dB
Physical model inversion
Submitted to IEEE Trans. on Audio, Speech and Language Processing
• Simulation of the glottal signal by a physical model
• Another line of work: voice synthesis and coding
• A codebook has been compiled
• Pairs [activations; acoustic parameters]
• Its inversion is not unique
• Finding the optimal activation path
• Dynamic programming techniques
• A cost function is minimized
Diagram labels: muscle activations, acoustic parameters, physiological parameters, simulated glottal flow, non-linear model of the glottis, mechanical parameters
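
Since several codebook entries can match the same acoustic parameters, a dynamic-programming search can pick one candidate per frame so that the whole path has minimum cost. The sketch below assumes a cost made of a per-frame acoustic mismatch plus a squared-distance smoothness term between consecutive activations; the actual cost function is not described in the slides, and all names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two activation vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_activation_path(candidates, local_costs, transition_weight=1.0):
    """Viterbi-style search.

    candidates[t][i]  : i-th candidate activation vector at frame t
    local_costs[t][i] : acoustic mismatch of that candidate (assumed given)
    Returns the chosen candidate index per frame and the total path cost.
    """
    num_frames = len(candidates)
    cost = [list(local_costs[0])]
    back = [[None] * len(candidates[0])]
    for t in range(1, num_frames):
        row_cost, row_back = [], []
        for i, cand in enumerate(candidates[t]):
            options = [
                cost[t - 1][j] + transition_weight * sq_dist(prev, cand)
                for j, prev in enumerate(candidates[t - 1])
            ]
            j_best = min(range(len(options)), key=options.__getitem__)
            row_cost.append(options[j_best] + local_costs[t][i])
            row_back.append(j_best)
        cost.append(row_cost)
        back.append(row_back)
    # Backtrack from the cheapest final state.
    i = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    total = cost[-1][i]
    path = [i]
    for t in range(num_frames - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    path.reverse()
    return path, total
```

The transition weight trades acoustic accuracy against smooth activation trajectories: with a weight of zero each frame is simply matched independently.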
A complete Speaker Recognition System
System layout
• A number of reference Speaker Databases are supported
• Advanced configuration and assessment system
• Some modules are presented in the next slides…
• Many modules, all developed from scratch
• The system is highly customizable and flexible
• It is reasonably efficient, but is still suitable for research, being quickly reconfigurable
Voice activity detector
• Given a noisy recording, the system has to tell when someone is speaking
• VAD effectiveness is crucial for good noise estimation in Noise Reduction
• Wavelet analysis and Teager Energy Operator
• Speech Activity Envelope (SAE) and Voice Activity Shape (VAS) functions
• Based on wavelet subband energy autocorrelation and Teager Energy
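
For reference, the discrete Teager Energy Operator is ψ[n] = x[n]² − x[n−1]·x[n+1]. The sketch below implements only this operator; the wavelet decomposition and the SAE and VAS functions built on top of it are not reproduced here, and the function name is hypothetical.

```python
def teager_energy(x):
    """Discrete Teager Energy Operator, defined for the interior samples.

    The output has len(x) - 2 values, aligned with samples 1..len(x)-2.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]
```

For a pure sinusoid A·cos(ωn + φ) the operator returns the constant A²·sin²(ω), so it tracks amplitude and frequency together.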
Noise Reduction
Digital Audio Effects DAFX’09: “A spectral subtraction rule for real time DSP implementation of noise reduction in speech signals”
• An approximated Ephraim and Malah spectral subtraction rule has been proposed
• Real-time on hardware DSP
• Noisy audio records can be restored
• Adaptive estimation and subtraction of the noise spectrum
Approximated E&M formula, with lower computational load (I0, I1 are the Modified Bessel Functions)
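
For context, the exact Ephraim and Malah short-time spectral amplitude gain, as usually written in the literature, is given below. This is the standard rule, not the approximated formula proposed in this work, which the text of the slide does not reproduce. The symbols are the conventional ones: ξ is the a priori SNR, γ the a posteriori SNR, and I₀, I₁ the modified Bessel functions of order zero and one.

```latex
G(\xi, \gamma) = \frac{\sqrt{\pi}}{2}\,\frac{\sqrt{v}}{\gamma}\,
\exp\!\left(-\frac{v}{2}\right)
\left[(1 + v)\, I_0\!\left(\frac{v}{2}\right) + v\, I_1\!\left(\frac{v}{2}\right)\right],
\qquad
v = \frac{\xi}{1 + \xi}\,\gamma
```

The gain multiplies the noisy spectral amplitude in each frequency bin. Evaluating the exponential and the two Bessel functions per bin is what makes the exact rule costly on a DSP.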
Phonetic GMM
Europ. Signal Processing Conference EUSIPCO’09: “An automatic speaker recognition system for intelligence applications”
• MFCCs and GMMs model the voice as a whole
• Actually, speech is highly non-stationary
• The proposed idea is to segment the voice into acoustic classes with homogeneous sounds
• Each class is trained/tested on its own GMM
• Better GMM performance: uniform feature space
• Three and five acoustic classes studied
• Improvements on clean audio
DET diagram on clean TIMIT database. Solid line: 32 and 4 gaussians per speaker; dotted: 3 classes; dashed: 5 classes.
Conclusions
• A novel glottal feature set has been proposed
• Good performance on noisy signals, better than MFCCs
• A complete Speaker Recognition system has been implemented, without third-party modules
• Physical modelling of the glottis
• A model inversion technique has been developed
• Results have been presented at international conferences
• A paper has been submitted to IEEE Transactions on Audio, Speech and Language Processing
• A conference paper has also been submitted to Interspeech 2011
Ongoing work
• The features could be improved
• Room for enhancements in the clustering procedure
• A specific “shape distance” could be better
• Extended performance evaluations
• Source Features have to be tested on NIST SRE
• An assessment on telephone line issues is needed
• Privately held companies are interested in the work
• A conference paper about 2D Cepstrum features is in preparation with KTH researchers
Thank you, any question is welcome!
Statistical modelling /2
• How to compute the probability of H1?
• A non-target model is needed
• Among others, the most successful approach is the Universal Background Model
• A big GMM trained with speech from many speakers
• The UBM is not so Universal
• The UBM should be gender and channel specific
• Model Adaptation as an alternative training strategy
Speaker Diarization
• Speaker Diarization: given a multi-speaker recording, the system has to tell who speaks when
• Subtasks: speaker segmentation and clustering
• Segmentation based on a BIC criterion
• Each audio window is modelled using one or two Gaussians, detecting the speaker changes
• Clustering with K-Means algorithm
• Other approaches: resegmentation, hierarchical clustering, etc.
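
The BIC test for a speaker change compares one Gaussian fitted to a whole window against two Gaussians fitted to its left and right halves. The sketch below assumes diagonal covariances to keep the arithmetic simple (the slides do not say which covariance type was used), and the names `delta_bic` and `penalty_weight` are hypothetical.

```python
import math


def log_det_diag_cov(frames):
    """Log-determinant of the diagonal ML covariance of a set of frames."""
    n = len(frames)
    total = 0.0
    for col in zip(*frames):
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        total += math.log(var)
    return total


def delta_bic(frames, split, penalty_weight=1.0):
    """Positive values favour two Gaussians, i.e. a speaker change at `split`."""
    left, right = frames[:split], frames[split:]
    n, dim = len(frames), len(frames[0])
    gain = 0.5 * (
        n * log_det_diag_cov(frames)
        - len(left) * log_det_diag_cov(left)
        - len(right) * log_det_diag_cov(right)
    )
    # The second Gaussian adds one mean and one variance per dimension.
    extra_params = 2 * dim
    penalty = 0.5 * penalty_weight * extra_params * math.log(n)
    return gain - penalty
```

Sliding the split point across a window and looking for positive peaks of this value gives candidate change points, which the clustering stage then groups by speaker.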
Automatic Segmentation
• The speech is segmented into acoustic classes using the manual transcriptions of a database
• Obviously, this information is not available in real-world use
• Automatic techniques are needed for the segmentation
• Phonetic recognizers exist, but are complex and error-prone
• Development of an acoustic segmentation
• Only classes needed, not phonemes
• Based on a combination of signal features
2D Cepstrum
• Proposed as an evolution of MFCC
• Cepstrum applied to frequency and time
• Filterbank approximates the critical bands of hearing
• Compact: as in MFCC, a Discrete Cosine Transform provides coefficients de-correlation
• Theoretically motivated
• The dynamic information is included in the features by definition, without the Delta fixup
Security applications
• Motivations in the wiretapping context:
• Waste of time: authorities have to listen to a lot of useless telephone conversations
• Inefficiency: a perpetrator's voice may be missed
• A voice skimming system is proposed, which automatically detects interesting voices
• The application responds to an effective need
• State-of-the-Art technology is adequate
• System errors are not critical
Phone line
• Speaker Recognition over phone lines has a number of difficulties
• Channel gain and phase not fully characterized
• Additive noises of various types
• Narrowband voice signal 300-3400 Hz
• Possible codec artifacts (e.g. mobile phones)
• Such applications need robust techniques
• Both features and modelling refinements
• Development of novel, specific audio features