AUTOMATIC SPEAKER RECOGNITION AND CHARACTERIZATION BY MEANS OF ROBUST VOCAL SOURCE FEATURES

Ph.D. Candidate: Enrico Marchetto. Supervisor: Ph.D. F. Avanzini. XXIII Cycle. Ph.D. School of Information Engineering, Department of Information Engineering, University of Padova.

Presentation Transcript


  1. AUTOMATIC SPEAKER RECOGNITION AND CHARACTERIZATION BY MEANS OF ROBUST VOCAL SOURCE FEATURES • Ph.D. Candidate: Enrico Marchetto • Supervisor: Ph.D. F. Avanzini • XXIII Cycle • Ph.D. School of Information Engineering, Department of Information Engineering, University of Padova • Padova - April 19th, 2011

  2. Summary • Introduction • Background technologies • Innovative features • A complete Speaker Recognition System • Conclusions Dept. of Information Engineering - University of Padova

  3. Introduction

  4. My PhD Activities • Start-up research project funded by the privately held company RT Radio Trevisan Elettronica – Trieste, Italy • Ph.D. research • Visiting student at KTH – Stockholm, Nov. 2009 – Apr. 2010 • Sound and Music Computing Summer School, Porto, July 17-26, 2009 • Presentations, seminars and posters • Seminar: “Speaker recognition for security and intelligence” - KTH-TKK, Oct. 12, 2009 • Presentation: “Physical modelling of the glottis and acoustic-to-articulatory inversion”, Coll. Inform. Musicale - Torino, Oct. 5, 2010 • “An automatic speaker recognition system for intelligence applications”, 17th European Signal Processing Conference - Glasgow, Aug. 24–28, 2009

  5. My PhD Activities • “A spectral subtraction rule for real time DSP implementation of noise reduction in speech signals”, Digital Audio Effects – Como, Sept. 3, 2009 • Teaching activities • Co-supervisor of two master's theses and one bachelor's thesis • Lecturer at the First SaMPL Summer School: “Introduzione alla Voce Umana - Fisiologia, sintesi e restauro” – Padova, Sept. 24, 2010 • Projects and technology transfer • Among the winners of Start Cup Veneto 2010 • Among the founders of Bloop srl, a private spin-off • Forum della Ricerca e dell’Innovazione - Padova, May 7-29, 2009 • Involved in grant applications • Collaboration with 3D Everywhere, a Department spin-off

  6. Speaker Recognition • Automatic Speaker Verification or Detection: • Given a claimed identity and a voice, the system has to decide whether the voice and the identity match • Automatic Speaker Identification • An identity has to be assigned to a given voice • Speaker Skimming and Diarization • Various applications: • Human-computer interaction • Automatic information structuring • Access control • Security and intelligence

  7. Recognition evaluations • Recognition systems are evaluated under standard conditions • Speaker Recognition Evaluation (SRE) by NIST • It is the reference evaluation among researchers • Mobile and landline telephone audio, many handsets • Quite challenging • Some standard databases are available • SRE, Switchboard, TIMIT, … • Different levels of recognition difficulty • Steady improvements in the last 15 years • A wide and active community is working on the problem

  8. Background technologies

  9. Audio Features • Numeric characterization of voice/audio • Many types exist, both for recognition and coding • Mel-Frequency Cepstral Coefficients (MFCC) • Cepstrum with enhanced frequency resolution • Psychoacoustically motivated • High performance on clean audio • MFCCs are usually augmented with their Delta features to retain dynamics
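
The Delta augmentation mentioned on this slide can be sketched as follows. This is only a minimal sketch, assuming the common regression formula for deltas over a window of ±n frames with edge padding; the function name `add_deltas` and the window width are illustrative and are not taken from the presentation.

```python
import numpy as np

def add_deltas(feats, n=2):
    """Append regression-based delta coefficients to a (frames, coeffs) matrix.

    Assumed formula: d_t = sum_{k=1..n} k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2).
    """
    feats = np.asarray(feats, dtype=float)
    num_frames = feats.shape[0]
    # Repeat the first and last frame so every frame has n neighbours on each side.
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, n + 1))
    deltas = np.zeros_like(feats)
    for k in range(1, n + 1):
        ahead = padded[n + k : n + k + num_frames]
        behind = padded[n - k : n - k + num_frames]
        deltas += k * (ahead - behind)
    return np.hstack([feats, deltas / denom])
```

With this definition a feature that grows linearly over time gets a constant delta equal to its slope, which is the dynamic information a static MFCC vector alone cannot carry.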

  10. Statistical modelling • Many tools are used in Speaker Recognition • Most used: Gaussian Mixture Models (GMMs) • An MFCC vector is extracted from each input audio frame • Each vector is modelled as a point in a high-dimensional space • The evolution in time is lost • GMMs model the features as a whole, through their probability distribution • Advantage: no explicit modelling of the speech content is needed
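
How a GMM scores a set of feature vectors can be sketched as below. This is a minimal sketch assuming diagonal covariances and already trained parameters; the function names and the averaging over frames are illustrative choices, not details given in the slides.

```python
import numpy as np

def gmm_frame_loglik(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (K,); means and variances: (K, D).
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    diff = frames[:, None, :] - means[None, :, :]                       # (T, K, D)
    # Log-density of every Gaussian component at every frame.
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)   # (T, K)
    weighted = np.log(weights) + log_comp
    # Log-sum-exp over the components, for numerical stability.
    peak = weighted.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.sum(np.exp(weighted - peak), axis=1))

def utterance_loglik(frames, weights, means, variances):
    """Average log-likelihood; frames are treated as independent points."""
    return float(np.mean(gmm_frame_loglik(frames, weights, means, variances)))
```

Because the frames are scored independently and then averaged, shuffling them leaves the result unchanged, which is the sense in which the evolution in time is lost.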

  11. Likelihood ratio • The recognition is built on a hypothesis test • The likelihood of the hypothesized speaker (H0) model is compared with the likelihood from a background model • A score is obtained as the log-likelihood ratio, i.e. the difference between the two log-likelihoods • H0 is accepted if the score is above a threshold θ • Trade-off between Miss and False Positive errors • Ad-hoc DET diagrams show the performance • A Universal Background Model is needed: it models the impostors' voices
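
The test on this slide can be written compactly as follows. The symbols λ_hyp and λ_UBM for the two models, and the normalization by the number of frames T, are common conventions assumed here rather than notation taken from the presentation.

```latex
\Lambda(X) = \frac{1}{T}\sum_{t=1}^{T}
  \Big[\log p(x_t \mid \lambda_{\mathrm{hyp}}) - \log p(x_t \mid \lambda_{\mathrm{UBM}})\Big],
\qquad
\text{accept } H_0 \iff \Lambda(X) > \theta
```

Raising θ reduces False Positives at the price of more Miss errors, and lowering it does the opposite.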

  12. Detection Error Trade-off • These diagrams show the trade-off between False Positive and Miss errors • Trials are scored using a database key file • The scores are divided into two classes: true speakers (the hypothesized speaker is correct) and impostors • The classes are modelled as two normal distributions • The DET curve is lower when the two distributions have well-separated means → better class discrimination • Threshold θ is selected according to some optimality criterion: a Detection Error Cost Function can be defined • The main diagonal identifies the Equal Error Rate point
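
The error rates behind a DET curve, and the Equal Error Rate, can be computed from the two score classes as sketched below. This is an illustrative sketch working directly on empirical scores (not on fitted normal distributions); the function names are hypothetical.

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    """Miss and false-positive rates for every candidate threshold."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    # A trial is accepted when its score is >= the threshold.
    miss = np.array([np.mean(target < t) for t in thresholds])
    false_pos = np.array([np.mean(impostor >= t) for t in thresholds])
    return thresholds, miss, false_pos

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the threshold where the two error rates are closest."""
    thresholds, miss, false_pos = det_points(target_scores, impostor_scores)
    i = int(np.argmin(np.abs(miss - false_pos)))
    return float((miss[i] + false_pos[i]) / 2.0), float(thresholds[i])
```

Plotting `miss` against `false_pos` (usually on normal-deviate axes) gives the DET curve; the EER is where that curve crosses the main diagonal.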

  13. Innovative features

  14. The glottal signal • Source of voiced sounds • Not only vowels • The flow is information-rich • Speaker gender, age, mood and identity • High intra- and inter-speaker variability • Produced by the vocal folds • Air is forced against the closed glottis • Pressure rises and the folds open, producing an air jet • The elastic force of the folds abruptly closes the glottis

  15. Inverse vocal tract filtering • Speech as glottal flow convolved with the vocal tract response • Usually termed source-filter model • Vowels are identified by the resonances (formants) of the tract • Several inverse filtering techniques tested • Iterative Adaptive Inverse Filtering (IAIF) • Roughly estimates the glottal pulse • Estimates the vocal tract response • Refines the glottal pulse estimate
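
The basic operation behind inverse filtering under the source-filter model can be sketched as below. This is not the IAIF procedure itself: it is a single linear-prediction step, assuming the vocal tract is approximated by an all-pole filter estimated with the autocorrelation method. Function names and the model order are illustrative.

```python
import numpy as np

def lpc_autocorr(signal, order):
    """Linear-prediction coefficients a[1..p] via the autocorrelation method."""
    x = np.asarray(signal, dtype=float)
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def inverse_filter(signal, a):
    """Apply A(z) = 1 - sum_k a_k z^-k to remove the estimated resonances."""
    x = np.asarray(signal, dtype=float)
    fir = np.concatenate([[1.0], -np.asarray(a, dtype=float)])
    return np.convolve(x, fir)[: len(x)]
```

Applied to a speech frame, `inverse_filter(frame, lpc_autocorr(frame, order))` returns a residual that approximates the excitation; iterative schemes such as IAIF repeat estimation and filtering steps of this kind to separate the glottal contribution from the tract more carefully.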

  16. Innovative glottal features /1 • Source features, obtained from the glottal flow • Several IAIF runs to obtain many glottal pulses • Estimated flows are quite noisy, not always reliable • Unlikely pulses are removed • To make the estimates more robust: • K-Means clustering is applied (similar flows are grouped) • Only cluster centroids (means) are kept • Centroids are sorted by increasing pitch and stored • Feature comparisons by N-to-N distance measure • No statistical model training/testing involved so far
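
The clustering step can be sketched as follows. This is a minimal sketch under explicit assumptions: pulses are already length-normalized to the same number of samples, each pulse comes with a pitch value, and centroids are ordered by the mean pitch of their members. The function names, the deterministic initialization and the sorting key are illustrative, not the author's implementation.

```python
import numpy as np

def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    points = np.asarray(points, dtype=float)
    centroids = [points[0]]
    while len(centroids) < k:
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(dist))])
    centroids = np.array(centroids)
    for _ in range(iterations):
        dist = np.sum((points[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        labels = np.argmin(dist, axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def prototype_flows(pulses, pitches, k):
    """Cluster pulses, keep only the centroids, sort them by increasing pitch."""
    centroids, labels = kmeans(pulses, k)
    pitches = np.asarray(pitches, dtype=float)
    # Assumes every cluster has at least one member.
    mean_pitch = np.array([pitches[labels == j].mean() for j in range(k)])
    order = np.argsort(mean_pitch)
    return centroids[order], mean_pitch[order]
```

Averaging similar pulses into a centroid is what suppresses the noise of the individual estimates, at the cost of discarding within-cluster variation.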

  17. Innovative glottal features /2 • Example of source features (figure: sorted glottal flows) • Each row represents a Prototypical Flow • 32 flows per speaker • Bayesian Information Criterion • Examples of glottal flows (figure) • Three flows taken from the left figure

  18. Feature performance /1 • On noisy audio, source features perform better than MFCCs • On clean audio MFCCs outperform the novel features • Babble/white noise → 6/10% EER improvement • Less sensitive to noisy conditions • EER loss 15% less than MFCCs when noise is added • Fusion scores are sometimes better → source features describe information not present in MFCCs • Results from the downsampled TIMIT database with additive noise

  19. Feature performance /2 • The glottal features exhibit improved performance • The improvement is more noticeable with babble noise: the DET curve is substantially closer to the origin • The glottal features perform best with babble noise • MFCCs perform best with white noise • Resulting DET diagram: downsampled TIMIT, SNR 10 dB • Submitted to the Interspeech 2011 Conference

  20. Physical model inversion • Simulation of the glottal signal by a physical model • Another line of work: voice synthesis and coding • A codebook has been compiled • Pairs [muscle activations; acoustic parameters] • Its inversion is not unique • Finding the optimal path of activations • Dynamic programming techniques • A cost function is minimized • Diagram labels: Muscle activations, Acoustic parameters, Physiological parameters, Simulated glottal flow, Non-linear model of the glottis, Mechanical parameters • Submitted to IEEE Trans. on Audio, Speech and Language Processing
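
How dynamic programming can resolve a non-unique codebook inversion is sketched below. The cost function is an assumption made for illustration (squared acoustic mismatch plus a penalty on jumps between consecutive activations); the slides do not state the cost actually used, and the function name and data layout are hypothetical.

```python
import numpy as np

def best_activation_path(candidates, targets, smooth_weight=1.0):
    """Pick one codebook entry per frame by dynamic programming.

    candidates: list over frames; each item is a list of (activation, acoustic)
    pairs that are plausible for that frame.
    targets: per-frame target acoustic parameter vectors.
    Assumed cost: acoustic mismatch + smooth_weight * activation jump.
    """
    def local(frame, idx):
        acoustic = np.asarray(candidates[frame][idx][1], dtype=float)
        target = np.asarray(targets[frame], dtype=float)
        return float(np.sum((acoustic - target) ** 2))

    def jump(frame, prev_idx, idx):
        prev_act = np.asarray(candidates[frame - 1][prev_idx][0], dtype=float)
        act = np.asarray(candidates[frame][idx][0], dtype=float)
        return smooth_weight * float(np.sum((act - prev_act) ** 2))

    cost = [[local(0, i) for i in range(len(candidates[0]))]]
    back = []
    for t in range(1, len(candidates)):
        row, ptr = [], []
        for i in range(len(candidates[t])):
            options = [cost[-1][j] + jump(t, j, i)
                       for j in range(len(candidates[t - 1]))]
            j_best = int(np.argmin(options))
            row.append(options[j_best] + local(t, i))
            ptr.append(j_best)
        cost.append(row)
        back.append(ptr)
    # Trace back from the cheapest final state.
    idx = int(np.argmin(cost[-1]))
    path = [idx]
    for ptr in reversed(back):
        idx = ptr[idx]
        path.append(idx)
    path.reverse()
    return path, float(min(cost[-1]))
```

When several codebook entries give the same acoustic parameters, the transition term is what selects among them: the path with the smoothest activation trajectory wins.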

  21. A complete Speaker Recognition System

  22. System layout • A number of reference speaker databases are supported • Advanced configuration and assessment system • Many modules, all developed from scratch • Some modules are presented in the next slides • The system is highly customizable and flexible • It is reasonably efficient, yet still suitable for research, being quickly reconfigurable

  23. Voice activity detector • Given a noisy recording, the system has to tell when someone is speaking • VAD effectiveness is crucial for good noise estimation in Noise Reduction • Wavelet analysis and Teager Energy Operator • Speech Activity Envelope (SAE) and Voice Activity Shape (VAS) functions • Based on wavelet subband energy autocorrelation and Teager Energy
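
The Teager Energy Operator named on this slide has a simple discrete form, shown below as a small sketch. Only the operator itself is illustrated; how it is combined with the wavelet subbands to build the SAE and VAS functions is not described in the slides and is not reproduced here.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output is two samples shorter than the input.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

For a pure sinusoid A·cos(ωn + φ) the operator returns the constant A²·sin²(ω), so it tracks both amplitude and frequency, which makes it responsive to the onset of voiced speech.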

  24. Noise Reduction • Digital Audio Effects DAFX’09: “A spectral subtraction rule for real time DSP implementation of noise reduction in speech signals” • An approximated Ephraim and Malah spectral subtraction rule has been proposed • Real-time on hardware DSP • Noisy audio recordings can be restored • Adaptive estimation and subtraction of the noise spectrum • [Equation: approximated E&M formula, with lower computational load; I0 and I1 are the modified Bessel functions]
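
The formula shown on this slide did not survive in the transcript. For reference only, the standard (non-approximated) Ephraim and Malah short-time spectral amplitude gain, as it is usually written in the literature, is given below. The symbols ξ (a priori SNR), γ (a posteriori SNR) and v are the conventional ones and are assumed here; the approximation proposed in the DAFX’09 paper is not reproduced.

```latex
G(\xi,\gamma) = \frac{\sqrt{\pi}}{2}\,\frac{\sqrt{v}}{\gamma}\,
  \exp\!\left(-\frac{v}{2}\right)
  \left[(1+v)\,I_0\!\left(\frac{v}{2}\right) + v\,I_1\!\left(\frac{v}{2}\right)\right],
\qquad
v = \frac{\xi}{1+\xi}\,\gamma
```

The exponential and the two modified Bessel functions I0 and I1 are the costly terms to evaluate per frequency bin, which is the usual motivation for an approximation on a real-time DSP.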

  25. Phonetic GMM • Europ. Signal Processing Conference EUSIPCO’09: “An automatic speaker recognition system for intelligence applications” • MFCCs and GMMs model the voice as a whole • Actually, speech is highly non-stationary • The proposed idea is to segment the voice into acoustic classes, each containing homogeneous sounds • Each class is trained and tested with its own GMM • Better GMM performance: uniform feature space • Three and five acoustic classes studied • Improvements on clean audio • DET diagram on clean TIMIT database. Solid line: 32 and 4 Gaussians per speaker; dotted: 3 classes; dashed: 5 classes.

  26. Conclusions • A novel glottal feature set has been proposed • Good performance on noisy signals, better than MFCCs • A complete Speaker Recognition system has been implemented, without third-party modules • Physical modelling of the glottis • A model inversion technique has been developed • Results have been presented at international conferences • A paper has been submitted to IEEE Transactions on Audio, Speech and Language Processing • A conference paper has also been submitted to Interspeech 2011

  27. Ongoing work • The features could be improved • Room for enhancements in the clustering procedure • A specific “shape distance” could be better • Extended performance evaluations • Source features have to be tested on NIST SRE • An assessment of telephone line issues is needed • Privately held companies are interested in the work • A conference paper about 2D Cepstrum features is in preparation with KTH researchers

  28. Thank you • Any questions are welcome!

  29. Statistical modelling /2 • How to compute the probability of H1? • A non-target model is needed • Among others, the most successful approach is the Universal Background Model (UBM) • A large GMM trained with speech from many speakers • The UBM is not so universal • The UBM should be gender- and channel-specific • Model adaptation as an alternative training strategy

  30. Speaker Diarization • Speaker Diarization: given a recording with multiple speakers, the system has to tell who speaks when • Subtasks: speaker segmentation and clustering • Segmentation based on a BIC criterion • Each audio segment is modelled using one or two Gaussians to detect the speaker changes • Clustering with the K-Means algorithm • Other approaches: resegmentation, hierarchical clustering, etc.
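
The one-versus-two Gaussians comparison behind BIC segmentation can be sketched as below. This is a minimal sketch of the commonly used ΔBIC test with full-covariance Gaussians, evaluated at a single candidate split inside a window of feature vectors; the function name, the penalty weight and the window handling are illustrative assumptions, not details from the slides.

```python
import numpy as np

def delta_bic(window, split, penalty=1.0):
    """Delta-BIC for a speaker change at `split` inside a (frames, dims) window.

    Positive values favour two Gaussians (a change) over a single one.
    """
    x = np.asarray(window, dtype=float)
    n, d = x.shape
    left, right = x[:split], x[split:]

    def logdet_cov(segment):
        cov = np.cov(segment, rowvar=False, bias=True).reshape(d, d)
        return np.linalg.slogdet(cov)[1]

    gain = 0.5 * (n * logdet_cov(x)
                  - len(left) * logdet_cov(left)
                  - len(right) * logdet_cov(right))
    # Extra free parameters of the second Gaussian: mean plus full covariance.
    extra_params = d + d * (d + 1) / 2.0
    return gain - 0.5 * penalty * extra_params * np.log(n)
```

In a segmenter, the split is slid along the window and a speaker change is hypothesized where ΔBIC reaches a positive maximum.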

  31. Automatic Segmentation • The speech is segmented into acoustic classes using the manual transcriptions of a database • Such transcriptions are obviously not available in real-world use • Automatic techniques are needed for the segmentation • Phonetic recognizers exist, but are complex and error-prone • Development of an acoustic segmentation • Only classes are needed, not phonemes • Based on a combination of signal features

  32. 2D Cepstrum • Proposed as an evolution of MFCC • Cepstrum applied to frequency and time • Filterbank approximates the critical bands of hearing • Compact: as in MFCC, a Discrete Cosine Transform decorrelates the coefficients • Theoretically motivated • The dynamic information is included in the features by definition, without the Delta fixup

  33. Security applications • Motivations in the wiretapping context: • Waste of time: authorities have to listen to a lot of useless telephone conversations • Inefficiency: a perpetrator's voice may be missed • A voice skimming system is proposed, which automatically detects interesting voices • The application responds to an actual need • State-of-the-art technology is adequate • System errors are not critical

  34. Phone line • Speaker Recognition over phone lines has a number of difficulties • Channel gain and phase not fully characterized • Additive noises of various types • Narrowband voice signal, 300-3400 Hz • Possible codec artifacts (e.g. mobile phones) • Such applications need robust techniques • Both feature and modelling refinements • Development of novel, specific audio features
