
EE225D Final Project



Presentation Transcript


  1. Text-Constrained Speaker Recognition Using Hidden Markov Models Kofi A. Boakye

  2. Introduction • Speaker recognition problem: determine whether a spoken segment was produced by the putative target speaker • The method of solution requires two parts: • Training • Testing • Similar to speech recognition, though what was noise (inter-speaker variability) is now the signal

  3. Introduction • Also like speech recognition, different domains exist • Two major divisions: • Text-dependent/text-constrained • Text-independent • Text-dependent systems can achieve high performance because of input constraints • More of the acoustic variation is speaker-distinctive

  4. Introduction Question: Is it possible to capitalize on the advantages of text-dependent systems in text-independent domains? Answer: Yes!

  5. Introduction Idea: Limit the words of interest to a select group • Words should have high frequency in the domain • Words should have high speaker-discriminative quality What kinds of words match these criteria for conversational speech? 1) Discourse markers (um, uh, like, …) 2) Backchannels (yeah, right, uhhuh, …) These words are fairly spontaneous and represent an “involuntary speaking style” (Heck, WS2002)

  6. Design Likelihood ratio detector: Λ = p(X|S) / p(X|UBM) The task is a detection problem, so a likelihood ratio detector is used (in implementation, the log-likelihood is used) Features are extracted from the signal and scored against both the speaker model and the background model: accept if Λ > Θ, reject if Λ < Θ
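The accept/reject rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the single 1-D Gaussian per model and the names `log_gauss`, `llr_score`, and `detect` are stand-ins for the real HMM and UBM scorers.

```python
import math

def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian -- a stand-in for a real acoustic model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def llr_score(frames, target, ubm):
    """Accumulated log-likelihood ratio: sum of log p(x|S) - log p(x|UBM) over frames."""
    return sum(log_gauss(x, *target) - log_gauss(x, *ubm) for x in frames)

def detect(frames, target, ubm, theta=0.0):
    """Accept the target hypothesis iff the log-LR exceeds the threshold theta."""
    return llr_score(frames, target, ubm) > theta
```

Frames drawn near the target model's mean drive the score above the threshold; frames far from it drive the score below, triggering a reject.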

  7. Design • State-of-the-art systems use Gaussian Mixture Models (GMMs) • The speaker’s acoustic space is represented by a many-component mixture of Gaussians • Gives very good performance, but… (Figure: example Gaussian mixtures for speakers 1, 2, and 3)
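As a rough sketch of how such a mixture model scores an utterance (a 1-D toy mixture with a hand-rolled log-sum-exp; the names `gmm_logpdf` and `score_utterance` are illustrative, not from the project): each frame is scored independently against the mixture and the per-frame log-likelihoods are summed, so the order of frames has no effect on the score.

```python
import math

def gmm_logpdf(x, weights, means, vars_):
    """Log-density of a 1-D Gaussian mixture, via log-sum-exp over components."""
    comps = [math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
             for w, m, v in zip(weights, means, vars_)]
    mx = max(comps)
    return mx + math.log(sum(math.exp(c - mx) for c in comps))

def score_utterance(frames, gmm):
    """'Bag of frames': each frame scored independently, sequence order ignored."""
    return sum(gmm_logpdf(x, *gmm) for x in frames)
```

Permuting the frames leaves the score unchanged, which is exactly the "bag-of-frames" property that motivates trying HMMs instead.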

  8. Design • Concern: GMMs use a “bag-of-frames” approach • Frames are assumed to be independent • Sequential information is not really utilized • Alternative: use HMMs • Do the likelihood test on the output of the recognizer, which is an accumulated log-probability score • A text-independent system has been analyzed (Weber et al., Dragon Systems) • Let’s try a text-dependent one!

  9. System Word-level HMM-UBM detectors: the signal passes through a word extractor, each extracted word is scored by its word-specific detector (HMM-UBM 1 through HMM-UBM N), and the scores are combined into Λ Topology: left-right HMM with self-loops and no skips 4 components per state Number of states is related to the number of phones and the median number of frames for the word
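A transition matrix for the topology described here (left-right, self-loops, no skips) might be built as follows. This is a sketch: the self-loop probability `p_stay=0.6` is an assumed value, not taken from the slides, and real HTK models would carry entry/exit states as well.

```python
def left_right_transitions(n_states, p_stay=0.6):
    """Left-to-right HMM transitions: each state may self-loop or advance
    to the next state only (no skips); the final state is absorbing."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i < n_states - 1:
            A[i][i] = p_stay          # self-loop
            A[i][i + 1] = 1.0 - p_stay  # advance to next state
        else:
            A[i][i] = 1.0             # final state absorbs
    return A
```

Each row is a valid distribution, and the zero entries above the first superdiagonal encode the "no skips" constraint.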

  10. System HMMs implemented using the HMM Toolkit (HTK), a toolkit used for speech recognition Input features were 12 cepstra, first differences, and the zeroth-order cepstrum (energy parameter) Adaptation: means were adapted using Maximum A Posteriori (MAP) adaptation In cases of no adaptation data, the UBM was used, so the LLR score cancels
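MAP adaptation of a Gaussian mean can be sketched under the standard relevance-factor formulation; the relevance factor `r=16` and the 1-D single-Gaussian setting are illustrative assumptions, not details from the slides. Note the fallback: with no adaptation data the adapted mean equals the UBM mean, so the LLR cancels, as the slide states.

```python
def map_adapt_mean(ubm_mean, frames, posteriors, r=16.0):
    """MAP mean update: interpolate between the UBM mean and the data mean,
    weighted by the soft frame count n (r is the relevance factor)."""
    n = sum(posteriors)
    if n == 0:
        return ubm_mean  # no adaptation data: fall back to the UBM
    ex = sum(p * x for p, x in zip(posteriors, frames)) / n  # posterior-weighted data mean
    alpha = n / (n + r)  # more data -> trust the data mean more
    return alpha * ex + (1.0 - alpha) * ubm_mean
```

With abundant adaptation data `alpha` approaches 1 and the adapted mean approaches the data mean; with little data it stays close to the UBM.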

  11. Word Selection 13 words: Discourse markers: {actually, anyway, like, see, well, now, um, uh} Backchannels: {yeah, yep, okay, uhhuh, right}

  12. Recognition Task NIST Extended Data Evaluation: training on 1, 2, 4, 8, and 16 complete conversation sides and testing on one side (side duration ~2.5 min) Uses the Switchboard I corpus (conversational telephone speech) Cross-validation method in which the data is partitioned: test on one partition; use the others for background models and normalization For this project, splits 4–6 were used for the background and split 1 for testing, with 8-conversation training

  13. Scoring Target score: output of the adapted HMM on a forced-alignment recognition of the word, using true transcripts and the SRI recognizer UBM score: output of the non-adapted HMM on the same forced alignment Frame normalization: LLR score normalized by the number of frames Word normalization: average of the word-level frame normalizations N-best normalization: frame normalization over the n best-matching (i.e., highest log-probability) words
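The three normalizations could be sketched as below, assuming each word detector yields an `(llr, n_frames)` pair. The exact definitions are inferred from the slide's descriptions (the original equations did not survive extraction), so this is a plausible reading rather than the project's actual code.

```python
def frame_norm(scores):
    """Pool all words' (llr, n_frames) pairs and divide by total frame count."""
    total_llr = sum(llr for llr, _ in scores)
    total_frames = sum(n for _, n in scores)
    return total_llr / total_frames

def word_norm(scores):
    """Average of the per-word frame-normalized scores."""
    return sum(llr / n for llr, n in scores) / len(scores)

def nbest_norm(scores, n):
    """Frame normalization restricted to the n best-matching words,
    ranked by per-frame log-probability ratio."""
    best = sorted(scores, key=lambda s: s[0] / s[1], reverse=True)[:n]
    return frame_norm(best)
```

When every word has the same per-frame score the three normalizations coincide; they diverge when word lengths and match qualities vary.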

  14. Results Observations: 1) Frame normalization and word normalization perform equally 2) The EER of n-best normalization decreases with increasing n, suggesting a benefit from an increase in data

  15. Results Comparable results: the text-dependent GMM system of Sturim et al. yielded an EER of 1.3%, using a larger word pool and channel normalization

  16. Results Observations: 1) EERs for most words lie in a small range around 7%, indicating that the words, as a group, share some qualities; the last two may differ greatly partly because of data scarcity 2) The best single word (“yeah”) yielded an EER of 4.63%, compared with 2.87% for all words combined

  17. Conclusions Well-performing text-dependent speaker recognition in an unconstrained speech domain is feasible The benefit of sequential information applied in this fashion is unclear: the system can compete with a GMM, but can it be superior?

  18. Future Work - Channel normalization - Examine the influence of word context (e.g., “well” as a discourse marker vs. as an adverb) - Revise the word list

  19. Acknowledgements - Barbara Peskin - Chuck Wooters - Yang Liu
