Text-Constrained Speaker Recognition Using Hidden Markov Models
Kofi A. Boakye
EE225D Final Project
Introduction
• Speaker Recognition Problem: Determine whether a spoken segment was produced by the putative target speaker
• Method of solution requires two parts:
  • Training
  • Testing
• Similar to speech recognition, though the noise (inter-speaker variability) is now the signal
Introduction
• Also like speech recognition, different domains exist
• Two major divisions:
  • Text-dependent / text-constrained
  • Text-independent
• Text-dependent systems can achieve high performance because of input constraints
  • More of the acoustic variation is speaker-distinctive
Introduction
Question: Is it possible to capitalize on the advantages of text-dependent systems in text-independent domains?
Answer: Yes!
Introduction
Idea: Limit the words of interest to a select group
- Words should have high frequency in the domain
- Words should have high speaker-discriminative quality
What kinds of words match these criteria for conversational speech?
1) Discourse markers (um, uh, like, …)
2) Backchannels (yeah, right, uhhuh, …)
These words are fairly spontaneous and represent an "involuntary speaking style" (Heck, WS2002)
Design
Likelihood Ratio Detector: Λ = p(X|S) / p(X|UBM)
The task is a detection problem, so use a likelihood ratio detector
- In implementation, the log-likelihood is used
Decision rule: accept if Λ > Θ, reject if Λ < Θ
[Diagram: signal → feature extraction → speaker model and background model → Λ]
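The decision rule above can be sketched in a few lines. This is a minimal illustration, not the project's actual scoring code: the function names and the threshold default are hypothetical, and the per-frame log-likelihood arrays are assumed to come from the speaker and background models.

```python
import numpy as np

def llr_decision(frame_ll_speaker, frame_ll_ubm, theta=0.0):
    """Accumulate per-frame log-likelihoods and apply the threshold test.

    frame_ll_speaker / frame_ll_ubm: per-frame log p(x_t | model) arrays
    for the speaker model S and the background model (UBM).
    theta is the decision threshold Θ (a tuning parameter).
    """
    # log Λ = log p(X|S) - log p(X|UBM), summed over frames
    llr = float(np.sum(frame_ll_speaker) - np.sum(frame_ll_ubm))
    decision = "accept" if llr > theta else "reject"
    return decision, llr
```

Working in the log domain avoids the numerical underflow that multiplying many small frame probabilities would cause.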
Design
• State-of-the-art systems use Gaussian Mixture Models (GMMs)
• A speaker's acoustic space is represented by a many-component mixture of Gaussians
• Gives very good performance, but…
[Figure: acoustic-space plots for speakers 1, 2, and 3]
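GMM scoring of a single frame can be sketched as follows. This is an illustrative diagonal-covariance version, not the system described on the slides, which used HMMs; it only shows how the "bag-of-frames" per-frame likelihood is computed.

```python
import numpy as np

def gmm_frame_loglik(x, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM.

    x: (D,) feature vector; weights: (M,) mixture weights summing to 1;
    means, variances: (M, D) per-component parameters.
    """
    # Per-component log N(x; mu_m, diag(var_m))
    log_norm = -0.5 * (np.log(2 * np.pi * variances)
                       + (x - means) ** 2 / variances).sum(axis=1)
    # Log-sum-exp over components for numerical stability
    return float(np.logaddexp.reduce(np.log(weights) + log_norm))
```

The utterance-level score is then just the sum of `gmm_frame_loglik` over all frames, which is exactly the frame-independence assumption the next slide questions.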
Design
• Concern: GMMs utilize a "bag-of-frames" approach
  • Frames are assumed to be independent
  • Sequential information is not really utilized
• Alternative: Use HMMs
  • Do the likelihood test on the output of the recognizer, which is an accumulated log-probability score
  • A text-independent system has been analyzed (Weber et al. from Dragon Systems)
  • Let's try a text-dependent one!
System
Word-level HMM-UBM detectors
[Diagram: signal → word extractor → HMM-UBM 1 … HMM-UBM N → combination → Λ]
Topology: Left-right HMM with self-loops and no skips
- 4 components per state
- Number of states related to the number of phones and the median number of frames for the word
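The transition structure of such a topology can be sketched as a matrix. This is a generic illustration of "left-right with self-loops and no skips"; the self-loop probability here is an arbitrary placeholder, not a value from the slides (HTK estimates these during training).

```python
import numpy as np

def left_right_transitions(num_states, p_stay=0.5):
    """Transition matrix for a left-right HMM with self-loops and no skips.

    Each state either loops on itself (p_stay) or advances exactly one
    state; no skip transitions exist. The final state only self-loops.
    """
    A = np.zeros((num_states, num_states))
    for i in range(num_states - 1):
        A[i, i] = p_stay          # self-loop
        A[i, i + 1] = 1.0 - p_stay  # advance one state
    A[-1, -1] = 1.0               # absorbing final state
    return A
```

Because skips are disallowed, every zero above the first superdiagonal stays zero, which forces each state (roughly a phone segment) to be visited in order.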
System
HMMs implemented using the HMM Toolkit (HTK)
- Used for speech recognition
Input features were 12 cepstra, first differences, and the zeroth-order cepstrum (energy parameter)
Adaptation: Means were adapted using Maximum A Posteriori (MAP) adaptation
In cases of no adaptation data, the UBM was used
- The LLR score cancels
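MAP mean adaptation can be sketched in the standard relevance-factor form. This is a generic illustration, not HTK's implementation: the function name and the relevance factor `tau` are assumptions (tau = 16 is a conventional choice in the speaker-recognition literature, not a value from the slides).

```python
import numpy as np

def map_adapt_means(ubm_means, data, posteriors, tau=16.0):
    """MAP adaptation of Gaussian means (relevance-factor form).

    ubm_means: (M, D) prior means; data: (T, D) adaptation frames;
    posteriors: (T, M) component occupancies gamma_t(m).
    Components with little data stay close to the UBM prior.
    """
    n = posteriors.sum(axis=0)            # soft frame count per component
    ex = posteriors.T @ data              # (M, D) occupancy-weighted sums
    safe_n = np.where(n > 0, n, 1.0)      # guard divide-by-zero
    ml_means = ex / safe_n[:, None]       # ML estimate from adaptation data
    alpha = n / (n + tau)                 # data-dependent interpolation weight
    return alpha[:, None] * ml_means + (1.0 - alpha)[:, None] * ubm_means
```

With no adaptation data, `n = 0` gives `alpha = 0` and the UBM means are returned unchanged, which is why the LLR score cancels in that case.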
Word Selection
13 words:
Discourse markers: {actually, anyway, like, see, well, now, um, uh}
Backchannels: {yeah, yep, okay, uhhuh, right}
Recognition Task
NIST Extended Data Evaluation: Training on 1, 2, 4, 8, and 16 complete conversation sides and testing on one side (side duration ~2.5 min)
Uses the Switchboard I corpus
- Conversational telephone speech
Cross-validation method where the data is partitioned
- Test on one partition; use the others for background models and normalization
For this project, used splits 4-6 for background and split 1 for testing, with 8-conversation training
Scoring
Target score: output of the adapted HMM on a forced-alignment recognition of the word, using true transcripts and the SRI recognizer
UBM score: output of the non-adapted HMM on the same forced alignment
Frame normalization
Word normalization: average of word-level frame normalizations
N-best normalization: frame normalization on the n best-matching (i.e., highest log-probability) words
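The three normalizations can be sketched as follows. The frame-normalization formula itself appeared as an image on the original slide and is not reproduced there; the version below assumes the common form, dividing the target-minus-UBM log-likelihood difference by the frame count, and all function names are hypothetical.

```python
import numpy as np

def frame_norm(target_ll, ubm_ll, num_frames):
    """Assumed frame normalization: per-frame average of the LLR.

    Dividing by the frame count makes scores comparable across
    words and utterances of different durations.
    """
    return (target_ll - ubm_ll) / num_frames

def word_norm(word_scores):
    """Word normalization: average of word-level frame-normalized scores."""
    return float(np.mean(word_scores))

def nbest_norm(word_scores, n):
    """N-best normalization: average over the n highest-scoring words."""
    top = sorted(word_scores, reverse=True)[:n]
    return float(np.mean(top))
```

Under this reading, word normalization weights every word equally regardless of how often it occurred, while n-best keeps only the words that match the target model best.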
Results
Observations:
1) Frame norm and word norm perform equivalently
2) The EER of n-best normalization decreases with increasing n
- Suggests benefit from an increase in data
Results
Comparable results: the Sturim et al. text-dependent GMM system yielded an EER of 1.3%
- Larger word pool
- Channel normalization
Results
Observations:
1) EERs for most words lie in a small range around 7%
- Indicates the words, as a group, share some qualities
- The last two may differ greatly, partly because of data scarcity
2) The best single word ("yeah") yielded an EER of 4.63%, compared with 2.87% for all words combined
Conclusions
Well-performing text-dependent speaker recognition in an unconstrained speech domain is feasible
The benefit of sequential information applied in this fashion is unclear
- It can compete with GMMs, but can it be superior?
Future Work
- Channel normalization
- Examine the influence of word context (e.g., "well" as discourse marker vs. as adverb)
- Revise the word list
Acknowledgements
- Barbara Peskin
- Chuck Wooters
- Yang Liu