Learn the basics of speaker recognition and brainstorm a coherent project plan for VC funding. Explore applications, challenges, features, and statistical tools in speech signal analysis. Dive into speaker verification tasks, models, and system design considerations. Gain insights into the dynamic field of biometric voice authentication.
An Intro to Speaker Recognition • Nikki Mirghafori • Acknowledgment: some slides borrowed from the Heck & Reynolds tutorial, and A. Stolcke.
Today’s class • Interactive • Measures of success for today: • You talk at least as much as I do • You learn and remember the basics • You feel you can do this stuff • We all have fun with the material!
A 10-minute “Project Design” • You are experts with different backgrounds. Your previous startup companies were wildly successful. A large VC firm in the valley wants to fund YOUR next creation, as long as the project is in speaker recognition. • The VC funding is yours, if you come up with some kind of a coherent plan/list of issues: • What is your proposed application? • What will be the sources of error and variability, i.e., technology challenges? • What types of features will you use? • What sorts of statistical modeling tools/techniques? • What will be your data needs? • Any other issues you can think of along your path?
Extracting Information from Speech • Goal: automatically extract information transmitted in the speech signal • [Diagram: speech signal → speech recognition → words (“How are you?”); speech signal → language recognition → language name (English); speech signal → speaker recognition → speaker name (James Wilson)] • What’s noise? What’s signal? • Orthogonal in many ways • Use many of the same models and tools
Speaker Recognition Applications • Access control • Physical facilities • Data and data networks • Transaction authentication • Telephone credit card purchases • Bank wire transfers • Fraud detection • Monitoring • Remote time and attendance logging • Home parole verification • Information retrieval • Customer information for call centers • Audio indexing (speech skimming device) • Personalization • Forensics • Voice sample matching
Tasks • Identification vs. verification • Closed set vs. open set identification • Also, segmentation, clustering, tracking...
Closed-set Speaker Identification • Whose voice is it? • [Diagram: test speech + speaker model database → identification]
Open-set Speaker Identification • Whose voice is it? • [Diagram: test speech + speaker model database → identification, with “none of the above” as a possible answer]
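The difference between the closed-set and open-set tasks can be sketched in a few lines. This is a minimal illustration, not a method from the slides: the function name `identify`, the speaker names, the scores, and the threshold value are all hypothetical, and the scores are assumed to be "higher is better" match scores.

```python
def identify(scores, open_set_threshold=None):
    """Pick the best-matching enrolled speaker.

    scores: dict mapping speaker name -> match score (higher = better).
    open_set_threshold: None for closed-set identification; otherwise the
    best score must reach this value, or the answer is "none of the above"
    (returned as None).
    """
    best_speaker = max(scores, key=scores.get)
    if open_set_threshold is not None and scores[best_speaker] < open_set_threshold:
        return None  # open-set: the voice matches no enrolled speaker well enough
    return best_speaker


# Hypothetical match scores of one test utterance against three speaker models.
scores = {"alice": -41.2, "bob": -37.5, "carol": -44.0}

closed_set_answer = identify(scores)                            # always names someone
open_set_answer = identify(scores, open_set_threshold=-30.0)    # may reject everyone
```

Closed-set identification always returns an enrolled speaker, so its difficulty grows with the number of models; the open-set variant adds a verification-style threshold on top.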
Verification/Authentication/Detection • Does the voice match? • [Diagram: claimed identity (“It’s me!”) + test speech + speaker model database → verification → yes/no] • Verification requires claimant ID
Speech Modalities • Text-dependent recognition • Recognition system knows text spoken by person • Examples: fixed phrase, prompted phrase • Used for applications with strong control over user input • Knowledge of spoken text can improve system performance • Text-independent recognition • Recognition system does not know text spoken by person • Examples: User selected phrase, conversational speech • Used for applications with less control over user input • More flexible system but also more difficult problem • Speech recognition can provide knowledge of spoken text • Text-Constrained recognition. Exercise for the reader.
Text-constrained Recognition • Basic idea: build speaker models for words rich in speaker information • Example: • “What time did you say? um... okay, I_think that’s a good plan.” • Text-dependent strategy in a text-independent context
Voice as a biometric • Biometric: a human-generated signal or attribute for authenticating a person’s identity • Voice is a popular biometric: • natural signal to produce • does not require a specialized input device • ubiquitous: telephones and microphone-equipped PCs • Strongest security: voice biometric combined with other forms of security • Something you have - e.g., badge • Something you know - e.g., password • Something you are - e.g., voice
How to build a system? • Feature choices: • low level (MFCC, PLP, LPC, F0, ...) and high level (words, phones, prosody, ...) • Types of models: • HMM, GMM, Support Vector Machines (SVM), DTW, Nearest Neighbor, Neural Nets • Making decisions: Log Likelihood Thresholds, threshold setting for desired operating point • Other issues: normalization (znorm, tnorm), optimal data selection to match expected conditions, channel variability, noise, etc.
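Score normalization such as znorm can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and the example scores are hypothetical. For znorm, the normalization statistics are assumed to come from scoring impostor utterances against the target speaker's model; tnorm uses the same arithmetic, but the statistics come from scoring the test utterance against a cohort of impostor models.

```python
import statistics


def normalize_score(raw_score, impostor_scores):
    """Shift and scale a raw score by impostor score statistics.

    znorm: impostor_scores = impostor utterances scored against the target model.
    tnorm: impostor_scores = the test utterance scored against cohort models.
    """
    mean = statistics.mean(impostor_scores)
    std = statistics.pstdev(impostor_scores)
    if std == 0.0:
        raise ValueError("impostor scores have zero spread; cannot normalize")
    return (raw_score - mean) / std


# Hypothetical impostor scores for one target model, and one raw trial score.
impostor_scores = [-1.0, 0.0, 1.0, 2.0, 3.0]
normalized = normalize_score(3.0, impostor_scores)
```

After normalization, scores from different speaker models sit on a comparable scale, which makes a single global decision threshold more meaningful.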
Verification Performance • There are many factors to consider in design of an evaluation of a speaker verification system • Speech quality • Channel and microphone characteristics • Noise level and type • Variability between enrollment and verification speech • Speech modality • Fixed/prompted/user-selected phrases • Free text • Speech duration • Duration and number of sessions of enrollment and verification speech • Speaker population • Size and composition • Most importantly: the evaluation data and design should match the target application domain of interest
Verification Performance • [Figure: performance curves, probability of false reject (in %) vs. probability of false accept (in %), for four conditions along an arrow labeled “Increasing constraints”] • Text-independent (read sentences): military radio data, multiple radios & microphones, moderate amount of training data • Text-independent (conversational): telephone data, multiple microphones, moderate amount of training data • Text-dependent (digit strings): telephone data, multiple microphones, small amount of training data • Text-dependent (combinations): clean data, single microphone, large amount of train/test speech
Verification Performance • Example performance curve [Figure: probability of false reject (in %) vs. probability of false accept (in %)] • Application operating point depends on relative costs of the two error types • High security, e.g., wire transfer: false acceptance is very costly; users may tolerate rejections for security • Balance: Equal Error Rate (EER) = 1% • High convenience, e.g., customization: false rejections alienate customers; any customization is beneficial
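The two error types and the equal error rate can be computed from lists of trial scores. This is a minimal sketch, assuming "accept if score > threshold"; the function names and the score lists are made up for illustration, and the EER is approximated by sweeping the threshold over the observed scores rather than by interpolation.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (false reject rate, false accept rate) at one threshold."""
    false_rejects = sum(1 for s in target_scores if s <= threshold)
    false_accepts = sum(1 for s in impostor_scores if s > threshold)
    return false_rejects / len(target_scores), false_accepts / len(impostor_scores)


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER: the point where the two error rates are closest."""
    best = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        frr, far = error_rates(target_scores, impostor_scores, threshold)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, threshold)
    return best[1], best[2]


# Hypothetical scores from true-speaker trials and impostor trials.
target_scores = [2.1, 1.7, 0.9, 1.4, 0.2, 2.5, 1.1, 1.9]
impostor_scores = [-1.2, 0.3, -0.4, 1.0, -2.0, -0.7, 0.1, -1.5]

eer, eer_threshold = equal_error_rate(target_scores, impostor_scores)
```

Raising the threshold above `eer_threshold` moves the operating point toward high security (fewer false accepts); lowering it moves toward high convenience (fewer false rejects).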
Human vs. Machine • Motivation for comparing human to machine • Evaluating speech coders and potential forensic applications • Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000) • Same amount of training data • Matched handset-type tests • Mismatched handset-type tests • Used 3-sec conversational utterances from telephone speech • [Figure: error rates for humans vs. machines; labels: “Humans 44% better”, “Humans 15% worse”]
Features • Desirable attributes of features for an automatic system (Wolf ‘72) • Practical: • Occur naturally and frequently in speech • Easily measurable • Robust: • Not change over time or be affected by speaker’s health • Not be affected by reasonable background noise nor depend on specific transmission characteristics • Secure: • Not be subject to mimicry • No feature has all these attributes
Training & Test Phases • Enrollment phase: training speech for each speaker → feature extraction → model training → model for each speaker • Recognition phase (e.g., verification): claim (“It’s me!”) + test speech → feature extraction → verification decision → accepted or rejected
Decision making • Verification decision approaches have roots in signal detection theory • 2-class hypothesis test: • H0: the speaker is an impostor • H1: the speaker is indeed the claimed speaker • Statistic computed on test utterance S as likelihood ratio: $\Lambda = \log \frac{\text{likelihood that } S \text{ came from the speaker model}}{\text{likelihood that } S \text{ did not come from the speaker model}}$ • Decision: accept if $\Lambda > \theta$, reject if $\Lambda < \theta$ • [Diagram: S → feature extraction → speaker model (+) and impostor model (−) → $\Lambda$ → decision]
Decision making • Identification: pick model (of N) with best score • Verification: usual approach is via likelihood ratio tests, hypothesis testing, i.e.: • By Bayes: • P(target|x)/P(nontarget|x) = [P(x|target)P(target)] / [P(x|nontarget)P(nontarget)] • accept if > threshold, reject otherwise • Can’t sum over all non-target talkers: for speaker verification (SV), that is the whole world! • Use “cohorts” (collection of impostors) • Train a “universal”/“world”/“background” model (speaker-independent, trained on many speakers)
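The likelihood ratio test with a background model can be sketched as below. To keep the example short, each model is a single diagonal-covariance Gaussian, which is a simplifying assumption (a real system would typically use a mixture); the model parameters, frames, function names, and the threshold of 0 are all hypothetical.

```python
import math


def avg_log_likelihood(frames, mean, var):
    """Average per-frame log likelihood under a diagonal-covariance Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def verify(frames, speaker_model, background_model, threshold=0.0):
    """Return (log likelihood ratio, accept decision) for a claimed speaker."""
    llr = (avg_log_likelihood(frames, *speaker_model)
           - avg_log_likelihood(frames, *background_model))
    return llr, llr > threshold


# Hypothetical 2-dimensional models given as (mean, variance) pairs.
speaker_model = ([1.0, -0.5], [0.5, 0.5])
background_model = ([0.0, 0.0], [2.0, 2.0])

claimant_frames = [[0.9, -0.4], [1.2, -0.7], [0.8, -0.5]]
impostor_frames = [[-2.0, 2.0], [-1.8, 2.2], [-2.1, 1.9]]

claimant_llr, claimant_accepted = verify(claimant_frames, speaker_model, background_model)
impostor_llr, impostor_accepted = verify(impostor_frames, speaker_model, background_model)
```

The background model stands in for "everyone who is not the target", which avoids having to model every possible non-target talker individually.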
Spectral Based Approach • Traditional speaker recognition systems use • Cepstral features • Gaussian Mixture Models (GMMs) • Feature extraction: sliding window → Fourier transform → magnitude → log → cosine transform • [Diagram: feature extraction → background model and speaker model (adapted from the background model) → log likelihood ratio] • D.A. Reynolds, T.F. Quatieri, R.B. Dunn. “Speaker Verification using Adapted Gaussian Mixture Models,” Digital Signal Processing, 10(1-3), January/April/July 2000
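The feature extraction chain on this slide (sliding window, Fourier transform, magnitude, log, cosine transform) can be sketched with the standard library alone. This is a rough illustration, not a production front end: it omits the mel filterbank used for MFCCs, uses a naive DFT that is only practical for short frames, and the frame length, hop size, number of coefficients, and test tone are arbitrary assumptions.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Sliding window: split the signal into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Hamming-window the frame, then take the magnitude of a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def dct_ii(values, num_coeffs):
    """Cosine transform (unnormalized DCT-II), keeping the first coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]


def cepstral_features(signal, frame_len=64, hop=32, num_coeffs=8):
    """One cepstral vector per frame: window -> DFT -> magnitude -> log -> DCT."""
    features = []
    for frame in frame_signal(signal, frame_len, hop):
        log_mag = [math.log(m + 1e-10) for m in magnitude_spectrum(frame)]
        features.append(dct_ii(log_mag, num_coeffs))
    return features


# Hypothetical input: 256 samples of a 440 Hz tone sampled at 8 kHz.
signal = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(256)]
features = cepstral_features(signal)
```

Each resulting vector would be one observation for the speaker and background models.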
Features: Levels of Information • Hierarchy of perceptual cues, from high-level cues (learned behaviors) down to low-level cues (physical characteristics): • Semantic • Dialogic • Idiolectal • Phonetic • Prosodic • Spectral
Low level features • Speech production model: source-filter interaction • Anatomical structure (vocal tract/glottis) conveyed in speech spectrum • [Diagram: glottal pulses → vocal tract → speech signal]
Word N-gram Features Idea (Doddington 2001): • Word usage can be idiosyncratic to a speaker • Model speakers by relative frequencies of word N-grams • Reflects vocabulary AND grammar • Cf. similar approaches for authorship and plagiarism detection on text documents. • First (unpublished) use in speaker recognition: Heck et al. (1998) Implementation: • Get 1-best word recognition output • Extract N-gram frequencies • Model likelihood ratio OR • Model frequency vectors by SVM
Phone N-gram features • Model the pattern of phone usage or “short term pronunciation” for a speaker • [Diagram: open-loop phone recognition → labeled feature vectors, e.g., [+ 0.0254 0.0068 0.0198], [- 0.0001 0.8827 0.7264], [- 0.0329 0.2847 0.2983] → Support Vector Machine (SVM) → score]
MLLR transform vectors as features • [Diagram: for phone class A and phone class B, a transform between the speaker-independent and speaker-dependent models] • MLLR transforms = features
Models • HMMs: • text-dependent (could use whole word/phone model) • prompted (phone models) • text-independent (use LVCSR) -- or GMMs! • templates: DTW (if text-dependent system) • nearest neighbor: frame level, training data as “model”, non-parametric • neural nets: train explicitly discriminating models • SVMs
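Of the model types listed, template matching with DTW is the simplest to show in full. The sketch below is a textbook dynamic time warping distance between two sequences of feature frames, with Euclidean frame distance and the basic three-way step pattern; the function names and the one-dimensional toy sequences are assumptions for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(template, test):
    """Accumulated cost of the best time alignment between two sequences."""
    inf = float("inf")
    n, m = len(template), len(test)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(template[i - 1], test[j - 1])
            # Allowed moves: advance in the template, in the test, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical enrolled template and two test utterances (1-D frames).
template = [[0.0], [1.0], [2.0], [1.0]]
same_phrase_slower = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]
different_phrase = [[2.0], [2.0], [0.0], [0.0]]

match_cost = dtw_distance(template, same_phrase_slower)
mismatch_cost = dtw_distance(template, different_phrase)
```

Because the alignment absorbs differences in speaking rate, the slower rendition of the same pattern matches the template with zero cost, while the different pattern does not.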
Speaker Models -- HMM • Speaker models (voiceprints) represent voice biometric in compact and generalizable form • Modern speaker verification systems use Hidden Markov Models (HMMs) • HMMs are statistical models of how a speaker produces sounds • HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states. • Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties. • [Figure: example HMM for the phone sequence h-a-d]
Speaker Models – HMM/GMM • Form of HMM depends on the application • Fixed phrase: word/phrase models, e.g., “Open sesame” • Prompted phrases/passwords: phoneme models, e.g., /s/ /i/ /x/ • Text-independent (general speech): single-state HMM
Word N-gram Modeling: Likelihood Ratios • Model N-gram token log likelihood ratio • Numerator: speaker language model estimated from enrollment data • Denominator: background language model estimated from large speaker population • Normalize by token count • Choose all reasonably frequent bigrams or trigrams, or a weighted combination of both
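The recipe above can be sketched for bigrams as follows. This is an illustrative sketch only: the add-one smoothing, the tiny bigram inventory, the function names, and the toy transcripts are assumptions, not details from the slides.

```python
import math
from collections import Counter


def bigrams(words):
    """All adjacent word pairs in a transcript."""
    return list(zip(words, words[1:]))


def train_bigram_model(words, inventory, smoothing=1.0):
    """Smoothed relative frequencies over a fixed inventory of bigrams."""
    counts = Counter(b for b in bigrams(words) if b in inventory)
    total = sum(counts.values()) + smoothing * len(inventory)
    return {b: (counts[b] + smoothing) / total for b in inventory}


def ngram_llr_score(test_words, speaker_model, background_model):
    """Sum of token log likelihood ratios, normalized by token count."""
    tokens = [b for b in bigrams(test_words) if b in speaker_model]
    if not tokens:
        return 0.0
    total = sum(math.log(speaker_model[b] / background_model[b]) for b in tokens)
    return total / len(tokens)


# Hypothetical inventory of "reasonably frequent" bigrams.
inventory = {("you", "know"), ("i", "think"), ("i", "mean")}

enrollment = "you know i think you know it is you know fine".split()
background = "i mean it is fine i think so i mean yes you know".split()
test = "well you know i think so".split()

speaker_model = train_bigram_model(enrollment, inventory)
background_model = train_bigram_model(background, inventory)
score = ngram_llr_score(test, speaker_model, background_model)
```

A positive score means the test transcript uses the inventory bigrams more like the enrolled speaker than like the background population.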
Speaker Recognition with SVMs • Each speech sample (training or test) generates a point in a derived feature space • The SVM is trained to separate the target sample from the impostor (= UBM) samples • Scores are computed as the Euclidean distance from the decision hyperplane to the test sample point • SVM training is biased against misclassifying positive examples (typically very few, often just 1) • [Figure: background samples, target sample, and test sample as points in the feature space]
Feature Transforms for SVMs • SVMs have been a boon for higher-level (as well as cepstral) speaker recognition research: they allow great flexibility in the choice of features • However, we need a “sequence kernel” • Dominant approach: transform variable-length feature stream into fixed, finite-dimensional feature space • Then use linear kernel • All the action is in the feature transform!
Combination of Systems • Systems work best in combination, especially ones using “higher level” features • Need to estimate optimal combination weights, e.g., with a neural network • Combination weights trained on a held-out development dataset • [Diagram: GMM, MLLR, word HMM, and phone N-gram systems → neural network combiner]
Variability: The Achilles Heel... • Variability (extrinsic & intrinsic) in the spectrum can cause error • Data of focus has mainly been extrinsic • “Channel” mismatch: • Microphone • carbon-button, hands-free, ... • Acoustic environment • Office, car, airport, ... • Transmission channel • Landline, cellular, VoIP, ... • Three compensation approaches: • Feature-based • Model-based • Score-based • Compensation techniques help reduce error.
NIST Speaker Verification Evaluations • Annual NIST evaluations of speaker verification technology (since 1996) • Aim: Provide a common paradigm for comparing technologies • Focus: Conversational telephone speech (text-independent) • [Diagram: evaluate/improve cycle linking the data provider (Linguistic Data Consortium), the evaluation coordinator, and technology developers; comparison of technologies on a common task]
The NIST Evaluation Task • Conversational telephone speech, interview • Landline, cellular, hands-free, multiple-mics in room • 5 min of conversations between two speakers • Various conditions, e.g., • Training: 8, 1, or other number of conversation sides • Test: 1 conversation side, 30 secs, etc. • Evaluation: • Equal Error Rate (EER) • Decision Cost Function (DCF), with (C_miss, C_fa, P_target) = (10, 1, 0.01)
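The decision cost function can be written out as a short calculation. The sketch assumes the triple (10, 1, 0.01) on this slide denotes the cost of a miss, the cost of a false alarm, and the prior probability of a target trial, which is the usual NIST convention; the function name and the example error rates are hypothetical.

```python
def detection_cost(p_miss, p_false_alarm, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Expected cost of a detector at one operating point.

    DCF = C_miss * P(miss) * P_target + C_fa * P(false alarm) * (1 - P_target)
    """
    return c_miss * p_miss * p_target + c_fa * p_false_alarm * (1.0 - p_target)


# Hypothetical operating point: 10% misses and 2% false alarms.
cost = detection_cost(p_miss=0.10, p_false_alarm=0.02)
```

With a target prior of only 0.01, false alarms carry most of the cost despite their lower unit cost, so the minimum of this function tends to sit at a low false alarm operating point.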
The End • What’s one interesting thing you learned today that you might share with a friend over dinner?
Word Conditional Models -- example • Boakye et al. (2004) • 19 words and bi-grams • Discourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean} • Filled pauses: {um, uh} • Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know } • Trained whole-word HMMs, instead of GMMs, to model evolution of speech in time • Combines well with low-level (i.e., cepstral GMM) system, especially with more training data
Phone N-Grams -- example • Idea (Hatch et al., ‘05): model the pattern of phone usage or “short term pronunciation” for a speaker • Use open-loop phone recognition to obtain phone hypotheses • Create models of relative frequencies of phone n-grams of the speaker vs. “others” • Use SVM for modeling • Combines well, esp. with increased data • Works across languages