Speaker Verification via Kernel Methods
Speaker: Yi-Hsiang Chao  Advisor: Hsin-Min Wang
OUTLINE • Current Methods for Speaker Verification • Proposed Methods for Speaker Verification • Kernel Methods for Speaker Verification • Experiments • Conclusions
What is speaker verification?
• Goal: To determine whether a speaker is who he or she claims to be.
• Speaker verification is a hypothesis testing problem. Given an input utterance U, two hypotheses are considered:
H0: U is from the target speaker. (the null hypothesis)
H1: U is not from the target speaker. (the alternative hypothesis)
• The Likelihood Ratio (LR) test:
L(U) = p(U|λ) / p(U|λ̄)  { ≥ θ  accept H0;  < θ  accept H1 }   (1)
• Mathematically, H0 and H1 can be represented by parametric models denoted as λ and λ̄, respectively. λ̄ is often called an anti-model.
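A minimal sketch of the LR decision in Eq. (1), computed in the log domain. It assumes the target model λ and anti-model λ̄ are scikit-learn GaussianMixture objects and that U is a matrix of feature frames; the function names and the use of scikit-learn are illustrative assumptions, not part of the original slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def utterance_log_likelihood(model: GaussianMixture, frames: np.ndarray) -> float:
    # log p(U | model) as the sum of per-frame log-likelihoods
    return float(model.score_samples(frames).sum())

def lr_test(target_model, anti_model, frames, log_theta: float = 0.0) -> bool:
    # Eq. (1) in the log domain: accept H0 if log p(U|lambda) - log p(U|lambda_bar) >= log(theta)
    log_lr = (utterance_log_likelihood(target_model, frames)
              - utterance_log_likelihood(anti_model, frames))
    return log_lr >= log_theta
```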
Current Methods for Speaker Verification
• The anti-model λ̄ is usually ill-defined, since H1 does not involve any specific speaker and thus lacks explicit data for modeling.
• Many approaches have been proposed in attempts to characterize H1:
• One simple approach is to train a single speaker-independent model, named the world model or the Universal Background Model (UBM) [D. A. Reynolds, et al., 2000]. The resulting LR measure is
L1(U) = p(U|λ) / p(U|Ω), where Ω denotes the UBM.
• The UBM training data are collected from a large number of speakers, generally unrelated to the clients.
Current Methods for Speaker Verification
• Instead of using a single model, an alternative way is to train a set of cohort models {λ1, λ2,…, λB}. This gives the following possibilities in computing the LR:
• Picking the likelihood of the most competitive model [A. Higgins, et al., 1991]:
L2(U) = p(U|λ) / max_{1≤i≤B} p(U|λi)
• Averaging the likelihoods of the B cohort models arithmetically [D. A. Reynolds, 1995]:
L3(U) = p(U|λ) / [ (1/B) Σ_{i=1}^{B} p(U|λi) ]
• Averaging the likelihoods of the B cohort models geometrically [C. S. Liu, et al., 1996]:
L4(U) = p(U|λ) / [ Π_{i=1}^{B} p(U|λi) ]^{1/B}
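The cohort-based measures differ only in how the denominator of Eq. (1) aggregates the cohort likelihoods. The sketch below computes all three in the log domain; it assumes each model is a scikit-learn GaussianMixture (an assumption for illustration; any model exposing log-likelihoods would do).

```python
import numpy as np
from scipy.special import logsumexp

def cohort_log_lrs(target_model, cohort_models, frames):
    ll_target = target_model.score_samples(frames).sum()
    ll_cohort = np.array([m.score_samples(frames).sum() for m in cohort_models])
    B = len(cohort_models)
    return {
        # L2: most competitive cohort model (Higgins et al., 1991)
        "L2": ll_target - ll_cohort.max(),
        # L3: arithmetic mean of the cohort likelihoods (Reynolds, 1995)
        "L3": ll_target - (logsumexp(ll_cohort) - np.log(B)),
        # L4: geometric mean of the cohort likelihoods (Liu et al., 1996)
        "L4": ll_target - ll_cohort.mean(),
    }
```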
Current Methods for Speaker Verification
• Selection of the cohort set
• Two cohort selection methods [D. A. Reynolds, 1995] are used:
• One selects the B closest speakers to each client (used with L2, L3, and L4).
• The other selects the B/2 closest speakers to, plus the B/2 farthest speakers from, each client (used with L3).
• The selection is based on the speaker distance measure [D. A. Reynolds, 1995], computed by
d(λi, λj) = log [ p(Ui|λi) / p(Ui|λj) ] + log [ p(Uj|λj) / p(Uj|λi) ],
where λi and λj are speaker models trained using the i-th speaker's training utterances Ui and the j-th speaker's training utterances Uj, respectively.
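A sketch of cohort selection based on the distance measure above; the helper names and the way models and training utterances are stored are assumptions made for illustration.

```python
import numpy as np

def speaker_distance(model_i, model_j, frames_i, frames_j):
    # d(i, j) = log[p(Ui|li)/p(Ui|lj)] + log[p(Uj|lj)/p(Uj|li)] in the log domain
    return ((model_i.score_samples(frames_i).sum() - model_j.score_samples(frames_i).sum())
            + (model_j.score_samples(frames_j).sum() - model_i.score_samples(frames_j).sum()))

def select_cohort(client, models, frames, B, mix_farthest=False):
    """Return indices of the cohort speakers for one client:
    the B closest speakers, or B/2 closest plus B/2 farthest."""
    others = [k for k in range(len(models)) if k != client]
    d = np.array([speaker_distance(models[client], models[k], frames[client], frames[k])
                  for k in others])
    order = np.argsort(d)                    # smallest distance = closest speaker
    picked = (list(order[:B // 2]) + list(order[-(B // 2):])) if mix_farthest else list(order[:B])
    return [others[k] for k in picked]
```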
Current Methods for Speaker Verification
• The Null Hypothesis Characterization
• The client model λ is represented by a Gaussian Mixture Model (GMM):
p(u|λ) = Σ_{c=1}^{C} wc N(u; μc, Σc),
where wc, μc, and Σc are the weight, mean vector, and covariance matrix of the c-th mixture component.
• λ can be trained via the ML criterion by using the Expectation-Maximization (EM) algorithm.
• λ can also be derived from the UBM using MAP adaptation (the adapted GMM).
• The adapted GMM together with the L1 measure is what we term the GMM-UBM system [D. A. Reynolds, et al., 2000].
• Currently, GMM-UBM is the state-of-the-art approach.
• This method is appropriate for the Text-Independent (TI) task.
• Advantage: the UBM provides coverage of unseen data.
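A sketch of how the adapted GMM can be obtained from the UBM by mean-only MAP adaptation, a common recipe in GMM-UBM systems; the relevance factor, the use of scikit-learn, and the mean-only restriction are assumptions, not necessarily the authors' exact setup.

```python
import numpy as np
from copy import deepcopy
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, frames: np.ndarray, relevance: float = 16.0):
    post = ubm.predict_proba(frames)                      # frame-component posteriors (T x C)
    n_c = post.sum(axis=0)                                # soft occupation counts per component
    first = (post.T @ frames) / np.maximum(n_c, 1e-10)[:, None]   # E_c[x] per component
    alpha = n_c / (n_c + relevance)                       # data-dependent adaptation coefficients
    client = deepcopy(ubm)
    client.means_ = alpha[:, None] * first + (1.0 - alpha)[:, None] * ubm.means_
    return client                                         # adapted GMM for the claimed speaker
```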
Proposed Methods for Speaker Verification
• Motivation:
• None of the LR measures developed so far has proved to be absolutely superior to the others across all tasks and applications.
• We propose two perspectives in an attempt to better characterize the ill-defined alternative hypothesis.
• Perspective 1: Optimal combination of the existing LRs.
• Perspective 2: A novel characterization of the alternative hypothesis.
Perspective 1: The Proposed Combined LR (ICPR2006)
• The pros and cons of different LR measures motivate us to combine them into a unified framework, by virtue of the complementary information that each LR can contribute.
• Given N different LR measures Li(U), i = 1, 2,…, N, we define a combined LR measure by
f(x) = w^T x + b,   (2)
where x = [L1(U), L2(U),…, LN(U)]^T is an N×1 vector in the space RN, w = [w1, w2,…, wN]^T is an N×1 weight vector, and b is a bias.
Linear Discriminant Classifier
• f(x) in Eq. (2) forms a so-called linear discriminant classifier.
• This classifier translates the goal of solving an LR measure into the optimization of w and b, such that the utterances of clients and impostors can be separated.
• To realize this classifier, three distinct data sets are needed:
• One for generating each client's model.
• One for generating each client's anti-models.
• One for optimizing w and b.
Linear Discriminant Classifier
• The bias b plays the same role as the decision threshold θ of the LR defined in Eq. (1); it can be determined through a trade-off between false acceptance and false rejection.
• The main goal here is to find w.
• f(x) can be solved via linear discriminant training algorithms, such as:
• Fisher's Linear Discriminant (FLD).
• Linear Support Vector Machine (Linear SVM).
• Perceptron.
Linear Discriminant Classifier
• Using Fisher's Linear Discriminant (FLD)
• Suppose the i-th class has ni data samples x_1^(i), x_2^(i),…, x_ni^(i), i = 1, 2.
• The goal of FLD is to seek a direction w such that the following Fisher's criterion function J(w) is maximized:
J(w) = (w^T Sb w) / (w^T Sw w),
where Sb and Sw are, respectively, the between-class scatter matrix and the within-class scatter matrix, defined as
Sb = (m1 − m2)(m1 − m2)^T,
Sw = Σ_{i=1,2} Σ_{j=1}^{ni} (x_j^(i) − mi)(x_j^(i) − mi)^T,
where mi is the mean vector of the i-th class.
Linear Discriminant Classifier
• Using Fisher's Linear Discriminant (FLD)
• The solution for w, which maximizes the Fisher's criterion J(w), is the leading eigenvector of Sw^{-1} Sb.
• Since Sb has rank one, w can be directly calculated as
w = Sw^{-1} (m1 − m2).   (3)
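A compact NumPy sketch of Eq. (3); the two input matrices hold the client-class and impostor-class vectors (combined-LR vectors here, characteristic vectors later).

```python
import numpy as np

def fld_direction(X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
    # w = Sw^{-1} (m1 - m2), Eq. (3)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.zeros((X1.shape[1], X1.shape[1]))
    for X, m in ((X1, m1), (X2, m2)):
        D = X - m
        Sw += D.T @ D                        # within-class scatter
    return np.linalg.solve(Sw, m1 - m2)
```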
Analysis of the Alternative Hypothesis
• The LR approaches that have been proposed to characterize H1 can be collectively expressed in the following general form:
L(U) = p(U|λ) / F( p(U|λ1), p(U|λ2),…, p(U|λN) ),   (4)
where F(·) is some function of the likelihood values from a set of so-called background models {λ1, λ2,…, λN}.
• For example, F(·) can be the average function for L3(U), the maximum for L2(U), or the geometric mean for L4(U), and the background model set here can be obtained from a cohort.
• A special case arises when F(·) is an identity function and N = 1. In this instance, a single background model is used, which gives L1(U).
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)
• We redesign the function F(·) as
F(·) = Π_{i=1}^{N} p(U|λi)^{wi},   (5)
where w = [w1, w2,…, wN]^T is an N×1 weight vector and wi is the weight of the likelihood p(U|λi), i = 1, 2,…, N.
• This function gives the N background models different weights according to their individual contributions to the alternative hypothesis.
• It is clear that Eq. (5) is equivalent to a geometric mean function when wi = 1/N for all i.
• It is also clear that Eq. (5) will reduce to a maximum function when wi = 1 for the background model with the largest likelihood and wi = 0 for all the others.
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)
• By substituting Eq. (5) into Eq. (4), taking the logarithm, and normalizing the weights to sum to one, the LR test becomes
f(x) = w^T x  { ≥ log θ  accept H0;  < log θ  accept H1 },   (6)
where w = [w1, w2,…, wN]^T is an N×1 weight vector and x is an N×1 vector in the space RN, expressed by
x = [ log p(U|λ) − log p(U|λ1), log p(U|λ) − log p(U|λ2),…, log p(U|λ) − log p(U|λN) ]^T.   (7)
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)
• The implicit idea in Eq. (7) is that the speech utterance U can be represented by a characteristic vector x.
• If we replace the threshold in Eq. (6) with a bias b, the equation can be rewritten as
f(x) = w^T x + b,   (8)
which is analogous to the combined LR method in Eq. (2).
• f(x) in Eq. (8) forms a linear discriminant classifier again, which can be solved via linear discriminant training algorithms, such as FLD.
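A small sketch of Eqs. (7) and (8) in the log domain, assuming the client and background models expose utterance log-likelihoods as in the earlier sketches; the helper names are illustrative.

```python
import numpy as np

def characteristic_vector(target_model, background_models, frames) -> np.ndarray:
    # Eq. (7): one component per background model, x_i = log p(U|lambda) - log p(U|lambda_i)
    ll_target = target_model.score_samples(frames).sum()
    return np.array([ll_target - m.score_samples(frames).sum() for m in background_models])

def linear_verify(x: np.ndarray, w: np.ndarray, b: float) -> bool:
    # Eq. (8): accept the claimed speaker if f(x) = w^T x + b >= 0
    return float(w @ x + b) >= 0.0
```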
Perspective 2: The Novel Alternative Hypothesis Characterization (submitted to ISCSLP2006)
• Relation to Perspective 1: the combined LR measure.
• If anti-models λ̄1, λ̄2,…, λ̄N are used instead of the background models in the characteristic vector x defined in Eq. (7), each component of x becomes an individual log LR, and f(x) forms a linear combination of N different LR measures, which is the same as the combined LR measure.
Kernel Methods for Speaker Verification
• f(x) can be solved via linear discriminant training algorithms.
• However, such methods are based on the assumption that the observed data of different classes is linearly separable, which is not the case in most practical applications.
• We therefore hope that data from different classes, which is not linearly separable in the original input space RN, can be separated linearly in an implicit higher dimensional (maybe infinite dimensional) feature space F via a nonlinear mapping Φ.
• Let Φ(x) denote a vector obtained by mapping x from RN to F. f(x) can be re-defined as
f(x) = w^T Φ(x) + b,   (9)
which constitutes a linear discriminant classifier in F.
Kernel Methods for Speaker Verification
• In practice, it is difficult to determine what kind of mapping Φ is applicable; hence, the computation of Φ(x) can be infeasible.
• We propose using the kernel method: it characterizes the relationship between the data samples in F, instead of computing Φ(x) directly.
• This is achieved by introducing a kernel function
k(x, y) = Φ(x)^T Φ(y),   (10)
which is the inner product of two vectors Φ(x) and Φ(y) in F.
Kernel Methods for Speaker Verification
• The kernel function k(·) must be symmetric, positive definite, and conform to Mercer's condition. For example:
• The dot product kernel: k(x, y) = x^T y.
• The d-th degree polynomial kernel: k(x, y) = (x^T y + 1)^d.
• The Radial Basis Function (RBF) kernel: k(x, y) = exp(−||x − y||^2 / (2σ^2)), where σ is a tunable parameter.
• Existing kernel-based classification techniques can be applied to implement f(x), such as:
• Support Vector Machine (SVM).
• Kernel Fisher Discriminant (KFD).
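Minimal implementations of the three example kernels; the exact polynomial form used in the original work is not shown on the slides, so the "+1" offset below is an assumption.

```python
import numpy as np

def dot_kernel(x, y):
    return float(np.dot(x, y))

def poly_kernel(x, y, d=2):
    # One common d-th degree polynomial kernel
    return float((np.dot(x, y) + 1.0) ** d)

def rbf_kernel(x, y, sigma=1.0):
    # Radial Basis Function kernel; sigma is the tunable width parameter
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * sigma ** 2)))
```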
Kernel Methods for Speaker Verification
• Support Vector Machine (SVM)
• Techniques based on SVM have been successfully applied to many classification and regression tasks.
• Conventional LR: if the probabilities are perfectly estimated (which is usually not the case), then the Bayes decision rule is the optimal decision.
• However, a better solution should in theory be to use a discriminant framework [V. N. Vapnik, 1995].
• [S. Bengio, et al., 2001] argued that the probability estimates are not perfect and that a better decision function would be
a1 log p(U|λ) + a2 log p(U|λ̄) + b > 0 (accept),
where a1, a2, and b are adjustable parameters estimated using an SVM.
Kernel Methods for Speaker Verification
• Support Vector Machine (SVM)
• [S. Bengio, et al., 2001] incorporated the two scores obtained from the GMM and the UBM with an SVM.
• Comparison with our approach:
• [S. Bengio, et al., 2001] only used one simple background model, the UBM, as the alternative hypothesis characterization.
• Our approach integrates multiple background models for the alternative hypothesis characterization in a more effective and robust way.
Kernel Methods for Speaker Verification
• Support Vector Machine (SVM)
• The goal of SVM is to seek a separating hyperplane in the feature space F that maximizes the margin between classes.
[Figure: two linear classifiers (a) and (b), with support vectors, the optimal margin, and the optimal hyperplane marked; the classifier in (b) has a greater separation distance than (a).]
Kernel Methods for Speaker Verification
• Support Vector Machine (SVM)
• Following the theory of SVM, w can be expressed as
w = Σ_{j=1}^{l} αj yj Φ(xj),
which yields
f(x) = Σ_{j=1}^{l} αj yj k(xj, x) + b,
where each training sample xj belongs to one of the two classes identified by the label yj ∈ {−1, 1}, j = 1, 2,…, l.
Kernel Methods for Speaker Verification
• Support Vector Machine (SVM)
• Let α^T = [α1, α2,…, αl]. Our goal now changes from finding w to finding α.
• We can find the coefficients αj by maximizing the objective function
Q(α) = Σ_{j=1}^{l} αj − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} αi αj yi yj k(xi, xj),
subject to the constraints
Σ_{j=1}^{l} αj yj = 0 and 0 ≤ αj ≤ C, j = 1, 2,…, l,
where C is a penalty parameter.
• The above optimization problem can be solved using quadratic programming techniques.
Kernel Methods for Speaker Verification
• Support Vector Machine (SVM)
• Note that most αj are equal to zero; the training samples with non-zero αj are called support vectors.
• A few support vectors act as the key to deciding the optimal margin between classes in the SVM.
• An SVM with a dot product kernel function, i.e., k(x, y) = x^T y, is known as a linear SVM.
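A sketch of the SVM realization of f(x) using scikit-learn, assuming the characteristic vectors of client utterances (label +1) and impostor utterances (label −1) from the development data are already available; the variable names and the RBF-kernel choice are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_svm_verifier(X_dev: np.ndarray, y_dev: np.ndarray, C: float = 1.0, sigma: float = 1.0):
    # gamma = 1 / (2 sigma^2) maps sklearn's RBF parameterization onto the slides' notation
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2))
    clf.fit(X_dev, y_dev)
    return clf

def svm_score(clf, x: np.ndarray) -> float:
    # Signed distance to the separating hyperplane; compare with a tuned threshold/bias
    return float(clf.decision_function(np.atleast_2d(x))[0])
```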
Kernel Methods for Speaker Verification
• Kernel Fisher Discriminant (KFD)
• Alternatively, f(x) in Eq. (9) can be solved with KFD.
• In fact, the purpose of KFD is to apply FLD in the feature space F. We also need to maximize the Fisher's criterion:
J(w) = (w^T Sb^Φ w) / (w^T Sw^Φ w),
where Sb^Φ and Sw^Φ are, respectively, the between-class and the within-class scatter matrices in F, i.e.,
Sb^Φ = (m1^Φ − m2^Φ)(m1^Φ − m2^Φ)^T,
Sw^Φ = Σ_{i=1,2} Σ_{j=1}^{ni} (Φ(x_j^(i)) − mi^Φ)(Φ(x_j^(i)) − mi^Φ)^T,
where mi^Φ is the mean vector of the i-th class in F.
Kernel Methods for Speaker Verification
• Kernel Fisher Discriminant (KFD)
• Let the training set be {x1, x2,…, xl} with l = n1 + n2, and let K denote the l×l kernel matrix with entries Kjk = k(xj, xk).
• According to the theory of reproducing kernels, the solution w must lie in the span of all training data samples mapped into F, so w can be expressed as
w = Σ_{j=1}^{l} αj Φ(xj).
• Accordingly, f(x) in Eq. (9) can be re-written as
f(x) = Σ_{j=1}^{l} αj k(xj, x) + b.
• Let α^T = [α1, α2,…, αl]. Our goal therefore changes from finding w to finding α, which maximizes
J(α) = (α^T M α) / (α^T N α),
Kernel Methods for Speaker Verification
• Kernel Fisher Discriminant (KFD)
where M = (M1 − M2)(M1 − M2)^T with (Mi)j = (1/ni) Σ_{k=1}^{ni} k(xj, x_k^(i)), and
N = Σ_{i=1,2} Ki (Ini − 1ni) Ki^T,
where Ki is the l×ni kernel matrix between all training samples and the samples of the i-th class, Ini is an ni×ni identity matrix, and 1ni is an ni×ni matrix with all entries 1/ni.
• The solution for α is analogous to FLD in Eq. (3):
α = N^{-1} (M1 − M2),
which is also the leading eigenvector of N^{-1}M.
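A compact NumPy sketch of the KFD solution α = N^{-1}(M1 − M2); the small ridge added to N for numerical stability and the generic kernel argument are assumptions made for illustration.

```python
import numpy as np

def kfd_train(X: np.ndarray, y: np.ndarray, kernel, reg: float = 1e-3):
    """X: (l x N) training vectors, y: labels in {0, 1}, kernel(a, b) -> scalar.
    Returns alpha and a scoring function f(x) = sum_j alpha_j k(x_j, x)."""
    l = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])   # l x l kernel matrix
    M, Nmat = [], np.zeros((l, l))
    for c in (0, 1):
        idx = np.where(y == c)[0]
        K_c = K[:, idx]                                     # l x n_c block
        M.append(K_c.mean(axis=1))                          # class kernel mean M_c
        n_c = len(idx)
        H = np.eye(n_c) - np.full((n_c, n_c), 1.0 / n_c)
        Nmat += K_c @ H @ K_c.T                             # within-class matrix N
    alpha = np.linalg.solve(Nmat + reg * np.eye(l), M[0] - M[1])
    score = lambda x: float(sum(a * kernel(xj, x) for a, xj in zip(alpha, X)))
    return alpha, score
```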
Experiments
• Formation of the Characteristic Vector
• In our methods, we use B+1 background models, consisting of B cohort models and one world model, to form the characteristic vector x.
• Two cohort selection methods are used in the experiments:
• B closest speakers.
• B/2 closest speakers + B/2 farthest speakers.
• These yield the following two (B+1)×1 characteristic vectors:
x = [ log p(U|λ) − log p(U|Ω), log p(U|λ) − log p(U|λ1c),…, log p(U|λ) − log p(U|λBc) ]^T,
x = [ log p(U|λ) − log p(U|Ω), log p(U|λ) − log p(U|λ1c),…, log p(U|λ) − log p(U|λ(B/2)c), log p(U|λ) − log p(U|λ1f),…, log p(U|λ) − log p(U|λ(B/2)f) ]^T,
where Ω is the world model, and λic and λif are, respectively, the i-th closest model and the i-th farthest model of the client model λ.
Experiments
• Detection Cost Function (DCF)
• The NIST Detection Cost Function (DCF) reflects the performance at a single operating point on the DET curve. The DCF is defined as
DCF = CMiss × PMiss × PTarget + CFalseAlarm × PFalseAlarm × (1 − PTarget),
where
• PMiss and PFalseAlarm are the miss probability and the false-alarm probability, respectively.
• CMiss and CFalseAlarm are the respective relative costs of detection errors.
• PTarget is the a priori probability of the specific target speaker.
• A special case of the DCF is known as the Half Total Error Rate (HTER), where CMiss and CFalseAlarm are both equal to 1 and PTarget = 0.5, i.e.,
HTER = (PMiss + PFalseAlarm) / 2.
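A direct transcription of the two cost formulas; the default cost and prior values in the DCF function are illustrative only, since the slides do not specify the ones used in the evaluations.

```python
def dcf(p_miss: float, p_fa: float,
        c_miss: float = 10.0, c_fa: float = 1.0, p_target: float = 0.01) -> float:
    # NIST Detection Cost Function at a single operating point
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def hter(p_miss: float, p_fa: float) -> float:
    # Half Total Error Rate: DCF with c_miss = c_fa = 1 and p_target = 0.5
    return 0.5 * (p_miss + p_fa)
```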
Experiments - XM2VTSDB
• "Training" subset: used to build each client's model and anti-models.
• "Evaluation" subset: used to estimate w, b, and the decision threshold.
• "Test" subset: used for the performance evaluation.
• Each recording session contains three sentences:
1. "0 1 2 3 4 5 6 7 8 9".
2. "5 0 6 9 2 8 1 3 7 4".
3. "Joe took father's green shoe bench out".
Experimental results (ICPR2006) - XM2VTSDB
• For perspective 1: the proposed combined LR.
• Further analysis of the results via the equal error rate (EER) showed that a 13.2% relative improvement was achieved by KFD (EER = 4.6%), compared with the 5.3% EER of L3(U).
• Figure 1. Baselines vs. the combined LRs: DET curves for the "Test" subset.
Experimental results (submitted to ISCSLP2006) - XM2VTSDB
• For perspective 2: the novel alternative hypothesis characterization.
• A 30.68% relative improvement was achieved by KFD_w_20c, compared with L3_10c_10f, the best baseline system.
Experimental results (submitted to ISCSLP2006) - XM2VTSDB
• For perspective 2: the proposed novel alternative hypothesis characterization.
• Figure 2. Best baselines vs. our proposed LRs: DET curves for the "Test" subset.
Evaluation on the ISCSLP2006-SRE database
• For perspective 2: the proposed novel alternative hypothesis characterization, in the text-independent speaker verification task.
• We observe that KFD_w_50c_50f achieved a 34.08% relative improvement over GMM-UBM.
Evaluation on the ISCSLP2006-SRE database
• We participated in the text-independent speaker verification task of the ISCSLP2006 Speaker Recognition Evaluation (SRE) plan.
• The evaluation results are given above.
Conclusions
• We have introduced current LR systems for speaker verification.
• We have presented two proposed LR systems:
• The combined LR system.
• The new LR system with the novel alternative hypothesis characterization.
• Both proposed LR systems can be formulated as a linear or non-linear discriminant classifier.
• Non-linear classifiers can be implemented by using kernel methods:
• Kernel Fisher Discriminant (KFD).
• Support Vector Machine (SVM).
• Experiments conducted on two speaker verification tasks, the XM2VTSDB task and the ISCSLP2006-SRE task, demonstrated the superiority of our methods over conventional approaches.