Variational Bayesian Methods for Audio Indexing. Fabio Valente, Christian Wellekens Institut Eurecom. Outline. Generalities on speaker clustering Model selection/BIC Variational learning Variational model selection Results. Speaker clustering.
Variational Bayesian Methods for Audio Indexing Fabio Valente, Christian Wellekens Institut Eurecom
Outline • Generalities on speaker clustering • Model selection/BIC • Variational learning • Variational model selection • Results
Speaker clustering • Many applications (speaker indexing, speech recognition) require clustering segments with the same characteristics, e.g. speech from the same speaker. • Goal: group together speech segments from the same speaker. • Fully connected (ergodic) HMM topology with a duration constraint. Each state represents a speaker. • When the number of speakers is not known, it must be estimated with a model selection criterion (e.g. BIC).
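Model selection Given data Y and a model m, the optimal model maximizes the posterior: $p(m|Y) = \frac{p(Y|m)\,p(m)}{p(Y)}$ If the prior p(m) is uniform, the decision depends only on p(Y|m) (a.k.a. the marginal likelihood). Bayesian modeling assumes distributions over the parameters θ. The criterion is thus the marginal likelihood: $p(Y|m) = \int p(Y|\theta,m)\,p(\theta|m)\,d\theta$ This integral is prohibitive to compute for some models (HMM, GMM).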
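As a minimal sketch of marginal-likelihood model selection, consider a toy coin-flip problem (an assumption for illustration, not the speaker-clustering models of the slides) where the integral over the parameter has a closed form. Model `m0` fixes the bias at 0.5; model `m1` places a Beta(a, b) prior on the bias. All function names are hypothetical.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_fair(heads, tails):
    """Model m0: bias fixed at 0.5, so there is no parameter to integrate out."""
    return (heads + tails) * math.log(0.5)

def log_marginal_beta(heads, tails, a=1.0, b=1.0):
    """Model m1: bias theta ~ Beta(a, b); the integral over theta is closed-form."""
    return log_beta(a + heads, b + tails) - log_beta(a, b)

def best_model(heads, tails):
    """With a uniform prior over models, pick the larger marginal likelihood."""
    scores = {
        "m0": log_marginal_fair(heads, tails),
        "m1": log_marginal_beta(heads, tails),
    }
    return max(scores, key=scores.get)

print(best_model(5, 5))   # balanced data favour the simpler model m0
print(best_model(10, 0))  # strongly biased data favour m1
```

The marginal likelihood penalizes the more flexible model automatically: `m1` wins only when the data are poorly explained by the fixed bias.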
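Bayesian information criterion (BIC) First-order approximation obtained from the Laplace approximation of the marginal likelihood (Schwarz, 1978): $\mathrm{BIC}(m) = \ln p(Y|\hat{\theta},m) - \frac{d}{2}\ln n$ where $\hat{\theta}$ is the maximum-likelihood estimate, d the number of free parameters and n the number of data points. Generally, the penalty is multiplied by a constant λ (threshold): $\mathrm{BIC}(m) = \ln p(Y|\hat{\theta},m) - \lambda\,\frac{d}{2}\ln n$ BIC does not depend on the parameter distributions! Asymptotically (n large), BIC converges to the log-marginal likelihood.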
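The thresholded criterion can be sketched as follows, assuming the usual form of BIC (maximum log-likelihood minus threshold × (d/2) ln n, with d free parameters and n data points). The candidate log-likelihoods and parameter counts below are invented for illustration only.

```python
import math

def bic_score(log_likelihood, num_params, num_samples, threshold=1.0):
    """Penalized log-likelihood; threshold scales the complexity penalty."""
    penalty = 0.5 * num_params * math.log(num_samples)
    return log_likelihood - threshold * penalty

def select_by_bic(candidates, num_samples, threshold=1.0):
    """candidates: list of (name, max log-likelihood, number of free parameters)."""
    best = max(
        candidates,
        key=lambda c: bic_score(c[1], c[2], num_samples, threshold),
    )
    return best[0]

# Hypothetical numbers, not taken from the experiments in the slides.
candidates = [
    ("1 speaker", -5200.0, 50),
    ("2 speakers", -5000.0, 100),
    ("3 speakers", -4990.0, 150),
]
print(select_by_bic(candidates, num_samples=1000))
print(select_by_bic(candidates, num_samples=1000, threshold=2.0))
```

With these numbers the selected model changes when the threshold is doubled, which is why the threshold has to be tuned in practice.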
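Variational Learning Introduce an approximate variational distribution q(θ) over the parameters. Applying Jensen's inequality: $\ln p(Y|m) = \ln \int q(\theta)\,\frac{p(Y,\theta|m)}{q(\theta)}\,d\theta \;\ge\; \int q(\theta)\,\ln\frac{p(Y,\theta|m)}{q(\theta)}\,d\theta = F_m(q)$ Maximization of ln p(Y|m) is then replaced by maximization of the lower bound $F_m(q)$, the free energy.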
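A small numerical check of the Jensen bound, under simplifying assumptions that are not in the slides: a coin-flip likelihood, and a parameter θ restricted to a finite grid with a uniform prior so that integrals become sums. The free energy of any distribution q never exceeds the log marginal likelihood, and the two coincide when q is the exact posterior.

```python
import math

def log_joint(theta, heads, tails, prior):
    """ln p(Y, theta) for a coin with bias theta and prior mass `prior`."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta) + math.log(prior)

def free_energy(q, thetas, heads, tails):
    """F(q) = sum_theta q(theta) * ln( p(Y, theta) / q(theta) )."""
    prior = 1.0 / len(thetas)
    return sum(
        q_i * (log_joint(t, heads, tails, prior) - math.log(q_i))
        for q_i, t in zip(q, thetas)
        if q_i > 0.0
    )

def log_evidence(thetas, heads, tails):
    """ln p(Y) = ln sum_theta p(Y, theta), computed with log-sum-exp."""
    prior = 1.0 / len(thetas)
    terms = [log_joint(t, heads, tails, prior) for t in thetas]
    top = max(terms)
    return top + math.log(sum(math.exp(v - top) for v in terms))

def exact_posterior(thetas, heads, tails):
    """p(theta | Y) on the grid."""
    prior = 1.0 / len(thetas)
    log_z = log_evidence(thetas, heads, tails)
    return [math.exp(log_joint(t, heads, tails, prior) - log_z) for t in thetas]

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
heads, tails = 7, 3
uniform_q = [1.0 / len(thetas)] * len(thetas)
posterior_q = exact_posterior(thetas, heads, tails)

print(free_energy(uniform_q, thetas, heads, tails))    # strictly below ln p(Y)
print(free_energy(posterior_q, thetas, heads, tails))  # equals ln p(Y)
print(log_evidence(thetas, heads, tails))
```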
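Variational Learning with hidden variables Sometimes model optimization requires hidden variables (e.g. the state sequence in EM). If x is the hidden variable, we can write: $\ln p(Y|m) \;\ge\; \int d\theta \sum_x q(x,\theta)\,\ln\frac{p(Y,x,\theta|m)}{q(x,\theta)} = F_m(q)$ Independence hypothesis: $q(x,\theta) = q(x)\,q(\theta)$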
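EM-like algorithm Under the hypothesis $q(x,\theta) = q(x)\,q(\theta)$: E-step: $q(x) \propto \exp\left\langle \ln p(Y,x|\theta,m)\right\rangle_{q(\theta)}$ M-step: $q(\theta) \propto p(\theta|m)\,\exp\left\langle \ln p(Y,x|\theta,m)\right\rangle_{q(x)}$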
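The alternation can be illustrated on a deliberately reduced problem, which is an assumption of this sketch rather than the HMM/GMM system of the slides: a 1-D Gaussian mixture whose component means and variance are known, so that only the mixture weights (with a symmetric Dirichlet prior) are learned. The hidden variable x is the component assignment of each point; q(θ) is a Dirichlet with parameters `alpha`. The `digamma` helper is a hand-written approximation, used here to avoid external dependencies.

```python
import math

def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def log_gauss(y, mean, std):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((y - mean) / std) ** 2

def vb_mixture_weights(data, means, std=1.0, alpha0=0.1, iterations=50):
    """Return the Dirichlet parameters of q(weights) after VB iterations."""
    k = len(means)
    alpha = [alpha0 + len(data) / k] * k
    for _ in range(iterations):
        # E-step: q(x) uses the expected log-weights under q(theta).
        total = digamma(sum(alpha))
        expected_log_w = [digamma(a) - total for a in alpha]
        counts = [0.0] * k
        for y in data:
            logs = [expected_log_w[j] + log_gauss(y, means[j], std) for j in range(k)]
            top = max(logs)
            resp = [math.exp(v - top) for v in logs]
            norm = sum(resp)
            for j in range(k):
                counts[j] += resp[j] / norm
        # M-step: q(theta) is the prior updated with the expected counts.
        alpha = [alpha0 + c for c in counts]
    return alpha

data = [-0.5, 0.0, 0.5, 0.2, -0.2, 4.5, 5.0, 5.5, 5.2, 4.8]
alpha = vb_mixture_weights(data, means=[0.0, 5.0, 10.0])
print(alpha)
```

In this toy run the third component, which explains no data, keeps a Dirichlet parameter close to its prior value, illustrating how VB leaves unused components effectively switched off.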
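VB Model selection In the same way, an approximate posterior distribution q(m) over models can be defined: $\ln p(Y) \;\ge\; \sum_m q(m)\left[F_m + \ln\frac{p(m)}{q(m)}\right]$ Maximizing w.r.t. q(m) yields: $q(m) \propto \exp(F_m)\,p(m)$ Model selection is based on the free energy $F_m$: the best model maximizes q(m).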
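A minimal sketch of this selection step, assuming the standard variational result that q(m) is proportional to exp(F_m) p(m), where F_m is the free energy reached after training model m. The free-energy values below are hypothetical placeholders, not results from the slides.

```python
import math

def model_posterior(free_energies, priors=None):
    """Normalize exp(F_m) * p(m) over models, working in the log domain."""
    n = len(free_energies)
    if priors is None:
        priors = [1.0 / n] * n  # uniform prior over models
    logs = [f + math.log(p) for f, p in zip(free_energies, priors)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical free energies for models with 1, 2 and 3 speakers.
free_energies = [-1050.0, -1042.0, -1045.0]
q_m = model_posterior(free_energies)
best_index = max(range(len(q_m)), key=q_m.__getitem__)
print(q_m, best_index)
```

With a uniform prior, maximizing q(m) is the same as picking the model with the largest free energy, so the free energy can be used directly as the selection score.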
Experimental framework • BN-96 Hub4 evaluation data set • Initialize a model with N speakers (states) and train the system using VB and ML (or VB and MAP with a UBM) • Reduce the number of speakers from N-1 down to 1, training each model using VB and ML (or MAP) • Score the N models with VB and BIC and choose the best one • Three scores are reported: • Best score • Selected score (with VB or BIC) • Score obtained with the known number of speakers • Results are given in terms of acp (average cluster purity) and asp (average speaker purity)
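The slides do not spell out the purity formulas. The sketch below assumes the definitions commonly used in the speaker-clustering literature: with n_ij the number of frames in cluster i spoken by speaker j, the purity of cluster i is the sum over j of (n_ij / n_i)^2, acp is the frame-weighted average of the cluster purities, asp is the same quantity computed per speaker, and K = sqrt(acp · asp). The function name and the count matrices are illustrative.

```python
import math

def purity_scores(counts):
    """counts[i][j]: frames in cluster i spoken by speaker j.

    Returns (acp, asp, K) under the assumed definitions above.
    """
    columns = list(zip(*counts))
    total = sum(sum(row) for row in counts)
    # Frame-weighted cluster purity: sum_i sum_j n_ij^2 / n_i, divided by N.
    acp = sum(sum(n * n for n in row) / sum(row) for row in counts if sum(row) > 0) / total
    # Frame-weighted speaker purity: sum_j sum_i n_ij^2 / n_j, divided by N.
    asp = sum(sum(n * n for n in col) / sum(col) for col in columns if sum(col) > 0) / total
    return acp, asp, math.sqrt(acp * asp)

# Perfect clustering: each cluster holds exactly one speaker.
print(purity_scores([[100, 0], [0, 50]]))
# Under-clustering: two speakers merged into a single cluster.
print(purity_scores([[100, 50]]))
```

Merging speakers lowers acp while leaving asp at 1, and splitting one speaker across clusters does the opposite, so the two scores have to be read together.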
Dependence on threshold [Figures: K as a function of the threshold; number of speakers as a function of the threshold]
Conclusions and Future Work • VB uses the free energy for both parameter learning and model selection. • VB generalizes both the ML and MAP learning frameworks. • VB outperforms ML/BIC on 3 of the 4 BN files. • VB outperforms MAP/BIC on 4 of the 4 BN files. • Future work: repeat the experiments on other databases (e.g. NIST speaker diarization).
Data vs. Gaussian components [Figure: final number of Gaussian components as a function of the amount of data for each speaker]