Introduction to Speaker Diarization
Date: 2007/08/16
Speaker: Shih-Sian Cheng
Outline
• Speaker diarization
  • Problem formulation
  • A prototypical speaker diarization system
• Speaker segmentation
  • Problem formulation
  • Speaker segmentation using a fixed-size analysis window
  • Speaker segmentation using a variable-size analysis window
    • Bottom-up segmentation using BIC
    • Top-down segmentation using BIC
• Speaker clustering
  • Problem formulation
  • Hierarchical agglomerative clustering
  • Optimization-oriented approaches
• Two leading speaker diarization systems
  • LIMSI’s system
  • Cambridge’s system
Speaker diarization (Problem formulation)
• Problem formulation: the “who spoke when” task on a continuous audio stream (NIST RT03 Spring Eval.)
[Figure: an audio stream processed by speaker segmentation and then speaker clustering, with the resulting segments labeled Speaker 1, Speaker 2, and Speaker 3]
Speaker diarization (Problem formulation)
• Performance measure of the speaker diarization task (C. Barras et al., 2006; NIST RT03 Spring Eval.)
  • Find the mapping between reference speakers and hypothesis speakers that maximizes their total overlap in time. In the example on the slide, S1->A and S3->B.
• Applications
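The speaker mapping used by the performance measure can be illustrated with a small brute-force search. This is only a sketch: the function names, the segment format (lists of `(start, end)` times in seconds), and the toy data are assumptions made for illustration, and an official scoring tool may handle details such as scoring collars differently.

```python
from itertools import permutations


def overlap(seg_a, seg_b):
    """Length of the time overlap between two (start, end) segments."""
    return max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))


def best_mapping(reference, hypothesis):
    """Return (mapping, total_overlap) with the largest overlapped time.

    reference / hypothesis: dict mapping a speaker label to a list of
    (start, end) segments. Each reference speaker is mapped to at most
    one hypothesis speaker, and each hypothesis speaker is used once.
    """
    ref_labels = list(reference)
    hyp_labels = list(hypothesis)
    # Overlapped time for every (reference, hypothesis) label pair.
    table = {
        (r, h): sum(overlap(a, b) for a in reference[r] for b in hypothesis[h])
        for r in ref_labels
        for h in hyp_labels
    }
    # Pad with None so that a reference speaker may stay unmapped.
    padded = hyp_labels + [None] * len(ref_labels)
    best, best_score = {}, -1.0
    for perm in permutations(padded, len(ref_labels)):
        score = sum(table[(r, h)] for r, h in zip(ref_labels, perm) if h is not None)
        if score > best_score:
            best_score = score
            best = {r: h for r, h in zip(ref_labels, perm) if h is not None}
    return best, best_score


# Hypothetical toy data: three reference speakers, two hypothesis speakers.
reference = {"S1": [(0, 10)], "S2": [(10, 12)], "S3": [(12, 20)]}
hypothesis = {"A": [(0, 11)], "B": [(11, 20)]}
mapping, total = best_mapping(reference, hypothesis)
```

The exhaustive search is exponential in the number of speakers, which is acceptable for a handful of speakers; an assignment solver would be the usual choice for larger problems.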
Speaker diarization (Problem formulation)
• Example: Automatic transcription for a broadcast news show
[Figure labels: By speaker recognition; Speaker adaptation + speech recognition]
Speaker diarization (A prototypical system)
• The prototypical speaker diarization system (S. E. Tranter & D. A. Reynolds, 2006)
  • Filtering out non-speech data
  • Speaker segmentation (usually over-segmentation)
  • Speaker clustering
  • Change boundary refinement
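Speaker segmentation (Problem formulation)
• Problem formulation: detect the speaker change boundaries
• Performance measure
  • Error types: miss detection (a target change with no hypothesized change) and false alarm (a hypothesized change with no target change)
  • Performance metrics: ROC curve and F-score
    $$F = \frac{2PR}{P + R}$$
    where P is the precision rate and R is the recall rate
[Figure: target changes compared with hypothesized changes, marking one miss detection and one false alarm]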
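The error types and the F-score can be computed from two lists of change times. The sketch below assumes a hypothesized change counts as correct when it lies within a tolerance of a still-unmatched target change; the tolerance value, the greedy matching, and the function name are illustrative assumptions, not something the slides specify.

```python
def score_changes(target, hypothesized, tolerance=0.5):
    """Score hypothesized change points against target change points.

    A hypothesized change is a hit if it lies within +/- tolerance
    (seconds) of a target change that has not been matched yet.
    """
    unmatched = sorted(target)
    hits = 0
    for h in sorted(hypothesized):
        # Greedily match to the closest still-unmatched target change.
        candidates = [t for t in unmatched if abs(t - h) <= tolerance]
        if candidates:
            unmatched.remove(min(candidates, key=lambda t: abs(t - h)))
            hits += 1
    false_alarms = len(hypothesized) - hits
    misses = len(target) - hits
    precision = hits / len(hypothesized) if hypothesized else 0.0
    recall = hits / len(target) if target else 0.0
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"miss": misses, "false_alarm": false_alarms,
            "P": precision, "R": recall, "F": f_score}


# Hypothetical change times in seconds.
result = score_changes([5.0, 12.0, 20.0], [5.2, 9.0, 19.8, 25.0])
```

Sweeping a detection threshold and recomputing the miss and false-alarm counts at each setting gives the points of an ROC-style curve.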
Speaker segmentation (Fixed-size analysis window approach)
• Speaker segmentation using a fixed-size analysis window (Siegler et al., 1997)
  • Fixed-size windows slide along the data stream; a distance is computed at each position, giving a distance curve
  • Distance measure of two segments
    • Kullback-Leibler (KL) distance (Siegler et al., 1997)
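A distance curve of this kind can be sketched as follows. The example assumes two adjacent windows, each modeled by a diagonal-covariance Gaussian, and uses the symmetric form KL(X||Y) + KL(Y||X); the window length, step, and function names are illustrative choices rather than settings taken from the cited work.

```python
import numpy as np


def kl2_diag(x, y, eps=1e-6):
    """Symmetric KL distance between diagonal Gaussians fitted to x and y."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    vx, vy = x.var(axis=0) + eps, y.var(axis=0) + eps
    d2 = (mx - my) ** 2
    # KL(X||Y) + KL(Y||X); the log-variance terms cancel in the sum.
    return 0.5 * np.sum(vx / vy + vy / vx - 2.0 + d2 * (1.0 / vx + 1.0 / vy))


def distance_curve(features, win=100, step=10):
    """Slide two adjacent windows of `win` frames over `features`.

    Returns (centers, distances), where centers[k] is the frame index
    separating the left and right windows at position k.
    """
    centers, dists = [], []
    for c in range(win, len(features) - win + 1, step):
        centers.append(c)
        dists.append(kl2_diag(features[c - win:c], features[c:c + win]))
    return np.array(centers), np.array(dists)


# Synthetic stream: 300 frames of one "speaker", then 300 of another.
rng = np.random.default_rng(0)
stream = np.vstack([rng.normal(0.0, 1.0, (300, 4)),
                    rng.normal(3.0, 1.0, (300, 4))])
centers, dists = distance_curve(stream)
peak = int(centers[np.argmax(dists)])
```

Candidate change points are then usually taken at sufficiently prominent local maxima of the curve; how the peaks are picked is a separate design choice.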
Speaker segmentation (Fixed-size analysis window approach)
    • SVM training error (王駿發 et al., 2005)
      • More overlap between X and Y: larger training error
      • Larger distance: less similarity
[Figure: two examples of segments X and Y with different amounts of overlap]
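Speaker segmentation (Fixed-size analysis window approach)
    • ΔBIC (S. Chen et al., 1998; P. Delacourt et al., 2000)
• Bayesian information criterion (BIC) for model selection:
  • Data set: $X = \{x_1, \dots, x_N\}$
  • Candidate models: $\mathcal{M} = \{M_1, \dots, M_K\}$
  • Model selection by BIC: choose the model with the largest BIC value, $M^{*} = \arg\max_{M_i} \mathrm{BIC}(M_i)$, where
    $$\mathrm{BIC}(M_i) = \log L(X \mid M_i) - \frac{\lambda}{2}\,\#(M_i)\,\log N$$
  • $L(X \mid M_i)$: the maximum likelihood of X for model $M_i$
  • $\#(M_i)$: the number of parameters of $M_i$
  • λ = 1 in the BIC theory, but it is usually tuned to trade off the two error types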
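Speaker segmentation (Fixed-size analysis window approach)
• Use BIC as an inter-segment distance
  • Given two audio segments represented by feature vectors $X = \{x_1, \dots, x_{N_X}\}$ and $Y = \{y_1, \dots, y_{N_Y}\}$, the two segments can be judged as coming from the same or from different acoustic conditions via the following hypothesis test:
    $$H_0: X \cup Y \sim \mathcal{N}(\mu, \Sigma) \qquad H_1: X \sim \mathcal{N}(\mu_X, \Sigma_X),\; Y \sim \mathcal{N}(\mu_Y, \Sigma_Y)$$
    $$\Delta\mathrm{BIC} = \mathrm{BIC}(H_1) - \mathrm{BIC}(H_0) = \frac{N}{2}\log|\Sigma| - \frac{N_X}{2}\log|\Sigma_X| - \frac{N_Y}{2}\log|\Sigma_Y| - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N$$
    where $N = N_X + N_Y$ and d is the feature dimension
  • X and Y are judged as coming from the same acoustic condition if ΔBIC < 0
  • Examples: X and Y from different acoustic conditions give ΔBIC ≥ 0; X and Y from the same acoustic condition give ΔBIC < 0
[Figure: two pairs of adjacent segments, Seg X and Seg Y, illustrating the two cases]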
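The ΔBIC test between two segments can be sketched in a few lines. The example assumes each hypothesis is modeled with full-covariance Gaussians estimated by maximum likelihood, which is the usual setting in the cited BIC work; the small ridge added to the covariance and the function name are illustrative additions.

```python
import numpy as np


def delta_bic(x, y, lam=1.0):
    """Delta BIC = BIC(two Gaussians) - BIC(one Gaussian).

    x, y: arrays of shape (frames, dims). A positive value suggests the
    segments come from different acoustic conditions, a negative value
    suggests the same condition.
    """
    z = np.vstack([x, y])
    n, d = z.shape

    def logdet(a):
        # Maximum-likelihood covariance with a tiny ridge for stability.
        cov = np.cov(a, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    gain = 0.5 * (n * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y))
    # Extra parameters of the two-Gaussian model: one mean and one covariance.
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty


rng = np.random.default_rng(0)
seg_x = rng.normal(0.0, 1.0, (200, 3))
seg_same = rng.normal(0.0, 1.0, (200, 3))   # same condition as seg_x
seg_diff = rng.normal(4.0, 1.0, (200, 3))   # shifted mean: different condition
```

Raising `lam` makes the test more reluctant to declare a change, which trades false alarms for miss detections.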
Speaker segmentation (Variable-size analysis window approach)
• Speaker segmentation using a variable-size analysis window
• Bottom-up detection using BIC (S. Chen and P. Gopalakrishnan, 1998; M. Cettolo et al., 2005)
  • The bottom-up detection process on an audio stream
[Figure: an audio stream divided into Seg 1 to Seg 4; each change point is found by one-change-point detection]
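Speaker segmentation (Variable-size analysis window approach)
• One-change-point detection using BIC
  • ΔBIC is calculated at each feature vector, with X the part of the window before that vector and Y the part after it
[Figure: a window of feature vectors split into X and Y at successive positions, with the resulting ΔBIC curve]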
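One-change-point detection can be sketched as a scan over candidate split positions. The example assumes full-covariance Gaussians and declares a change at the position with the largest ΔBIC, provided that value is positive; this decision rule, the `margin` that keeps both sides long enough to estimate a covariance, and the function names are assumptions made for illustration.

```python
import numpy as np


def delta_bic_at(z, t, lam=1.0):
    """Delta BIC for splitting window z into X = z[:t] and Y = z[t:]."""
    n, d = z.shape

    def n_logdet(a):
        cov = np.cov(a, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        return len(a) * np.linalg.slogdet(cov)[1]

    gain = 0.5 * (n_logdet(z) - n_logdet(z[:t]) - n_logdet(z[t:]))
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty


def detect_one_change(z, margin=30, lam=1.0):
    """Return the frame index of the single most likely change, or None.

    Only positions leaving at least `margin` frames on each side are
    tested. None is returned when no position has a positive Delta BIC.
    """
    z = np.asarray(z, dtype=float)
    best_t, best_val = None, 0.0
    for t in range(margin, len(z) - margin + 1):
        val = delta_bic_at(z, t, lam)
        if val > best_val:
            best_t, best_val = t, val
    return best_t


rng = np.random.default_rng(1)
window = np.vstack([rng.normal(0.0, 1.0, (150, 2)),
                    rng.normal(3.0, 1.0, (150, 2))])
change = detect_one_change(window)
no_change = detect_one_change(rng.normal(0.0, 1.0, (300, 2)))
```

In a bottom-up scheme this routine would be applied to a window that grows while no change is found and restarts just after each detected change.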
Speaker segmentation (Variable-size analysis window approach)
• Top-down detection using BIC (C. H. Wu and C. H. Hsieh, 2006; M. Cettolo et al., 2005)
  • The top-down detection process for an audio stream
[Figure: an audio stream divided into Seg 1 to Seg 4 by multiple-change detection]
Speaker segmentation (Variable-size analysis window approach)
• Multiple-change detection using BIC
  • Assumption: different segments arise from different Gaussian processes
  • Multiple-change detection: search the solution space for the hypothesis H that has the largest BIC value
  • Intuitively, pr(X|H0) < pr(X|H1) < pr(X|H2) < pr(X|H3), but BIC(X|H2) > BIC(X|H3) > BIC(X|H1) > BIC(X|H0)
  • Exhaustive search
[Figure: audio stream X divided into Seg 1 to Seg 4, with candidate hypotheses H0, H1, H2, and H3]
Speaker segmentation (Variable-size analysis window approach)
  • Top-down, hierarchical search (C. H. Wu and C. H. Hsieh, 2006)
    • A suboptimal search
  • Dynamic programming (M. Cettolo et al., 2005)
    • An optimal search
[Figure: audio stream X (Seg 1 to Seg 4) split in successive passes, Pass 1 and Pass 2, until the search terminates]
Speaker clustering (Problem formulation)
• Problem formulation
  • Given N speech utterances from P unknown speakers, partition these utterances into M clusters such that M = P and each cluster consists exclusively of utterances from one speaker
Speaker clustering (Problem formulation)
• Cluster purity
  • The probability that, if we pick an utterance from a cluster twice at random with replacement, both selected utterances are from the same speaker:
    $$\rho_m = \sum_{p=1}^{P} \frac{n_{mp}^2}{n_{m*}^2}$$
  • Average purity over all N utterances: $\bar{\rho} = \frac{1}{N}\sum_{m=1}^{M} n_{m*}\,\rho_m$
  • Increases as the number of clusters increases
  • Notation
    • P: total no. of speakers involved
    • M: total no. of clusters
    • $\rho_m$: purity of the m-th cluster
    • $n_{m*}$: no. of utterances in the m-th cluster
    • $n_{*p}$: no. of utterances from the p-th speaker
    • $n_{mp}$: no. of utterances in the m-th cluster that are from the p-th speaker
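Cluster purity follows directly from the counts defined above: for each cluster, sum the squared per-speaker counts and divide by the squared cluster size. The sketch below assumes the clustering result and the true speakers are given as two parallel label lists, and it averages the per-cluster purities weighted by cluster size; the function name and toy labels are illustrative.

```python
from collections import Counter


def cluster_purity(cluster_ids, speaker_ids):
    """Return (per-cluster purity dict, size-weighted average purity)."""
    n_mp = Counter(zip(cluster_ids, speaker_ids))   # utterances per (cluster, speaker)
    n_m = Counter(cluster_ids)                      # utterances per cluster
    purity = {
        m: sum(c * c for (mm, _), c in n_mp.items() if mm == m) / n_m[m] ** 2
        for m in n_m
    }
    average = sum(n_m[m] * purity[m] for m in n_m) / len(cluster_ids)
    return purity, average


# Hypothetical labels: cluster 0 mixes speakers a and b, cluster 1 is pure.
clusters = [0, 0, 0, 0, 1, 1]
speakers = ["a", "a", "a", "b", "b", "b"]
per_cluster, average = cluster_purity(clusters, speakers)
```

Putting every utterance in its own cluster gives an average purity of 1, which illustrates why purity alone cannot be used to choose the number of clusters.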
Speaker clustering (Problem formulation)
• Rand index
  • Two error types:
    • Type I: the number of utterance pairs (with replacement) in the same cluster but from different speakers
    • Type II: the number of utterance pairs (with replacement) from the same speaker but in different clusters
    $$R = \underbrace{\sum_{m=1}^{M} n_{m*}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2}_{\text{Type I error}} + \underbrace{\sum_{p=1}^{P} n_{*p}^2 - \sum_{m=1}^{M}\sum_{p=1}^{P} n_{mp}^2}_{\text{Type II error}}$$
  • $\sum_m n_{m*}^2$: the number of utterance pairs in the same cluster
  • $\sum_p n_{*p}^2$: the number of utterance pairs from the same speaker
  • $\sum_m \sum_p n_{mp}^2$: the number of utterance pairs that are from the same speaker and in the same cluster
  • Reaches its minimum only when M = P
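The two error counts of the Rand index can be computed from the same label lists used for purity. The sketch assumes pairs are counted with replacement and with order, so a cluster of n utterances contributes n squared pairs; the function name and toy labels are illustrative.

```python
from collections import Counter


def rand_index_errors(cluster_ids, speaker_ids):
    """Return (type_I, type_II, rand_index) for a clustering result.

    type_I : pairs in the same cluster but from different speakers
    type_II: pairs from the same speaker but in different clusters
    """
    n_m = Counter(cluster_ids)
    n_p = Counter(speaker_ids)
    n_mp = Counter(zip(cluster_ids, speaker_ids))
    # Pairs that share both the cluster and the speaker.
    same_both = sum(c * c for c in n_mp.values())
    type_one = sum(c * c for c in n_m.values()) - same_both
    type_two = sum(c * c for c in n_p.values()) - same_both
    return type_one, type_two, type_one + type_two


# Hypothetical labels: one utterance of speaker b is misplaced in cluster 0.
clusters = [0, 0, 0, 0, 1, 1]
speakers = ["a", "a", "a", "b", "b", "b"]
type_one, type_two, rand = rand_index_errors(clusters, speakers)
```

Unlike purity, this measure penalizes over-splitting through the Type II term, so it is zero only for a perfect clustering.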
Speaker clustering (Hierarchical agglomerative clustering)
• Hierarchical agglomerative clustering (S. Chen and P. Gopalakrishnan, 1998; Barras et al., 2006)
  • Distance of two clusters: ΔBIC
  • Stopping criteria:
    • Local BIC
    • Global BIC
[Figure: utterances X_1, X_2, …, X_N merged step by step into clusters, for example X_13 with X_19]
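Agglomerative clustering with ΔBIC as the cluster distance can be sketched as a greedy merge loop. The example assumes full-covariance Gaussian cluster models and a local stopping rule that halts when even the closest pair has ΔBIC ≥ 0; this reading of "local BIC", the function names, and the synthetic data are assumptions for illustration.

```python
import numpy as np


def delta_bic(x, y, lam=1.0):
    """Delta BIC between two sets of frames; negative favors merging."""
    z = np.vstack([x, y])
    n, d = z.shape

    def n_logdet(a):
        cov = np.cov(a, rowvar=False, bias=True) + 1e-6 * np.eye(d)
        return len(a) * np.linalg.slogdet(cov)[1]

    gain = 0.5 * (n_logdet(z) - n_logdet(x) - n_logdet(y))
    return gain - 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)


def agglomerative_clustering(segments, lam=1.0):
    """Greedily merge the closest pair until no pair has Delta BIC < 0.

    segments: list of (frames, dims) arrays. Returns a list of clusters,
    each given as a list of indices into `segments`.
    """
    clusters = [np.asarray(s, dtype=float) for s in segments]
    members = [[i] for i in range(len(segments))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                val = delta_bic(clusters[i], clusters[j], lam)
                if best is None or val < best[0]:
                    best = (val, i, j)
        val, i, j = best
        if val >= 0:          # closest pair already looks like two speakers
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        members[i] = members[i] + members[j]
        del clusters[j], members[j]
    return members


rng = np.random.default_rng(0)
segments = [rng.normal(mu, 1.0, (100, 2)) for mu in (0.0, 5.0, 0.0, 5.0)]
result = agglomerative_clustering(segments)
```

A global BIC criterion would instead score the whole partition after each merge and keep the partition with the best score.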
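Speaker clustering (Optimization-oriented approaches)
• Optimization-oriented approaches
• Maximum purity clustering (W. H. Tsai et al., IEEE Trans. ASLP, 2007)
  • For a given number of clusters M and a set of cluster indices $H = [h_1, h_2, \dots, h_N]$ for N utterances $X_1, X_2, \dots, X_N$, the average cluster purity is
    $$\bar{\rho}(H) = \frac{1}{N}\sum_{m=1}^{M}\frac{1}{n_m}\sum_{i=1}^{N}\sum_{j=1}^{N}\delta(h_i, m)\,\delta(h_j, m)\,\delta(o_i, o_j)$$
    where $o_i$ is the true speaker index of utterance $X_i$ ($1 \le o_i \le P$), $n_m$ is the number of utterances in the m-th cluster, and $\delta(\cdot,\cdot)$ is the Kronecker delta
  • $\delta(o_i, o_j)$ (the ground truth) is unknown and needs to be estimated
    • $S(X_i, X_j)$: similarity between utterances $X_i$ and $X_j$
    • $R[S(X_i, X_j)]$: rank of the inter-utterance similarity $S(X_i, X_j)$ among $S(X_i, X_1), S(X_i, X_2), \dots, S(X_i, X_N)$ in descending order
    • $i^{*}$: index of the utterance most similar to $X_i$, i.e., $R[S(X_i, X_{i^{*}})] = 2$
    • $\delta(o_i, o_j)$ is approximated by …
[Figure: the m-th cluster, with n_m = 4]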
Speaker clustering (Optimization-oriented approaches)
  • Let $\hat{\rho}(H)$ denote the estimated purity. Use a genetic algorithm to find $H^{*}$ such that $H^{*} = \arg\max_{H} \hat{\rho}(H)$
  • Use BIC to determine the number of clusters
• Minimum Rand index clustering (W. H. Tsai and H. M. Wang, Proc. ICASSP, 2007): performs the grouping of utterances and determines the number of groups within the optimization process
  • $\delta(o_i, o_j)$ (the ground truth) is unknown and needs to be estimated
Speaker clustering (Optimization-oriented approaches)
  • Use a genetic algorithm to find $H^{*}$ that minimizes the estimated Rand index
    $$\hat{R}(H^{(M)}) = \sum_{i=1}^{N}\sum_{j=1}^{N}\delta(h_i^{(M)}, h_j^{(M)}) + \Omega - 2\sum_{i=1}^{N}\sum_{j=1}^{N}\delta(h_i^{(M)}, h_j^{(M)})\,\hat{\delta}(o_i, o_j)$$
    where $\Omega = \sum_{p} n_{*p}^2$ does not depend on H
  • $\delta(o_i, o_j)$ is approximated by $\hat{\delta}(o_i, o_j)$, a normalized inter-utterance similarity based on the generalized likelihood ratio, where $S_{max}$ is the maximum among the similarities $S(X_i, X_j)$, $i \ne j$
Two leading systems
• LIMSI’s system (Barras et al., 2006)
  • Initial speech detection: removes only long regions without speech, such as silence, music, and noise, using GMMs
  • Fixed-size sliding window segmentation
  • Boundary refinement
  • Clustering that uses ΔBIC to measure the inter-cluster similarity
  • Boundary refinement: aligns the change boundaries to silence portions
  • Clustering that uses the cross-likelihood ratio to measure the inter-cluster similarity, where $M_i$ is a MAP-adapted GMM
  • Filtering out short-duration silence segments that were not removed in the initial speech detection step
Two leading systems
• Cambridge’s system (Sinha et al., 2005)
  • SD: speech detection
  • CPD: change point detection
  • IAC: iterative agglomerative clustering
  • Speaker identification (SID) clustering: mean-only MAP adaptation is applied to each cluster from the appropriate gender/bandwidth UBM, and the cross-likelihood ratio (CLR) is computed between any two given clusters
Reference
• C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, “Multistage Speaker Diarization of Broadcast News,” IEEE Trans. Audio, Speech, and Language Processing, Special Issue on Rich Transcription, vol. 14, no. 5, pp. 1505-1512, 2006.
• NIST 2003 Spring, http://www.nist.gov/speech/tests/rt/rt2003/spring/
• R. Sinha, S. E. Tranter, M. J. F. Gales, and P. C. Woodland, “The Cambridge University March 2005 Speaker Diarization System,” in Proc. INTERSPEECH, 2005.
• S. E. Tranter and D. A. Reynolds, “An Overview of Automatic Speaker Diarisation Systems,” IEEE Trans. Audio, Speech, and Language Processing, Special Issue on Rich Transcription, 2006.
• S. Chen and P. Gopalakrishnan, “Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion,” in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
• C. H. Wu and C. H. Hsieh, “Multiple Change-Point Audio Segmentation and Classification Using an MDL-based Gaussian Model,” IEEE Trans. Audio, Speech, and Language Processing, 2006.
• M. Cettolo, M. Vescovi, and R. Rizzi, “Evaluation of BIC-based Algorithms for Audio Segmentation,” Computer Speech and Language, 2005.
• M. Siegler, U. Jain, B. Raj, and R. Stern, “Automatic Segmentation, Classification and Clustering of Broadcast News Audio,” in Proc. DARPA Speech Recognition Workshop, 1997.
• P. Delacourt and C. J. Wellekens, “DISTBIC: A Speaker-based Segmentation for Audio Data Indexing,” Speech Communication, vol. 32, pp. 111-126, 2000.
• 王駿發, 林博川, 王家慶, 宋豪靜, “A Novel Speaker Change Detection Algorithm Based on Support Vector Machines” (in Chinese), in Proc. ROCLING, 2005.
• W.-H. Tsai, S.-S. Cheng, and H.-M. Wang, “Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation,” IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1461-1474, May 2007.
• W.-H. Tsai and H.-M. Wang, “Speaker Clustering Based on Minimum Rand Index,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), April 2007.