Intra-Class Variability Modeling for Speech Processing
Dr. Hagai Aronowitz
IBM Haifa Research Lab
Presentation is available online at: http://aronowitzh.googlepages.com/
Speech Classification: Proposed Framework
• Given labeled training segments from class + and class –, classify unlabeled test segments
• Classification framework:
  • Represent speech segments in segment-space
  • Learn a classifier in segment-space (SVMs, NNs, Bayesian classifiers, …)
Outline
1. Introduction to GMM based classification
2. Mapping speech segments into segment space
3. Intra-class variability modeling
4. Speaker diarization
5. Summary
Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995]
• Train a universal background model (UBM) GMM using EM
• For every target speaker S: train a GMM GS by applying MAP-adaptation
• Estimate Pr(yt|S), assuming frame independence across the frames yt
(Figure: UBM means μ1, μ2, μ3 and MAP-adapted speaker GMMs Q1 (speaker #1) and Q2 (speaker #2) in the R26 MFCC feature space)
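A minimal sketch of this recipe (not the exact system from the deck; the toy 4-dimensional features, 8 Gaussians, and Reynolds-style relevance factor r=16 are all assumptions): train the UBM with EM, MAP-adapt its means to the enrollment data, and score a test segment by the average per-frame log-likelihood ratio under the frame-independence assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "MFCC" frames: background data, speaker enrollment data, test data.
bg = rng.normal(0.0, 1.0, size=(2000, 4))
enroll = rng.normal(0.8, 1.0, size=(300, 4))
test = rng.normal(0.8, 1.0, size=(200, 4))

# 1) Train the universal background model (UBM) with EM.
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(bg)

# 2) MAP-adapt the UBM means towards the enrollment data
#    (means only, relevance factor r = 16).
r = 16.0
post = ubm.predict_proba(enroll)                        # (frames, components)
n_g = post.sum(axis=0)                                  # soft counts per Gaussian
ex = post.T @ enroll / np.maximum(n_g, 1e-10)[:, None]  # posterior means
alpha = (n_g / (n_g + r))[:, None]

speaker = GaussianMixture(n_components=8, covariance_type="diag")
speaker.weights_ = ubm.weights_
speaker.covariances_ = ubm.covariances_
speaker.precisions_cholesky_ = ubm.precisions_cholesky_
speaker.means_ = alpha * ex + (1 - alpha) * ubm.means_

# 3) Score: average per-frame log-likelihood ratio (frame independence).
llr = speaker.score(test) - ubm.score(test)
print(f"average LLR: {llr:.3f}")
```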
GMM-Based Algorithm: Analysis
• Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency
• GMM scoring is inefficient – linear in the length of the audio
• GMM scoring does not support indexing
Mapping Speech Segments into Segment Space: GMM Scoring Approximation 1/4
Definitions:
• X: training session for target speaker; Y: test session
• Q: GMM trained for X; P: GMM trained for Y
Goal: compute Pr(Y|Q) using GMMs P and Q only
Motivation:
• Efficient speaker recognition and indexing
• More accurate modeling
Mapping Speech Segments into Segment Space: GMM Scoring Approximation 2/4
Negative cross entropy (1)
Approximating the cross entropy between two GMMs:
• Matching-based lower bound [Aronowitz 2004]
• Unscented-transform based approximation [Goldberger & Aronowitz 2005]
• Other options in [Hershey 2007]
Mapping Speech Segments into Segment Space: GMM Scoring Approximation 3/4
Matching-based approximation (2)
Assuming weights and covariance matrices are speaker independent (+ some approximations): (3)
Mapping T is induced: (4)
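As a hedged sketch of the induced mapping T, using the standard GMM-supervector form (each component mean scaled by the square root of its weight and the inverse square root of the shared covariance) — an assumption here, not necessarily the deck's exact equation (4): with weights and covariances shared, the squared Euclidean distance between two supervectors tracks the cross-likelihood term.

```python
import numpy as np

def supervector(means, weights, variances):
    """Map a GMM (diagonal covariances) to a supervector by stacking each
    mean scaled by sqrt(weight) / sqrt(variance).  With weights and
    covariances shared across speakers, squared Euclidean distance between
    two supervectors approximates the (shifted, negated) GMM cross-score."""
    scaled = np.sqrt(weights)[:, None] * means / np.sqrt(variances)
    return scaled.ravel()

# Two toy 3-Gaussian GMMs in R^2 sharing weights and covariances.
w = np.array([0.5, 0.3, 0.2])
var = np.ones((3, 2))
mu_q = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
mu_p = mu_q + 0.1                       # a nearby session's adapted means

t_q, t_p = supervector(mu_q, w, var), supervector(mu_p, w, var)
d2 = np.sum((t_q - t_p) ** 2)
print(f"squared distance in supervector space: {d2:.4f}")
```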
Mapping Speech Segments into Segment Space: GMM Scoring Approximation 4/4
Results: figure and table taken from H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, IEEE Trans. on Audio, Speech & Language Processing, September 2007.
Other Mapping Techniques
• Anchor modeling projection [Sturim 2001]: efficient but inaccurate
• MLLR transforms [Stolcke 2005]: accurate but inefficient
• Kernel-PCA-based mapping [Aronowitz 2007c]: accurate & efficient
  Given a set of objects and a kernel function (a dot product between each pair of objects), finds a mapping of the objects into Rn which preserves the kernel function.
Intra-Class Variability Modeling [Aronowitz 2005b]: Introduction
• The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
  • channel, noise
  • language
  • stress, emotion, aging
• The frame independence assumption does not hold in these cases! (1)
• Instead, we can use a more relaxed assumption: (2)
  which leads to: (3)
Old vs. New Generative Models
• Old model: a speaker is a GMM; the frame sequence is generated from it, frames independently
• New model: a speaker is a PDF over GMM space; a session GMM is generated from it, and the frame sequence is generated from the session GMM
Session-GMM Space
(Figure: session GMMs as points in session-GMM space – the GMMs for sessions A and B of speaker #1 cluster together, apart from the sessions of speakers #2 and #3)
Modeling in Session-GMM Space 1/2
Recall the mapping T induced by the GMM approximation analysis:
• The image of a session under T is called a supervector
• A speaker is modeled by a multivariate normal distribution in supervector space: (3)
• A typical dimension of the covariance matrix is 50,000×50,000
• The covariance matrix is estimated robustly using PCA + regularization: it is assumed to be a low-rank matrix with an additional non-zero (noise) diagonal
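A small NumPy sketch of that covariance estimate, with toy sizes (50 dimensions instead of ~50,000; the rank of 5 and the noise floor are assumptions): keep the top eigenvectors of the delta-supervector covariance and add a noise diagonal so the modeled covariance stays full rank.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, rank = 50, 5            # real supervectors are ~50,000-dim; toy here

# Delta supervectors: sessions minus their per-speaker means (cf. next slide),
# synthesized as low-rank structure plus a small isotropic noise.
deltas = rng.normal(size=(200, rank)) @ rng.normal(size=(rank, dim))
deltas += 0.1 * rng.normal(size=(200, dim))

# PCA: keep the top-`rank` eigenpairs of the empirical covariance.
cov = np.cov(deltas, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)          # ascending eigenvalues
top_val, top_vec = eigval[-rank:], eigvec[:, -rank:]

# Regularize: low-rank part plus a non-zero noise diagonal.
noise = np.maximum(eigval[:-rank].mean(), 1e-6)
cov_model = (top_vec * top_val) @ top_vec.T + noise * np.eye(dim)

# The modeled covariance is full rank, so its Gaussian density is well defined.
sign, logdet = np.linalg.slogdet(cov_model)
print(sign, logdet)
```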
Modeling in Session-GMM Space 2/2: Estimating the Covariance Matrix
(Figure: sessions of speakers #1, #2 and #3 in supervector space, and the same sessions after subtracting per-speaker means in delta supervector space, where the covariance is estimated)
Experimental Setup
Datasets:
• The covariance matrix is estimated from the NIST-2006-SRE corpus
• Evaluation is done on the NIST-2004-SRE corpus
System description:
• ETSI MFCC (13-cep + 13-delta-cep)
• Energy-based voice activity detector
• Feature warping
• 2048 Gaussians
• Target models are adapted from a GI-UBM
• ZT-norm score normalization
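ZT-norm, listed last above, composes Z-norm (standardize a model's score against impostor test segments) with T-norm (standardize a segment's score against a cohort of impostor models). A sketch with synthetic scores (cohort sizes and score values are assumptions):

```python
import numpy as np

def znorm(score, imp_segment_scores):
    # Z-norm: standardize a model's score using that model's scores
    # on impostor test segments.
    return (score - imp_segment_scores.mean()) / imp_segment_scores.std()

def tnorm(score, imp_model_scores):
    # T-norm: standardize a segment's score using the same segment's
    # scores under a cohort of impostor models.
    return (score - imp_model_scores.mean()) / imp_model_scores.std()

rng = np.random.default_rng(2)
raw = 3.0                                   # raw verification score
z = znorm(raw, rng.normal(0.0, 1.0, 100))   # Z-norm first ...
zt = tnorm(z, rng.normal(0.0, 1.0, 100))    # ... then T-norm: ZT-norm
print(f"raw={raw:.2f}  zt={zt:.2f}")
```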
Results 38% reduction in EER
Other Modeling Techniques
• NAP+SVMs [Campbell 2006]
• Factor Analysis [Kenny 2005]
• Kernel-PCA [Aronowitz 2007c]
Kernel-PCA based algorithm:
• Model each supervector as s + u, with s ∈ S (common speaker subspace) and u ∈ U (speaker-unique subspace)
• S is spanned by a set of development supervectors (700 speakers)
• U is the orthogonal complement of S in supervector space
• Intra-speaker variability is modeled separately in S and in U
• U was found to be more discriminative than S
• EER was reduced by 44% compared to the baseline GMM
Kernel-PCA Based Modeling
(Figure: sessions x and y mapped by the kernel-induced mapping into feature space as f(x), f(y); K-PCA on the anchor sessions decomposes each into Tx, Ty in the common speaker subspace (Rn) and ux, uy in the speaker-unique subspace)
Trainable Speaker Diarization [Aronowitz 2007d]
Goals:
• Detect speaker changes – “speaker segmentation”
• Cluster speaker segments – “speaker clustering”
Motivation for new method: current algorithms do not exploit available training data (besides tuning thresholds, etc.)
Method: explicitly model inter-segment intra-speaker variability from labeled training data, and use it in the metric employed by the change-detection / clustering algorithms.
Experimental Setup
• Dev data: BNAD05 (5hr) – Arabic, broadcast news
• Eval data: BNAT05 – Arabic, broadcast news (207 target models, 6756 test segments)
• Speaker recognition on pairs of 3s segments
Speaker Diarization System & Experiments
Speaker change detection:
• 2 adjacent sliding windows (3s each)
• Speaker verification scoring + normalization
Speaker clustering:
• Speaker verification scoring + normalization
• Bottom-up clustering
Speaker Error Rate (SER) on BNAT05:
• Anchor modeling (baseline): 12.9%
• Kernel-PCA based method: 7.9%
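The bottom-up clustering step can be sketched with off-the-shelf agglomerative clustering; here plain Euclidean distance between toy segment vectors stands in for the normalized speaker-verification score used in the deck.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(3)

# Toy segment "supervectors" for two speakers (5 segments each).
spk1 = rng.normal(0.0, 0.3, size=(5, 8))
spk2 = rng.normal(2.0, 0.3, size=(5, 8))
segments = np.vstack([spk1, spk2])

# Bottom-up (agglomerative) clustering: repeatedly merge the closest
# clusters, then cut the tree at the desired number of speakers.
tree = linkage(segments, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```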
Summary 1/2 • A method for mapping speech segments into a GMM supervector space was described • Intra-speaker inter-session variability is modeled in GMM supervector space • Speaker recognition • EER was reduced by 38% on the NIST-2004 SRE • A corresponding kernel-PCA based approach reduces EER by 44% • Speaker diarization • SER for speaker diarization was reduced by 39%.
Summary 2/2 Algorithms based on the proposed framework • Speaker recognition[Aronowitz 2005b; Aronowitz 2007c] • Speaker diarization (“who spoke when”) [Aronowitz 2007d] • VAD (voice activity detection) [Aronowitz 2007a] • Language identification [Noor & Aronowitz 2006] • Gender identification [Bocklet 2008] • Age detection [Bocklet 2008] • Channel/bandwidth classification [Aronowitz 2007d]
Bibliography 1/2 [1] D. A. Reynolds et al., “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communications, 17, 91-108. [2] D. E. Sturim et al., “Speaker indexing in large audio databases using anchor models”, in Proc. ICASSP, 2001. [3] H. Aronowitz, D. Burshtein, A. Amir, “Speaker indexing in audio archives using test utterance Gaussian mixture modeling”, in Proc. ICSLP, 2004. [4] H. Aronowitz, D. Burshtein, A. Amir, “A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification”, in Proc. ICASSP, 2005. [5] P. Kenny et al., “Factor Analysis Simplified”, in Proc. ICASSP, 2005. [6] H. Aronowitz, D. Irony, D. Burshtein, “Modeling Intra-Speaker Variability for Speaker Recognition”, in Proc. Interspeech, 2005. [7] J. Goldberger, H. Aronowitz, “A distance measure between GMMs based on the unscented transform and its application to speaker recognition”, in Proc. Interspeech, 2005. [8] H. Aronowitz, D. Burshtein, “Efficient Speaker Identification and Retrieval”, in Proc. Interspeech, 2005.
Bibliography 2/2 [9] A. Stolcke et al., “MLLR Transforms as Features in Speaker Recognition”, in Proc. Interspeech, 2005. [10] E. Noor, H. Aronowitz, “Efficient Language Identification using Anchor Models and Support Vector Machines”, in Proc. ISCA Odyssey Workshop, 2006. [11] W. M. Campbell et al., “SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation”, in Proc. ICASSP, 2006. [12] H. Aronowitz, “Segmental modeling for audio segmentation”, in Proc. ICASSP, 2007. [13] J. R. Hershey, P. A. Olsen, “Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models”, in Proc. ICASSP, 2007. [14] H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007. [15] H. Aronowitz, “Speaker Recognition using Kernel-PCA and Intersession Variability Modeling”, in Proc. Interspeech, 2007. [16] H. Aronowitz, “Trainable Speaker Diarization”, in Proc. Interspeech, 2007. [17] T. Bocklet et al., “Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines”, in Proc. ICASSP, 2008.
Thanks! Presentation is available online at: http://aronowitzh.googlepages.com/
Kernel-PCA Based Mapping 2/5
Goals:
• Map sessions into a dot-product feature space (x → f(x), via the kernel trick)
• Model in feature space
(Figure: sessions x, y and the anchor sessions in session space mapped to f(x), f(y) in feature space)
Kernel-PCA Based Mapping 3/5
Given a kernel K and n anchor sessions, find an orthonormal basis for their span in feature space.
Method:
• Compute the eigenvectors of the centralized kernel matrix ki,j = K(Ai, Aj)
• Normalize the eigenvectors by the square roots of the corresponding eigenvalues → {vi}
• {vi} is the requested basis
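The method above, sketched in NumPy (an RBF kernel over random toy vectors is a stand-in for the deck's session kernel): centralize the kernel matrix, eigendecompose it, and normalize eigenvectors by square-root eigenvalues; mapped sessions then preserve the centered kernel, as the slide requires.

```python
import numpy as np

rng = np.random.default_rng(4)
anchors = rng.normal(size=(20, 6))        # n toy anchor sessions
n = len(anchors)

def kernel(a, b):
    # RBF kernel as a stand-in for the session kernel of the deck.
    return np.exp(-0.5 * np.sum((a - b) ** 2))

K = np.array([[kernel(a, b) for b in anchors] for a in anchors])

# Centralize the kernel matrix: Kc = J K J with J = I - (1/n) 11^T.
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# Eigenvectors normalized by square-root eigenvalues -> basis {v_i}.
eigval, eigvec = np.linalg.eigh(Kc)
keep = eigval > 1e-8                      # drop the null direction
V = eigvec[:, keep] / np.sqrt(eigval[keep])

def kpca_map(x):
    # Map a session x into R^n: centered kernel against anchors, then V^T.
    kx = np.array([kernel(x, a) for a in anchors])
    return V.T @ (J @ (kx - K.mean(axis=1)))

# The mapping preserves the (centered) kernel between anchor sessions.
err = abs(kpca_map(anchors[0]) @ kpca_map(anchors[1]) - Kc[0, 1])
print(f"kernel preservation error: {err:.2e}")
```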
Kernel-PCA Based Mapping 4/5
• Common speaker subspace S – speaker-unique subspace U
• Given a session x, f(x) may be uniquely represented as a sum of a component in S and a component in U
• T is a mapping x → Rn with the property that it preserves the kernel function (cf. slide “Other Mapping Techniques”)
Kernel-PCA Based Mapping 5/5
(Figure: sessions x, y in session space mapped to f(x), f(y) in feature space; K-PCA on the anchor sessions yields Tx, Ty in the common speaker subspace (Rn) and ux, uy in the speaker-unique subspace)
Modeling in Segment-GMM Supervector Space
(Figure: the frame sequences of segments #1, #2, …, #n are mapped to points in segment-GMM supervector space, where speech, silence, and music form separate clusters)
Segmental Modeling for Audio Segmentation
Goal: segment audio accurately and robustly into speech / silence / music segments.
Novel idea: acoustic modeling is usually done on a frame basis, while segmentation/classification is usually done on a segment basis (using smoothing) – why not explicitly model whole segments?
Note: speaker, noise, music-context, channel (etc.) are constant during a segment.
LID in Session Space
(Figure: training and test sessions as points in session space, clustered by language – English, French, Arabic)
LID in Session Space - Algorithm • Front end: shifted delta cepstrum (SDC). • Represent every train/test session by a GMM super-vector. • Train a linear SVM to classify GMM super-vectors. • Results • EER=4.1% on the NIST-03 Eval (30sec sessions).
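A hedged sketch of the classification step above (random vectors stand in for the SDC-derived GMM supervectors; the quoted 4.1% EER is from the real system, not from this toy):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(5)

# Toy GMM supervectors for two languages; in the real system each
# session's supervector comes from SDC features and a MAP-adapted GMM.
lang_a = rng.normal(0.0, 1.0, size=(100, 32))
lang_b = rng.normal(0.7, 1.0, size=(100, 32))
X = np.vstack([lang_a, lang_b])
y = np.array([0] * 100 + [1] * 100)

# Train a linear SVM on the supervectors and classify held-out sessions.
clf = LinearSVC(C=1.0).fit(X, y)
held_out = np.vstack([rng.normal(0.0, 1.0, size=(20, 32)),
                      rng.normal(0.7, 1.0, size=(20, 32))])
acc = (clf.predict(held_out) == np.array([0] * 20 + [1] * 20)).mean()
print(f"toy LID accuracy: {acc:.2f}")
```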
Anchor Modeling Projection
Given: anchor models λ1,…,λn and a session X = x1,…,xF
Projection: each coordinate of the projected session is the average normalized log-likelihood of X under one anchor model
Applications:
• Speaker indexing [Sturim et al., 2001]
• Intersession variability modeling in projected space [Collet et al., 2005]
• Speaker clustering [Reynolds et al., 2004]
• Speaker segmentation [Collet et al., 2006]
• Language identification [Noor and Aronowitz, 2006]
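A sketch of the projection, assuming "average normalized log-likelihood" means the per-frame average log-likelihood under each anchor model (some variants additionally normalize by a background-model score):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)

# Toy anchor models: GMMs trained on three different "anchor speakers".
anchor_models = [
    GaussianMixture(n_components=2, random_state=0).fit(
        rng.normal(loc=m, scale=1.0, size=(300, 4)))
    for m in (0.0, 1.0, 2.0)
]

def anchor_project(frames):
    """Project a session onto anchor space: one coordinate per anchor
    model, equal to the average per-frame log-likelihood under it."""
    return np.array([gm.score(frames) for gm in anchor_models])

session = rng.normal(loc=0.0, scale=1.0, size=(200, 4))  # near anchor #1
v = anchor_project(session)
print(v)
```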
Intra-Class Variability Modeling: Introduction
• The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
  • Noise
  • Channel
  • Language
  • Changing speaker characteristics – stress, emotion, aging
• The frame independence assumption does not hold in these cases! (1)
• Instead, we get: (2)