Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs Andrew Rosenberg Queens College / CUNY Interspeech 2013 August 26, 2013
Prosody • Prosody – Pitch, Intensity, Rhythm, Silence • Prosody carries information about a speaker’s intent and identity. • Here: prosodic recognition of • Speaking Style • Nativeness • Speaker
Approach • Unsupervised clustering of acoustic/prosodic features. • Sequence modeling of cluster identities
K-Means • K-means is a simple distance-based clustering algorithm. • Iterative and non-deterministic (sensitive to initialization). • Must specify K. • We evaluate K between 2 and 100; the optimal value is selected by cross-validation for each task.
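The clustering step above can be sketched as follows; the feature matrix here is random stand-in data, not the paper's actual syllable features, and K=10 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for syllable-level acoustic/prosodic feature vectors
# (the paper uses 7 features per pseudosyllable).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))

# K must be specified up front; the paper sweeps K in [2, 100] and
# picks the best value per task via cross-validation.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Each syllable is replaced by its cluster identity, yielding a
# discrete symbol sequence suitable for n-gram modeling.
cluster_sequence = km.labels_
print(cluster_sequence[:20])
```

Because K-means is sensitive to initialization, `n_init=10` restarts the algorithm and keeps the best run.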
Dirichlet Process GMMs • Non-parametric infinite mixture model • needs a prior over π – the Dirichlet process • and a prior over the component means – a zero-mean Gaussian • still need to set the hyperparameters α and G0 • Stick-breaking & Chinese Restaurant metaphors • Blei and Jordan (2005): variational inference • “Rich get Richer” Plate notation from M. Jordan 2005 NIPS tutorial
DPGMM “Rich get Richer” • Artificially omit the largest cluster • α = 0.25
Prosodic Event Distribution • ToBI Prosodic Labels • Pitch Accents, Phrase Accent/Boundary Tones Accent Type Distribution Phrase Ending Distribution
Sequence Modeling • SRILM 3-gram model • Backoff & Good-Turing smoothing • Clusters learned over all material • Sequence models trained on the training sets
Experiments • Classification • Train one SRILM model per class. • Classify by lowest perplexity. • Outlier Detection • Train a single model. • A classifier learns a perplexity threshold. • Tasks: Speaking Style, Nativeness, Speaker Recognition • Evaluation • 500 samples of 10–100 syllables (~2–20 seconds) • ToBI, K-Means, DPGMM, DPGMM′ (removing the largest cluster) • 5-fold cross-validation to learn hyperparameters
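The classify-by-lowest-perplexity setup can be sketched with a toy smoothed bigram model; the paper uses SRILM 3-gram models with backoff and Good-Turing smoothing, so the add-one-smoothed bigram here is only a self-contained stand-in, and the cluster-ID sequences are invented for illustration.

```python
import math
from collections import Counter

class BigramModel:
    """Add-one-smoothed bigram model over cluster-ID sequences."""
    def __init__(self, sequences, vocab_size):
        self.V = vocab_size
        self.bigrams = Counter()
        self.unigrams = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1

    def perplexity(self, seq):
        log_prob, n = 0.0, 0
        for a, b in zip(seq, seq[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.V)
            log_prob += math.log(p)
            n += 1
        return math.exp(-log_prob / n)

def classify(models, seq):
    # One model per class; lowest perplexity wins.
    return min(models, key=lambda label: models[label].perplexity(seq))

# Hypothetical cluster-ID sequences for two "styles".
models = {
    "READ": BigramModel([[0, 1, 0, 1, 0, 1, 0, 1]], vocab_size=3),
    "SPON": BigramModel([[2, 2, 0, 2, 2, 0, 2, 2]], vocab_size=3),
}
print(classify(models, [0, 1, 0, 1]))  # → "READ"
```

Outlier detection replaces the per-class comparison with a single model plus a learned perplexity threshold.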
Data • Boston Directions Corpus • READ, SPONTANEOUS • 4 speakers (used for Speaker Classification) • Boston University Radio News Corpus • BROADCAST NEWS • 6 speakers • Columbia Games Corpus • SPONTANEOUS DIALOG • 13 speakers • Native Mandarin Chinese Speakers reading BURNC stories. • 4 speakers • All ToBI Labeled
Features • Villing (2004) pseudosyllabification • Syllables with mean intensity below 10dB are considered “silent” • 7 Features • Mean range normalized intensity • Mean range normalized delta intensity • Mean z-score normalized log f0 • Mean z-score normalized delta log f0 • Syllable duration • Duration of previous silence (if any) • Duration of following silence (if any)
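The two normalizations named in the feature list can be sketched as below; the frame-level pitch and intensity tracks here are hypothetical random data, not output of the Villing (2004) pseudosyllabifier.

```python
import numpy as np

# Range normalization maps intensity into [0, 1]; log f0 is
# z-score normalized, as in the slide's feature list.
def range_normalize(x):
    return (x - x.min()) / (x.max() - x.min())

def zscore(x):
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
intensity = rng.uniform(40, 80, size=100)  # dB, hypothetical track
f0 = rng.uniform(80, 300, size=100)        # Hz, hypothetical track

norm_intensity = range_normalize(intensity)
norm_log_f0 = zscore(np.log(f0))

# Syllable-level features are means of these normalized tracks,
# plus the syllable duration and adjacent silence durations.
print(norm_intensity.mean().round(3), norm_log_f0.mean().round(3))
```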
Consistency with ToBI labels • V-Measure between • ToBI Accent Types and clusters • ToBI Intonational Phrase-ending Tones and clusters • K-means: solid line • DPGMM: gray line for reference (doesn’t vary by more than 0.001) Accenting Phrasing
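The V-Measure comparison can be sketched with scikit-learn; the label and cluster assignments below are invented toy data, not the paper's ToBI annotations.

```python
from sklearn.metrics import v_measure_score

# V-measure is the harmonic mean of homogeneity and completeness,
# comparing a clustering against reference labels (here, hypothetical
# ToBI accent-type labels vs. learned cluster IDs).
tobi_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [0, 0, 1, 1, 1, 1]
score = v_measure_score(tobi_labels, cluster_ids)
print(round(score, 3))
```

A score of 1.0 would mean the clusters reproduce the ToBI labels exactly; merging two label classes into one cluster, as above, lowers homogeneity and hence the score.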
Speaking Style Recognition • 4 styles: READ, SPON, BN, DIALOG • Single speaker for evaluation. Outlier Detection - Dialog Classification
Nativeness Recognition • Native (BURNC) vs. Non-Native • Single speaker for evaluation. Outlier Detection - Native Classification
Speaker Recognition • 6 BURNC Speakers • Detect f2b vs. others • 4 BDC Speakers • 6 tasks for training, 3 for testing Outlier Detection Classification
Conclusions • K-means works well to represent prosodic information • DPGMM does not work so well out-of-the-box. • Despite being non-parametric, hyperparameter setting is still critically important • Future Work • Larger acoustic/prosodic feature set. • requires pre-processing • Evaluating the universality of prosodic representations • Integration of K-means and DPGMM. • Use one to seed the other.
Thank you andrew@cs.qc.cuny.edu http://speech.cs.qc.cuny.edu