Language modeling for speaker recognition

Dan Gillick
January 20, 2004

Outline
• Author identification
• Trying to beat Doddington’s “idiolect” modeling strategy (speaker recognition)
• My next project

Author ID (undergrad. thesis)
Problem:
• train models for each of k authors
• given some test text written by 1 of those authors, identify the correct author
Variations:
• different kinds of models
• different size test samples
• different k

Character n-gram models
What?
• 27 tokens: a-z, <space>
• some text generated from such a trigram model: “you orthad gool of anythilly uncand or prafecaustiont and to hing that put ably”
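A minimal sketch of how text like the sample above can be produced, assuming plain relative-frequency trigram counts with no smoothing. The function names (`normalize`, `train_trigrams`, `generate`) and the restart rule for unseen histories are illustrative assumptions, not details from the thesis.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # the 27 tokens: a-z plus space


def normalize(text):
    """Lowercase, map every non-letter to a space, collapse runs of spaces."""
    chars = [c if c in ALPHABET else " " for c in text.lower()]
    return " ".join("".join(chars).split())


def train_trigrams(text):
    """Count next-character frequencies for each two-character history."""
    counts = defaultdict(Counter)
    for i in range(len(text) - 2):
        counts[text[i:i + 2]][text[i + 2]] += 1
    return counts


def generate(counts, length, seed=0):
    """Sample one character at a time, conditioned on the previous two."""
    rng = random.Random(seed)
    histories = sorted(counts)
    out = rng.choice(histories)  # start from a random seen history
    while len(out) < length:
        following = counts.get(out[-2:])
        if not following:
            # Assumption: on an unseen history, restart from a seen one.
            out += " " + rng.choice(histories)
            continue
        chars, weights = zip(*sorted(following.items()))
        out += rng.choices(chars, weights=weights)[0]
    return out[:length]
```

Usage: `generate(train_trigrams(normalize(novel_text)), 80)` returns 80 characters of pseudo-text in the style of the training text.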

Character n-gram models
Why?
• very simple
• data sparseness less troublesome than with word n-grams
• supposed to be state-of-the-art or at least close to it (Khmelev, D. and Tweedie, F. J. “Using Markov Chains for the Identification of Writers.” Literary and Linguistic Computing, 16(4): 299-307, 2001.)

Character n-grams: Setup
• task: pick correct author from 10 possible authors
• training data: 3 novels for each author
• test data: text from a held-out novel
• jack-knifing: 4 novels for each of 20 authors
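The setup above can be sketched as follows: train one character n-gram model per author, pick the author whose model gives the test text the highest log probability, and rotate which novel is held out. The add-one smoothing, the class and function names, and the exact rotation (hold out the i-th novel of every author in fold i) are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramModel:
    def __init__(self, texts, n=3, vocab_size=27):
        self.n = n
        self.vocab_size = vocab_size
        self.ngram = Counter()    # counts of full n-grams
        self.history = Counter()  # counts of (n-1)-character histories
        for text in texts:
            for gram in char_ngrams(text, n):
                self.ngram[gram] += 1
                self.history[gram[:-1]] += 1

    def logprob(self, text):
        """Log probability of text; add-one smoothing is an assumption."""
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram[gram] + 1
            denominator = self.history[gram[:-1]] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def identify(models, test_text):
    """Return the author whose model scores the test text highest."""
    return max(models, key=lambda author: models[author].logprob(test_text))


def jackknife_accuracy(novels_by_author, n=3):
    """Fold i holds out novel i of every author and trains on the rest."""
    n_folds = min(len(texts) for texts in novels_by_author.values())
    correct = trials = 0
    for fold in range(n_folds):
        models = {
            author: CharNgramModel(texts[:fold] + texts[fold + 1:], n)
            for author, texts in novels_by_author.items()
        }
        for author, texts in novels_by_author.items():
            correct += identify(models, texts[fold]) == author
            trials += 1
    return correct / trials
```

With 4 novels per author this yields 4 folds, each training on 3 novels and testing on the fourth, which matches the training and test sizes listed on the slide.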

Character n-grams: Results
• task: picking 1 author from 10 possible authors
• training data size: 3 novels

Character n-gram models
Why does it work?
• captures some word choice information
• picks up word endings (-ing, -tion, -ly, etc.)
• not hurt much by data sparseness issues

Key-list models
Incentive:
• ought to be able to beat character n-grams
• develop a new modeling method focused on what differentiates authors (characters and words are both useful for topic recognition, but that doesn’t mean they are best for author recognition)

Key-list models
Idea:
• convert the text stream into a stream of only authorship-relevant symbols (I called these lists of symbols key-lists)
• each symbol is a regular expression to allow for broad definitions (/*tion/ captures any nominalization ending in -tion)
• text not accounted for by the key-list is represented by <short>, <med>, or <long> markers
• build n-gram models from these new streams
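The conversion described above might look like the sketch below. The key-list entries, the tokenizer, and the length cut-offs for <short>, <med>, and <long> are all hypothetical; the slides do not say whether the length markers are assigned per word or per run of uncovered text, so this version assumes one marker per uncovered token.

```python
import re

# Hypothetical key-list: (symbol, pattern) pairs, tried in order.
KEY_LIST = [
    ("<comma>", re.compile(r",")),
    ("<period>", re.compile(r"\.")),
    ("<tion>", re.compile(r"\w*tion")),
    ("<ing>", re.compile(r"\w*ing")),
    ("<the>", re.compile(r"the")),
]

# Words, or single punctuation characters.
TOKEN = re.compile(r"\w+|[^\w\s]")


def length_marker(token, short_max=3, med_max=6):
    """Stand-in symbol for a token the key-list does not cover."""
    if len(token) <= short_max:
        return "<short>"
    if len(token) <= med_max:
        return "<med>"
    return "<long>"


def to_key_stream(text):
    """Replace each token by its key-list symbol or a length marker."""
    stream = []
    for token in TOKEN.findall(text.lower()):
        for symbol, pattern in KEY_LIST:
            if pattern.fullmatch(token):
                stream.append(symbol)
                break
        else:
            stream.append(length_marker(token))
    return stream


def trigrams(stream):
    """Overlapping symbol trigrams, ready for n-gram counting."""
    return list(zip(stream, stream[1:], stream[2:]))
```

For example, `to_key_stream("Yes, no.")` gives `["<short>", "<comma>", "<short>", "<period>"]`, whose trigrams include `<comma> <short> <period>`.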

Key-list models
Sample key-list:
Sample trigram: <comma> <short> <period>

Key-list models: Results
• task: picking 1 author from 10 possible authors
• training data size: 3 novels

Key-list models: Results
Some other interesting results:
• key-lists with just punctuation (as well as <short>, <med>, <long>) performed almost as well as the best key-lists
• with fewer than 10,000 characters of test data, the best n-letter model outperformed every key-list model, but with larger test samples every key-list model eventually surpassed the n-letter models

Key-list models
Things I didn’t do:
• vary amount of training data
• spend a long time trying different key-lists
• combine key-list results with each other or with the character results
• a lot of other stuff
The thesis is available on the web: http://www.dgillick.com/resource/thesis.pdf

Outline
• Author identification
• Trying to beat Doddington’s “idiolect” modeling strategy (speaker recognition)
• My next project

G. Doddington’s LM strategy
• create LMs with a limited vocabulary of the 2000 most commonly occurring bigrams
• to smooth out zeroes, boost each bigram prob. by 0.001
• score by calculating: logprob(test|target) - logprob(test|bkg)
• logprobs are joint probabilities: logprob(AB) = logprob(A) + logprob(B|A)
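A rough sketch of the scoring recipe above, not a reproduction of the original system. It assumes joint bigram probabilities are relative frequencies over in-vocabulary bigrams, that the 0.001 boost is simply added to each probability, and that test bigrams outside the vocabulary are skipped; all function names are hypothetical.

```python
import math
from collections import Counter


def bigrams(words):
    """Adjacent word pairs."""
    return list(zip(words, words[1:]))


def top_bigrams(words, size=2000):
    """Limited vocabulary: the `size` most frequent bigrams."""
    return {bg for bg, _ in Counter(bigrams(words)).most_common(size)}


def joint_probs(words, vocab, boost=0.001):
    """Joint probability P(AB) for each vocabulary bigram, plus the boost."""
    counts = Counter(bg for bg in bigrams(words) if bg in vocab)
    total = sum(counts.values())
    return {
        bg: (counts[bg] / total if total else 0.0) + boost
        for bg in vocab
    }


def llr_score(test_words, target, background):
    """logprob(test|target) - logprob(test|bkg) over in-vocabulary bigrams."""
    score = 0.0
    for bg in bigrams(test_words):
        if bg in target and bg in background:
            score += math.log(target[bg]) - math.log(background[bg])
    return score
```

A positive score means the test side looks more like the target speaker than like the background population; thresholding that score gives the accept/reject decision.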

G. Doddington’s LM: Setup
Switchboard 1 data:
• collected in early ’90s from all over the US
• 2,400 (~5 min.) conversations among 543 speakers
• corpus divided into 6 splits and tested using jack-knifing through the splits
• manual transcripts provided by Mississippi State
Task:
• 8 conversation sides used as training data to build models for each target speaker
• 1 conversation side used as test data
• background model built from 3 splits of held-out data
• jack-knifing allowed for almost 10,000 trials

G. Doddington’s LM: Results
Notes:
• these results are my own attempt to replicate the original experiments
• SRI reported EER = 8.65% for this same experiment

Adapted bigram models
Incentive:
• adapting target models from a much larger background model should yield better estimates of probabilities in the language models
Specifically:
• use same 2000 bigram vocabulary
• target probabilities are a mixture of training probabilities and background probabilities
• mixture weight is 2:1 target data:bkg. data
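One way to read the 2:1 mixture above is as a linear interpolation of probabilities with weights 2/3 (target) and 1/3 (background). That reading is an assumption (the weights could instead apply to counts), and the function name `adapt` is hypothetical.

```python
def adapt(target_probs, background_probs, target_weight=2.0, background_weight=1.0):
    """Interpolate target and background bigram probabilities.

    Every bigram in the background vocabulary gets a probability, so a
    bigram the target speaker never used falls back on the background.
    """
    total = target_weight + background_weight
    return {
        bigram: (target_weight * target_probs.get(bigram, 0.0)
                 + background_weight * p_background) / total
        for bigram, p_background in background_probs.items()
    }
```

Because every adapted probability includes a share of the background probability, no bigram with nonzero background probability ends up at zero, so the adapted model needs no separate zero-smoothing step in this sketch.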

Adapted bigram models: Results
Notes:
• nearly identical performance
• combination of the 2 systems yields almost no improvement
• why isn’t the adapted version better?

Can anything improve on 8.68%?
Trigrams?
• use same count threshold to make a list of the top 700 trigrams (“a lot of”, “I don’t know” were among the most common)
Character models?
• worked well for authorship…
• included all character combinations (no limited vocabulary)
• tried bigram and trigram models

Scores and combinations
• adapt. word bigrams: EER = 8.89%
• adapt. word trigrams: EER = 11.88%
• adapt. char. bigrams: EER = 13.73%
• adapt. char. trigrams: EER = 17.92%
• adapted words: EER = 8.46%
• adapted characters: EER = 13.24%
• adapted words + adapted characters: EER = 7.89%
• GD bigrams: EER = 8.68%
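For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the miss rate on target trials equals the false-alarm rate on impostor trials. The sketch below computes it by sweeping thresholds over the observed scores. The `combine` helper is purely an assumption: the slides do not say how system scores were combined, so an equal-weight sum is shown only as one simple possibility.

```python
def equal_error_rate(target_scores, impostor_scores):
    """EER: where the miss rate and the false-alarm rate are (nearly) equal."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = float("inf"), None
    for threshold in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


def combine(scores_a, scores_b, weight_a=0.5):
    """Hypothetical score fusion: weighted sum of two systems' trial scores."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(scores_a, scores_b)]
```

With finitely many trials the two error rates rarely cross exactly, so this version reports the average of the two rates at the threshold where they are closest.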

Final Comparison

What about less training data?
1 conversation side of training data:
• character models might provide more of an advantage with less data?
• not so:
• GD EER = 22.5%
• adapted character EER = 30%
• adapted word EER = 20%
• maybe these character models pick up on the topic of that 1 conversation
• haven’t tried any other size of training data

Outline
• Author identification
• Trying to beat GD’s result
• My next project

Key-lists for speaker recognition
• key-list n-grams picked up on phrasing (comma and period were valuable tokens)
• automatic transcripts don’t have punctuation but they do have pause and duration information
• use reg. exps. and duration info. to capture idiosyncratic speaker phrasing
• capture other speech information in key-lists? (energy, f0, etc.)

Acknowledgements
Thanks to:
Anand and Luciana at SRI for trying to help me replicate their results
Barbara for providing advice
Barry and Kofi for helping with computers and stuff
George
