"TalkPrinting": Improving Speaker Recognition by Modeling Stylistic Features
S. Kajarekar, K. Sonmez, L. Ferrer, V. Gadde, A. Venkataraman, E. Shriberg, A. Stolcke, H. Bratt
SRI International, Menlo Park, CA
Funding: KDD
Notes: 9 months into project; results updated from paper
NSF-NIJ Symposium, June 2-3, 2003
Outline
• Project motivation and goal
• Overview of approach
• Selected results to date:
  • TalkPrint features: lexical, prosodic
  • Adding TalkPrint features to baseline
  • Effect of amount of training data
  • Effect of amount of test data
• Summary, conclusions, and future work
Motivation
• Significant distal communication occurs by voice only (e.g., telephone conversations)
• Vast amounts of data are captured for intelligence and law enforcement
• Analysts can listen to only a small percentage
• Need technology to filter out the majority of uninteresting cases and mark the ones that contain interesting speakers and/or content
• Must be completely automatic from audio
Objective
• Model patterns in the way people talk to find out:
  • Who is talking? (speaker recognition)
  • What type of conversation is it? (style recognition)
  • Is a speaker acting strange? (anomaly detection: emotion, cognitive state, health, etc.)
• Today: report on speaker recognition only
• "Tag" speech data with this information (probabilistically) to aid analysts
Standard Approach
• Slice speech into tiny (10 ms) time regions; model energy/frequency distributions
• Each speaker = Gaussian mixture model (GMM)
• Frames are independent (unordered)
• No longer-range information
[Figure: the word "Okay" sliced into frame-sized segments]
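As an illustration of the frame-level GMM idea on this slide, here is a minimal Python sketch. It assumes per-frame feature vectors (e.g., cepstral coefficients, one per 10 ms frame) are already extracted; random arrays stand in for real features, and scikit-learn's GaussianMixture stands in for the actual system.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical frame-level features: one row per 10 ms frame
    # (e.g., cepstral coefficients); random data stands in for real audio.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(2000, 20))

    # One speaker = one Gaussian mixture model over frames.
    gmm = GaussianMixture(n_components=32, covariance_type="diag",
                          random_state=0).fit(frames)

    # Frames are scored independently, so their order does not matter:
    test = rng.normal(size=(500, 20))
    print(gmm.score(test))                   # mean per-frame log-likelihood
    print(gmm.score(rng.permutation(test)))  # same value: no longer-range information

The two scores are identical because the model simply averages per-frame log-likelihoods, which is exactly the "unordered frames" limitation discussed on the next slide.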
Limitations of Current Approach
• Works very well under certain conditions, BUT:
  • Degrades with channel variation and noise
  • Can't distinguish people with similar vocal tracts
  • Fails to capture longer-range properties:
    • habitual word patterns, disfluency rates and types
    • prosodic (pausing, temporal, and intonation) patterns
    • turn-length and turn-taking patterns
• Long-range cues are also useful for style and anomaly detection
New Approach: "TalkPrinting"
• Capture behavioral patterns in how a person talks (speaking rate, intonation, word usage, etc.)
• Humans use these patterns (e.g., identifying a speaker through a wall)
• Patterns reflect different underlying causes: dialectal, social, pragmatic, cognitive, affective
• While behavioral, many patterns are hard to fake well
• Combine TalkPrint features with conventional (voiceprint) features
Decision Paradigm
• Build a model for the general population
• Build a model for the target speaker
• Compare the models using a likelihood-ratio test: the score is the difference in log scores for the data given each model
• This score is compared to a threshold to make discrete decisions
• The threshold determines the tradeoff between error types (misses, false alarms)
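A minimal sketch of this likelihood-ratio decision, again assuming frame-level feature vectors and using scikit-learn GaussianMixture models for the general-population ("background") and target-speaker models; the data and the threshold value below are placeholders, not the system's actual settings.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    background_frames = rng.normal(size=(5000, 20))        # pooled general-population data (placeholder)
    target_frames = rng.normal(loc=0.3, size=(1500, 20))   # enrollment data for the target speaker (placeholder)

    ubm = GaussianMixture(n_components=64, covariance_type="diag", random_state=0).fit(background_frames)
    spk = GaussianMixture(n_components=64, covariance_type="diag", random_state=0).fit(target_frames)

    def llr_score(test_frames):
        # Difference in log scores for the data given each model.
        return spk.score(test_frames) - ubm.score(test_frames)

    threshold = 0.0   # moving this trades misses against false alarms
    test_frames = rng.normal(loc=0.3, size=(600, 20))
    decision = "accept" if llr_score(test_frames) > threshold else "reject"
    print(llr_score(test_frames), decision)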
Experiments
• Task: speaker ID (no data yet for the other tasks)
• Data from telephone conversations on various topics (Switchboard corpus)
• Built a competitive baseline system in order to assess the gain from TalkPrint features
• Built TalkPrint systems from new features
• To date, systems are fused at the score level using a neural network (updated from paper)
• Eventual goal: fuse at the feature level
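To make the score-level fusion step concrete, here is a hedged sketch: each component system (baseline, language model, duration, ...) produces one score per trial, and a small neural network maps that score vector to a fused score. The actual fuser architecture is not specified on the slide; this uses scikit-learn's MLPClassifier on synthetic scores purely for illustration.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(2)

    # Synthetic per-trial scores from three component systems
    # (columns: baseline GMM, language model, duration model).
    n_trials = 1000
    labels = rng.integers(0, 2, size=n_trials)                       # 1 = target trial, 0 = impostor trial
    scores = rng.normal(size=(n_trials, 3)) + labels[:, None] * 1.5  # targets score higher on average

    fuser = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    fuser.fit(scores, labels)

    # Fused score for a new trial = posterior probability of "target".
    new_trial = np.array([[1.2, 0.4, 0.9]])
    print(fuser.predict_proba(new_trial)[0, 1])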
Standard Evaluation Metrics
• Annual speaker recognition evaluations conducted by NIST
• Various metrics:
  • Detection Error Trade-off (DET) curves: show the dependence between miss and false-alarm rates
  • Equal Error Rate (EER): the point on the DET curve at which the miss rate equals the false-alarm rate
  • Cost-weighted error rate (application dependent)
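For concreteness, a small sketch of computing the EER from target and impostor trial scores (synthetic scores below); it simply sweeps the decision threshold and finds the point where the miss rate and the false-alarm rate coincide.

    import numpy as np

    def equal_error_rate(target_scores, impostor_scores):
        # Sweep the decision threshold over all observed scores and find
        # the point where miss rate and false-alarm rate are closest.
        thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
        miss = np.array([(target_scores < t).mean() for t in thresholds])
        fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
        i = np.argmin(np.abs(miss - fa))
        return (miss[i] + fa[i]) / 2.0

    rng = np.random.default_rng(3)
    tgt = rng.normal(loc=1.0, size=500)    # synthetic target-trial scores
    imp = rng.normal(loc=0.0, size=5000)   # synthetic impostor-trial scores
    print(equal_error_rate(tgt, imp))      # roughly 0.31 for these overlapping distributions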
Questions
• Can we improve speaker recognition by augmenting the baseline system with:
  • Language features?
  • Prosodic (rate, rhythm, melody) features?
• How is performance affected by:
  • Amount of training data?
  • Amount of test data?
Language Features & ASR
• A language model yields probabilities of frequent words and word pairs: "uh-huh", "yeah", "I mean", "you know", etc.
• Need to recognize the words first: requires a large-vocabulary conversational ASR engine
• Word error rates for conversational ASR are high (>20%) even for state-of-the-art systems
• Used a purposely stripped-down version of SRI's state-of-the-art LVCSR system: 38% WER on this data
• A test of whether we can get by with high WER!
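A toy sketch of the idea behind the language-model feature, assuming ASR transcripts are available as word lists: build smoothed unigram counts for the target speaker and for the background population, then score a test transcript with the per-word log-likelihood ratio. The real system models frequent words and word pairs; this unigram version is only a stand-in.

    from collections import Counter
    import math

    def unigram_logprob(counts, total, vocab_size, word):
        # Add-one (Laplace) smoothing so unseen words get nonzero probability.
        return math.log((counts[word] + 1) / (total + vocab_size))

    def lm_llr(test_words, target_words, background_words):
        vocab = set(target_words) | set(background_words) | set(test_words)
        t_counts, b_counts = Counter(target_words), Counter(background_words)
        score = 0.0
        for w in test_words:
            score += unigram_logprob(t_counts, len(target_words), len(vocab), w)
            score -= unigram_logprob(b_counts, len(background_words), len(vocab), w)
        return score / len(test_words)

    # Toy transcripts; real input would be errorful ASR output.
    target = "yeah i mean you know i mean yeah".split()
    background = "well so okay right so well okay".split()
    test = "yeah you know i mean".split()
    print(lm_llr(test, target, background))   # positive: test matches the target's word habits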
Prosodic Features
• Long line of work at SRI on prosody modeling
• Notable aspect: prosody is modeled directly from the signal (no intermediate phonological labels)
• Raw feature types:
  • Duration (phonemes, syllables, words; normalized)
  • Pause location and duration
  • Intonation (pitch contours, stylized using spline fits)
  • Energy (also stylized contours)
• Duration and pause features use time alignments from the recognition hypothesis
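To illustrate the "stylized contour" idea, a sketch that assumes a pitch (f0) track has already been extracted as one value per 10 ms frame (synthetic values below) and fits a smoothing spline with SciPy; the fitted curve and its slopes serve as compact intonation features. The actual SRI stylization procedure is not specified here.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    # Synthetic f0 contour: one pitch value (Hz) per 10 ms frame, with jitter.
    t = np.arange(0.0, 2.0, 0.01)             # 2 seconds of voiced speech
    rng = np.random.default_rng(4)
    f0 = 120 + 30 * np.sin(2 * np.pi * 0.8 * t) + rng.normal(0, 3, t.size)

    # Fit a smoothing spline; the fitted curve is a compact, stylized
    # description of the raw intonation contour.
    spline = UnivariateSpline(t, f0, s=t.size * 9.0)
    stylized = spline(t)

    # Simple contour-level features a speaker model might use.
    slope = spline.derivative()
    print(stylized.mean(), np.ptp(stylized), slope(t[0]), slope(t[-1]))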
Sample Prosodic Features
• Duration features:
  • vector of durations of the phones in a word and of the "states" (3 subphone units) in each phone, e.g.: "Tucson" → t uw s ah n
• "NERFs" (New Extraction Region Features)
  • Sample: pitch and duration features in regions between consecutive pauses (few parameters)
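A small sketch of turning time alignments into the duration and pause-region features described above. The phone labels for "Tucson" follow the slide, but all times, the pause threshold, and the word sequence are hypothetical; the real NERF feature set is much richer.

    # Hypothetical forced-alignment output for one word: (phone, start_sec, end_sec).
    tucson = [("t", 0.00, 0.06), ("uw", 0.06, 0.20), ("s", 0.20, 0.31),
              ("ah", 0.31, 0.39), ("n", 0.39, 0.50)]

    # Duration feature vector for the word: one duration per phone.
    duration_vector = [round(end - start, 3) for _, start, end in tucson]
    print(duration_vector)                    # [0.06, 0.14, 0.11, 0.08, 0.11]

    # Pause-delimited regions in the spirit of NERFs: word alignments, with
    # gaps longer than MIN_PAUSE starting a new region.
    words = [("okay", 0.00, 0.40), ("so", 0.50, 0.70), ("i", 0.71, 0.78),
             ("think", 0.78, 1.10), ("that", 1.60, 1.80)]
    MIN_PAUSE = 0.2
    regions, current = [], [words[0]]
    for prev, cur in zip(words, words[1:]):
        if cur[1] - prev[2] >= MIN_PAUSE:
            regions.append(current)
            current = []
        current.append(cur)
    regions.append(current)

    # One simple per-region feature: speaking duration between pauses.
    print([round(r[-1][2] - r[0][1], 2) for r in regions])   # [1.1, 0.2]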
Results: Features
16 training conversations, 1 test conversation
• LM and Dur combine well with each other
• Fusing LM+Dur with the baseline dramatically improves performance
• First NERF results show a further gain (fewer misses)
• Note: DCF penalizes false alarms (access applications); filtering applications would penalize misses
Effect of Amount of Training Data
• 1 conversation = approx. 3 minutes
• Performance improves with added training data
• Effect is similar for both baseline and TalkPrint systems
• Intelligence applications are likely to keep adding training data
Effect of Amount of Test Data
• EER: false alarms = misses
• Combined system = Baseline + Dur
• Duration significantly aids performance
• Helps even at 10 seconds of test data
• Baseline seems to saturate at 2 minutes; duration keeps improving with test length
Summary & Conclusions (1)
• Automatic tagging of massive amounts of audio data for speaker, likely content, and anomalies can preprocess data for human analysts
• Conventional speaker recognition fails to capture beyond-the-frame behavioral patterns
• We find such behavioral patterns aid speaker recognition when added to a state-of-the-art baseline system (frame-based features)
• Useful TalkPrint features include both language and prosody
Summary & Conclusions (2)
• Language and prosody features complement both each other and the baseline features
• Both language and prosody features help despite nearly 40% of the words being wrong!
• Performance of TalkPrint features improves with both added training and added test data
• In contrast, baseline features appear to saturate after about 2 minutes of test data
Future Work
• Improve TalkPrint features
• Develop feature selection and fusion methods
• Investigate effect of various factors:
  • word error rate of the ASR system
  • noise (are TalkPrint features more robust?)
• Work with government to assess performance on relevant conversational data
• Extend approach to capture information about type of conversation and to detect anomalies