"TalkPrinting": Improving Speaker Recognition by Modeling Stylistic Features
S. Kajarekar, K. Sonmez, L. Ferrer, V. Gadde, A. Venkataraman, E. Shriberg, A. Stolcke, H. Bratt
SRI International, Menlo Park, CA
Funding: KDD
Notes: 9 months into project; results updated from paper
NSF-NIJ Symposium, June 2-3, 2003
Outline
• Project motivation and goal
• Overview of approach
• Selected results to date:
  • TalkPrint features: lexical, prosodic
  • Adding TalkPrint features to baseline
  • Effect of amount of training data
  • Effect of amount of test data
• Summary, conclusions, and future work
Motivation
• Significant distal communication occurs by voice only (e.g., telephone conversations)
• Vast amounts of data are captured for intelligence and law enforcement
• Analysts can listen to only a small percentage
• Need technology to filter out the majority of uninteresting cases and mark the ones that contain interesting speakers and/or content
• Must be completely automatic from audio
Objective
• Model patterns in the way people talk to find out:
  • Who is talking? (speaker recognition)
  • What type of conversation is it? (style recognition)
  • Is a speaker acting strange? (anomaly detection: emotion, cognitive state, health, etc.)
• Today: report on speaker recognition only
• "Tag" speech data with this information (probabilistically) to aid analysts
Standard Approach
• Slice speech into tiny (10 ms) time regions; model energy/frequency distributions
• Each speaker = Gaussian mixture model (GMM)
• Frames are independent (unordered)
• No longer-range information
[Figure: the word "Okay" sliced into frame-sized segments]
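As an illustration of the frame-level GMM idea on this slide, here is a minimal Python sketch. It assumes per-frame feature vectors (e.g., cepstral coefficients, one per 10 ms frame) are already extracted; random arrays stand in for real features, and scikit-learn's GaussianMixture stands in for the actual system.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical frame-level features: one row per 10 ms frame
    # (e.g., cepstral coefficients); random data stands in for real audio.
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(2000, 20))

    # One speaker = one Gaussian mixture model over frames.
    gmm = GaussianMixture(n_components=32, covariance_type="diag",
                          random_state=0).fit(frames)

    # Frames are scored independently, so their order does not matter:
    test = rng.normal(size=(500, 20))
    print(gmm.score(test))                   # mean per-frame log-likelihood
    print(gmm.score(rng.permutation(test)))  # same value: no longer-range information

The two scores are identical because the model simply averages per-frame log-likelihoods, which is exactly the "unordered frames" limitation discussed on the next slide.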
Limitations of Current Approach
• Works very well under certain conditions, BUT:
  • Degrades with channel variation and noise
  • Can't distinguish people with similar vocal tracts
  • Fails to capture longer-range properties:
    • habitual word patterns, disfluency rates and types
    • prosodic (pausing, temporal, and intonation) patterns
    • turn-length and turn-taking patterns
• Long-range cues are also useful for style and anomaly detection
New Approach: "TalkPrinting"
• Capture behavioral patterns in how a person talks (speaking rate, intonation, word usage, etc.)
• Humans use these patterns (e.g., identifying a speaker through a wall)
• Patterns reflect different underlying causes: dialectal, social, pragmatic, cognitive, affective
• While behavioral, many patterns are hard to fake well
• Combine TalkPrint features with conventional (voiceprint) features
Decision Paradigm
• Build a model for the general population
• Build a model for the target speaker
• Compare the models using a likelihood-ratio test: the score is the difference in log scores for the data given each model
• This score is compared to a threshold to make discrete decisions
• The threshold determines the tradeoff between error types (misses, false alarms)
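A minimal sketch of this likelihood-ratio decision, again assuming frame-level feature vectors and using scikit-learn GaussianMixture models for the general-population ("background") and target-speaker models; the data and the threshold value below are placeholders, not the system's actual settings.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    background_frames = rng.normal(size=(5000, 20))        # pooled general-population data (placeholder)
    target_frames = rng.normal(loc=0.3, size=(1500, 20))   # enrollment data for the target speaker (placeholder)

    ubm = GaussianMixture(n_components=64, covariance_type="diag", random_state=0).fit(background_frames)
    spk = GaussianMixture(n_components=64, covariance_type="diag", random_state=0).fit(target_frames)

    def llr_score(test_frames):
        # Difference in log scores for the data given each model.
        return spk.score(test_frames) - ubm.score(test_frames)

    threshold = 0.0   # moving this trades misses against false alarms
    test_frames = rng.normal(loc=0.3, size=(600, 20))
    decision = "accept" if llr_score(test_frames) > threshold else "reject"
    print(llr_score(test_frames), decision)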
Experiments
• Task: speaker ID (no data yet for the other tasks)
• Data from telephone conversations on various topics (Switchboard corpus)
• Built a competitive baseline system in order to assess the gain from TalkPrint features
• Built TalkPrint systems from new features
• To date, systems are fused at the score level using a neural network (updated from paper)
• Eventual goal: fuse at the feature level
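To make the score-level fusion step concrete, here is a hedged sketch: each component system (baseline, language model, duration, ...) produces one score per trial, and a small neural network maps that score vector to a fused score. The actual fuser architecture is not specified on the slide; this uses scikit-learn's MLPClassifier on synthetic scores purely for illustration.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(2)

    # Synthetic per-trial scores from three component systems
    # (columns: baseline GMM, language model, duration model).
    n_trials = 1000
    labels = rng.integers(0, 2, size=n_trials)                       # 1 = target trial, 0 = impostor trial
    scores = rng.normal(size=(n_trials, 3)) + labels[:, None] * 1.5  # targets score higher on average

    fuser = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    fuser.fit(scores, labels)

    # Fused score for a new trial = posterior probability of "target".
    new_trial = np.array([[1.2, 0.4, 0.9]])
    print(fuser.predict_proba(new_trial)[0, 1])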
Standard Evaluation Metrics
• Annual speaker recognition evaluations conducted by NIST
• Various metrics:
  • Detection Error Trade-off (DET) curves: show the dependence between miss and false-alarm rates
  • Equal Error Rate (EER): the point on the DET curve at which the miss rate equals the false-alarm rate
  • Cost-weighted error rate (application dependent)
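For concreteness, a small sketch of computing the EER from target and impostor trial scores (synthetic scores below); it simply sweeps the decision threshold and finds the point where the miss rate and the false-alarm rate coincide.

    import numpy as np

    def equal_error_rate(target_scores, impostor_scores):
        # Sweep the decision threshold over all observed scores and find
        # the point where miss rate and false-alarm rate are closest.
        thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
        miss = np.array([(target_scores < t).mean() for t in thresholds])
        fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
        i = np.argmin(np.abs(miss - fa))
        return (miss[i] + fa[i]) / 2.0

    rng = np.random.default_rng(3)
    tgt = rng.normal(loc=1.0, size=500)    # synthetic target-trial scores
    imp = rng.normal(loc=0.0, size=5000)   # synthetic impostor-trial scores
    print(equal_error_rate(tgt, imp))      # roughly 0.31 for these overlapping distributions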
Questions
• Can we improve speaker recognition by augmenting the baseline system with:
  • Language features?
  • Prosodic (rate, rhythm, melody) features?
• How is performance affected by:
  • Amount of training data?
  • Amount of test data?
Language Features & ASR
• A language model yields probabilities of frequent words and word pairs: "uh-huh", "yeah", "I mean", "you know", etc.
• Need to recognize the words first: requires a large-vocabulary conversational ASR engine
• Word error rates for conversational ASR are high (>20%) even for state-of-the-art systems
• Used a purposely stripped-down version of SRI's state-of-the-art LVCSR system: 38% WER on this data
• A test of whether we can get by with high WER!
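A toy sketch of the idea behind the language-model feature, assuming ASR transcripts are available as word lists: build smoothed unigram counts for the target speaker and for the background population, then score a test transcript with the per-word log-likelihood ratio. The real system models frequent words and word pairs; this unigram version is only a stand-in.

    from collections import Counter
    import math

    def unigram_logprob(counts, total, vocab_size, word):
        # Add-one (Laplace) smoothing so unseen words get nonzero probability.
        return math.log((counts[word] + 1) / (total + vocab_size))

    def lm_llr(test_words, target_words, background_words):
        vocab = set(target_words) | set(background_words) | set(test_words)
        t_counts, b_counts = Counter(target_words), Counter(background_words)
        score = 0.0
        for w in test_words:
            score += unigram_logprob(t_counts, len(target_words), len(vocab), w)
            score -= unigram_logprob(b_counts, len(background_words), len(vocab), w)
        return score / len(test_words)

    # Toy transcripts; real input would be errorful ASR output.
    target = "yeah i mean you know i mean yeah".split()
    background = "well so okay right so well okay".split()
    test = "yeah you know i mean".split()
    print(lm_llr(test, target, background))   # positive: test matches the target's word habits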
Prosodic Features
• Long line of work at SRI on prosody modeling
• Notable aspect: prosody is modeled directly from the signal (no intermediate phonological labels)
• Raw feature types:
  • Duration (phonemes, syllables, words; normalized)
  • Pause location and duration
  • Intonation (pitch contours, stylized using spline fits)
  • Energy (also stylized contours)
• Duration and pause features use time alignments from the recognition hypothesis
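To illustrate the "stylized contour" idea, a sketch that assumes a pitch (f0) track has already been extracted as one value per 10 ms frame (synthetic values below) and fits a smoothing spline with SciPy; the fitted curve and its slopes serve as compact intonation features. The actual SRI stylization procedure is not specified here.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    # Synthetic f0 contour: one pitch value (Hz) per 10 ms frame, with jitter.
    t = np.arange(0.0, 2.0, 0.01)             # 2 seconds of voiced speech
    rng = np.random.default_rng(4)
    f0 = 120 + 30 * np.sin(2 * np.pi * 0.8 * t) + rng.normal(0, 3, t.size)

    # Fit a smoothing spline; the fitted curve is a compact, stylized
    # description of the raw intonation contour.
    spline = UnivariateSpline(t, f0, s=t.size * 9.0)
    stylized = spline(t)

    # Simple contour-level features a speaker model might use.
    slope = spline.derivative()
    print(stylized.mean(), np.ptp(stylized), slope(t[0]), slope(t[-1]))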
Sample Prosodic Features
• Duration features:
  • vector of durations of the phones in a word and of the "states" (3 subphone units) in each phone, e.g.: "Tucson" → t uw s ah n
• "NERFs" (New Extraction Region Features)
  • Sample: pitch and duration features in regions between consecutive pauses (few parameters)
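A small sketch of turning time alignments into the duration and pause-region features described above. The phone labels for "Tucson" follow the slide, but all times, the pause threshold, and the word sequence are hypothetical; the real NERF feature set is much richer.

    # Hypothetical forced-alignment output for one word: (phone, start_sec, end_sec).
    tucson = [("t", 0.00, 0.06), ("uw", 0.06, 0.20), ("s", 0.20, 0.31),
              ("ah", 0.31, 0.39), ("n", 0.39, 0.50)]

    # Duration feature vector for the word: one duration per phone.
    duration_vector = [round(end - start, 3) for _, start, end in tucson]
    print(duration_vector)                    # [0.06, 0.14, 0.11, 0.08, 0.11]

    # Pause-delimited regions in the spirit of NERFs: word alignments, with
    # gaps longer than MIN_PAUSE starting a new region.
    words = [("okay", 0.00, 0.40), ("so", 0.50, 0.70), ("i", 0.71, 0.78),
             ("think", 0.78, 1.10), ("that", 1.60, 1.80)]
    MIN_PAUSE = 0.2
    regions, current = [], [words[0]]
    for prev, cur in zip(words, words[1:]):
        if cur[1] - prev[2] >= MIN_PAUSE:
            regions.append(current)
            current = []
        current.append(cur)
    regions.append(current)

    # One simple per-region feature: speaking duration between pauses.
    print([round(r[-1][2] - r[0][1], 2) for r in regions])   # [1.1, 0.2]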
Results: Features
16 training conversations, 1 test conversation
• LM and Dur combine well with each other
• Fusing LM+Dur with the baseline dramatically improves performance
• First NERF results show a further gain (fewer misses)
• Note: DCF penalizes false alarms (access applications); filtering applications would penalize misses
Effect of Amount of Training Data
• 1 conversation = approx. 3 minutes
• Performance improves with added training data
• Effect is similar for both baseline and TalkPrint systems
• Intelligence applications are likely to keep adding training data
Effect of Amount of Test Data
• EER: false alarms = misses
• Combined system = Baseline + Dur
• Duration significantly aids performance
• Helps even at 10 seconds of test data
• Baseline seems to saturate at 2 minutes; duration keeps improving with test length
Summary & Conclusions (1)
• Automatic tagging of massive amounts of audio data for speaker, likely content, and anomalies can preprocess data for human analysts
• Conventional speaker recognition fails to capture beyond-the-frame behavioral patterns
• We find such behavioral patterns aid speaker recognition when added to a state-of-the-art baseline system (frame-based features)
• Useful TalkPrint features include both language and prosody
Summary & Conclusions (2)
• Language and prosody features complement both each other and the baseline features
• Both language and prosody features help despite nearly 40% of the words being wrong!
• Performance of TalkPrint features improves with both added training and added test data
• In contrast, baseline features appear to saturate after about 2 minutes of test data
Future Work
• Improve TalkPrint features
• Develop feature selection and fusion methods
• Investigate effect of various factors:
  • word error rate of the ASR system
  • noise (are TalkPrint features more robust?)
• Work with government to assess performance on relevant conversational data
• Extend approach to capture information about type of conversation and to detect anomalies