
Mixed Signals: Speech Activity Detection and Crosstalk in the Meetings Domain



  1. Mixed Signals: Speech Activity Detection and Crosstalk in the Meetings Domain Kofi A. Boakye International Computer Science Institute Speech Group Lunch Talk

  2. Overview • Motivation • Techniques • Meetings Domain • Crosstalk compensation • Initial Results and Modifications • Subsequent results • Development • Evaluation • Conclusions

  3. Motivation Audio signal contains isolated non-speech phenomena • Externally produced Ex’s: Car honking, door slamming, telephone ringing • Speaker produced Ex’s: Breathing, laughing, coughing • Non-production Ex’s: Pause, silence

  4. Motivation • Some of these can be dealt with by the recognizer • Explicit modeling • “Junk” model • Many cannot • The set of non-speaker-produced phenomena is too large, and each event too rare, for good modeling • Desire: prevent non-speech regions from being processed by the recognizer → Speech Activity Detection (SAD)

  5. Techniques Two Main Approaches • Threshold based • Decision performed according to one or more (possibly adaptive) thresholds • Method very sensitive to variations • Classifier based • Ex’s: Viterbi decoder, ANN, GMM • Method relies on general statistics rather than local information • Requires fairly intensive training

  6. Techniques Both threshold and classifier approaches typically make use of certain acoustic features • Energy • Fundamental component of many SADs • Generally lacks robustness to noise and impulsive interference • Zero-crossing rate • Successful as a correction term in energy-based systems • Harmonicity (e.g., via autocorrelation) • Relates to voicing • Performs poorly in unvoiced speech regions
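The energy and zero-crossing-rate features above are cheap to compute per frame. A minimal NumPy sketch (function names and the toy signals are mine, not from the talk):

```python
import numpy as np

def frame_energy(frames):
    """Short-time log energy per frame (frames: [n_frames, frame_len])."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def zero_crossing_rate(frames):
    """Fraction of sample-to-sample sign changes within each frame."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Toy check: a 200 Hz sinusoid (voiced-like) vs. low-level white noise
# (unvoiced/silence-like), both 20 ms frames at 8 kHz.
t = np.arange(160) / 8000.0
voiced = np.sin(2 * np.pi * 200 * t)
noise = np.random.default_rng(0).standard_normal(160) * 0.1
frames = np.stack([voiced, noise])
energies = frame_energy(frames)
zcrs = zero_crossing_rate(frames)
```

As the slide notes, energy alone confuses loud noise with speech, which is why ZCR is often used as a correction term: the noise frame here has low energy but a much higher crossing rate than the voiced frame.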

  7. Meetings Domain • With initiatives such as M4, AMI, and our own ICSI meeting recorder project, ASR in meetings is of strong interest • Objective: Determine who said what, when, using information from multiple sensors (mics)

  8. Meetings Domain • Sensors of interest: personal mics • Come as either headset or lapel units • Should be able to obtain fairly high-quality transcripts from these channels • Domain has certain complexities that make the task challenging, namely variability in 1) Number of speakers 2) Number, type, and location of sensors 3) Acoustic conditions

  9. Crosstalk • As a preprocessing step to ASR, SAD is also affected by these to varying levels • Key culprit in poor SAD performance: crosstalk • Example (waveform on slide contrasting target speech with crosstalk)

  10. Crosstalk compensation • Generate energy signals for each audio channel and subtract minimum energy from each • Minimum energy serves as “noise floor”

  11. Crosstalk compensation • Compute mean energy of non-target channels

  12. Crosstalk compensation • Subtract mean from target channel
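The three steps on slides 10–12 might be sketched as follows in NumPy, assuming per-channel frame log-energies have already been computed (all names here are mine, and this is an illustrative reconstruction, not the actual system code):

```python
import numpy as np

def crosstalk_compensate(log_energies, target):
    """Sketch of the slide 10-12 energy compensation.

    log_energies: [n_channels, n_frames] per-channel frame energies.
    Returns the floored target-channel energy minus the mean of the
    floored non-target channels.
    """
    # Slide 10: subtract each channel's minimum energy ("noise floor").
    floored = log_energies - log_energies.min(axis=1, keepdims=True)
    # Slide 11: mean energy of the non-target channels.
    others = np.delete(floored, target, axis=0)
    crosstalk = others.mean(axis=0)
    # Slide 12: subtract that mean from the target channel.
    return floored[target] - crosstalk
```

The intuition: when only the target speaker talks, the target channel's energy stands well above the non-target mean; when a far-field speaker talks, all channels rise together and the subtraction cancels the crosstalk.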

  13. Crosstalk compensation • Apply thresholds using a Schmitt trigger • Merge segments whose inter-segment pauses are shorter than a set threshold • Suppress segments with duration shorter than a set threshold • Apply head and tail collars to avoid “clipping” segments
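The thresholding and segment post-processing on this slide can be sketched in a few lines; a Schmitt trigger is simply hysteresis thresholding with a high "turn-on" and a low "turn-off" level. The threshold and duration values in the test below are placeholders, not the ones actually used:

```python
def schmitt_segments(signal, t_high, t_low):
    """Hysteresis thresholding: start a segment when the signal rises
    above t_high, end it when it falls back below t_low."""
    segs, start, active = [], 0, False
    for i, v in enumerate(signal):
        if not active and v > t_high:
            active, start = True, i
        elif active and v < t_low:
            segs.append((start, i))
            active = False
    if active:
        segs.append((start, len(signal)))
    return segs

def postprocess(segs, min_pause, min_dur, collar):
    """Merge close segments, drop short ones, then pad with collars."""
    merged = []
    for s, e in segs:
        if merged and s - merged[-1][1] < min_pause:
            merged[-1] = (merged[-1][0], e)  # absorb short pause
        else:
            merged.append((s, e))
    kept = [(s, e) for s, e in merged if e - s >= min_dur]
    return [(max(0, s - collar), e + collar) for s, e in kept]
```

The hysteresis gap between the two thresholds keeps the detector from chattering on energy values that hover around a single threshold, which is one of the sensitivity problems noted earlier for threshold-based methods.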

  14. Initial Results • Performance was examined for RT-04 Meetings development data • 10 minute excerpts from 8 meetings, 2 from each of • ICSI • CMU • LDC • NIST Note: CMU and LDC data obtained from lapel mics

  15. Initial Results • SRI baseline vs. my SAD (WER table shown on slide) • Verdict: sad results • Possible reason: sensitivity of thresholds

  16. Modification: Segment Intersection • Idea: System ideally should be generating segments from the target speaker only. By intersecting these segments with another SAD, we can filter out crosstalk and reduce insertion errors • Modified SAD to have zero threshold • Sensitivity needed to address deletions • False alarms addressed by intersection
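Intersecting two SAD outputs, each a sorted list of non-overlapping (start, end) segments, is a standard two-pointer sweep. An illustrative sketch (not the actual system code):

```python
def intersect_segments(a, b):
    """Keep only the time regions covered by BOTH segment lists.

    a, b: sorted, non-overlapping lists of (start, end) pairs.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:
            out.append((s, e))
        # Advance whichever segment ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out
```

Because a frame survives only if both detectors fire, each system's false alarms (insertions) are filtered by the other, while the zero-threshold setting keeps the energy-based SAD from contributing extra deletions.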

  17. Modification: Segment Intersection • SRI SAD • Two-class HMM using GMMs for speech and non-speech (states S and NS) • Regions merged and padded to satisfy constraints (min duration and min pause) • Constraints optimized for recognition accuracy
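Decoding a two-class HMM of this kind reduces to a two-state Viterbi pass over per-frame GMM log-likelihoods. A minimal sketch (the likelihood and transition values in the test are illustrative, and the real system's min-duration/min-pause constraints are applied as a separate merge-and-pad step):

```python
import numpy as np

def viterbi_two_state(loglik, log_trans):
    """Two-state (non-speech = 0, speech = 1) Viterbi decode.

    loglik: [n_frames, 2] per-frame log-likelihoods from the two GMMs.
    log_trans: [2, 2] log transition probabilities; a large self-loop
    probability discourages rapid switching between states.
    Returns the best state sequence.
    """
    n = len(loglik)
    delta = np.zeros((n, 2))          # best path score ending in each state
    back = np.zeros((n, 2), dtype=int)  # backpointers
    delta[0] = loglik[0]
    for t in range(1, n):
        scores = delta[t - 1][:, None] + log_trans  # [from_state, to_state]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + loglik[t]
    path = np.zeros(n, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

In the test, a single frame weakly favoring speech is smoothed away: the transition penalty for entering and leaving the speech state outweighs the one-frame emission gain, which is exactly the robustness a frame-by-frame threshold decision lacks.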

  18. New Results • SRI baseline vs. intersection SAD (WER table shown on slide) • Verdict: happy results

  19. New Results • SRI baseline vs. intersection SAD (table shown on slide) • Note that improvement comes largely from reduced insertions

  20. New Results • WER comparison shown on slide: SRI baseline, intersection SAD, hand segmentation

  21. New Results • Site-level breakdown (charts on slide): WERs and insertions

  22. Graphical Example • Segmentations shown on slide: SRI SAD, my SAD, intersection, hand segs

  23. Results: Eval04 • Applied 2004 Eval system to Eval04 data • 11 minute excerpts from 8 meetings, 2 from each of • ICSI • CMU • LDC • NIST Note: No lapel mics (with exception of 1 ICSI channel)

  24. Results: Eval04 • Applied 2004 Eval system to Eval04 data • WER comparison shown on slide: SRI baseline, intersection SAD, hand segmentation

  25. Results: Eval04 • Applied 2004 Eval system to Eval04 data • Further comparison shown on slide: SRI baseline, intersection SAD, hand segmentation

  26. Results: AMI Dev Data • Applied 2005 CTS (not meetings) system with AMI-adapted LM to AMI development data • WER comparison shown on slide: SRI baseline, intersection SAD, hand segmentation

  27. Results: AMI Dev Data • Applied 2005 CTS (not meetings) system with AMI-adapted LM to AMI development data • Further comparison shown on slide: SRI baseline, intersection SAD, hand segmentation

  28. Moment of Truth: Eval05 • ICSI System • SRI SAD • GMMs trained on 2004 training data for non-AMI meetings and 2005 AMI data for AMI meetings • Recognizer • Based on models from SRI’s RT-04F CTS system w/ Tandem/HATS MLP features • Adapted to meetings using ICSI, NIST, and AMI data • LMs trained on conversational speech, broadcast news, and web texts and adapted to meetings • Vocab consisted of 54K+ words, from CTS system and ICSI, CMU, NIST, and AMI training transcripts

  29. Moment of Truth: Eval05 • WER comparison shown on slide: SRI baseline, intersection SAD, hand segmentation !!! • Cf. AMI entry: 30.6 WER

  30. Moment of Truth: Eval05 • WER comparison shown on slide: SRI baseline, intersection SAD, hand segmentation

  31. Moment of Truth: Eval05 • Site-level breakdown (charts on slide): WERs and insertions

  32. Moment of Truth: Eval05 • One culprit: 3 NIST channels with no speech • Example shown on slide (un-mic’d speaker?): SRI SAD, my SAD, intersection, hand segs

  33. Conclusions • Crosstalk compensation is successful at reducing insertions while not adversely affecting deletions, resulting in lower WER • Demonstrates power of combining information sources • For 2005 Meeting Eval, gap between automatic and hand segments quite large • Initial analysis identifies zero-speech channels • Further analysis necessary

  34. Acknowledgments • Andreas Stolcke • Chuck Wooters • Adam Janin

  35. Fin
