
Ekapol Chuangsuwanich and James Glass

Robust Voice Activity Detector for Real World Applications Using Harmonicity and Modulation Frequency. Ekapol Chuangsuwanich and James Glass, MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA. 2012/07/2 汪逸婷.

Presentation Transcript


  1. Robust Voice Activity Detector for Real World Applications Using Harmonicity and Modulation Frequency • Ekapol Chuangsuwanich and James Glass • MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA • 2012/07/2 汪逸婷

  2. Outline • Introduction • Harmonicity • Modulation Frequency • Experiments and Discussions • Speech/Non-speech detection capabilities • Effect on ASR performance • Conclusion

  3. Introduction • Voice Activity Detection (VAD) is the process of identifying segments of speech in a continuous audio stream. • It is the first stage of a speech processing application. • It is used both to reduce computation, by eliminating unnecessary transmission and processing of non-speech segments, and to reduce potential misrecognition errors in such segments (a binary speech/non-speech decision). • In this paper, we consider the task of giving commands to an autonomous forklift.

  4. Introduction • In high quality recording conditions, energy-based methods perform well. • In noisy conditions, energy-based measures often produce a considerable number of false alarms. • A large variety of other features has been investigated for use in noisy environments. • These approaches typically require parameter tuning. • They have difficulty dealing with the non-stationary or instantaneous types of noise that are frequent in our work.
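
To make the false-alarm problem of energy-based detection concrete, here is a minimal sketch that is not taken from the paper. The function name `energy_vad`, the frame length, and the fixed threshold are illustrative assumptions. The toy signal shows that a loud non-speech burst passes the energy test just as easily as a voiced-like tone.

```python
import math

def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Mark a frame as speech (True) when its log energy exceeds a fixed threshold.

    frame_len and threshold_db are illustrative assumptions, not values from the paper.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        energy_db = 10.0 * math.log10(energy + 1e-12)  # small floor avoids log(0)
        decisions.append(energy_db > threshold_db)
    return decisions

fs = 8000
silence = [0.0] * 160
# A 200 Hz tone standing in for voiced speech.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(160)]
# A loud alternating burst standing in for impulsive non-speech noise.
burst = [0.5 if n % 2 == 0 else -0.5 for n in range(160)]

labels = energy_vad(silence + tone + burst)
print(labels)  # the burst frame is accepted as "speech": a false alarm
```

Energy alone cannot separate the second and third frames, which is the motivation for the additional features discussed in the following slides.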

  5. Introduction • Harmonicity is a basic property of any periodic signal, so it is not useful by itself. • Some works show good results even at very low SNR conditions. • Modulation frequencies (MF) measure the temporal rate of change of energy across different frequency bands. • Some studies used a purely MF-inspired set of features for discriminating between speech and non-speech, which generalized well to noise types not included in training.

  6. Harmonicity • Figure 1: Example of distant speech from the forklift database.

  7. Harmonicity • Autocorrelation: computed as a function of the lag. • The periodicity measure will yield a high value for pure tones (use band-pass cepstral liftering).
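
The autocorrelation-based periodicity measure can be sketched as follows. This is an illustration under assumptions, not the formula from the paper: it uses a normalized autocorrelation, takes the peak over an assumed pitch range of 80 to 400 Hz, and omits the cepstral liftering step. The function names are hypothetical.

```python
import math
import random

def normalized_autocorrelation(frame, lag):
    """Autocorrelation at one lag, normalized by the energies of the two overlapping segments."""
    n = len(frame) - lag
    a, b = frame[:n], frame[lag:]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def harmonicity(frame, fs, f_min=80.0, f_max=400.0):
    """Peak normalized autocorrelation over lags in an assumed pitch range (f_min..f_max)."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    return max(normalized_autocorrelation(frame, lag) for lag in range(lag_min, lag_max + 1))

fs, n_samples = 8000, 400
# Voiced-like signal: five harmonics of a 100 Hz fundamental.
voiced = [sum(math.sin(2 * math.pi * 100 * k * n / fs) / k for k in range(1, 6))
          for n in range(n_samples)]
# A pure 1 kHz tone, which is periodic but not speech.
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(n_samples)]
# White noise, which has no periodic structure.
random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(n_samples)]

h_voiced = harmonicity(voiced, fs)
h_tone = harmonicity(tone, fs)
h_noise = harmonicity(noise, fs)
```

The voiced-like signal and the pure tone both score close to 1, while noise scores much lower. The tone result reproduces the weakness noted on the slide: a raw periodicity measure cannot tell a pure tone from voiced speech.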

  8. Harmonicity

  9. Modulation Frequency
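
Slide 5 describes modulation frequency as the temporal rate of change of energy in a frequency band. A minimal sketch of that idea, under simplifying assumptions: a single broadband energy envelope instead of several bands, a plain DFT, and hypothetical function names. It is not the feature extraction used in the paper.

```python
import cmath
import math

def frame_energies(samples, frame_len):
    """Energy envelope: mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def modulation_spectrum(envelope, frame_rate):
    """Magnitude DFT of the mean-removed envelope as (modulation frequency in Hz, magnitude) pairs."""
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [e - mean for e in envelope]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centered))
        spectrum.append((k * frame_rate / n, abs(coeff)))
    return spectrum

# Toy input: a 1 kHz carrier whose amplitude is modulated at 4 Hz.
fs, frame_len = 8000, 80  # 10 ms frames, so the envelope is sampled at 100 Hz
signal = [
    (1.0 + 0.8 * math.sin(2 * math.pi * 4 * n / fs)) * math.sin(2 * math.pi * 1000 * n / fs)
    for n in range(2 * fs)  # 2 seconds of audio
]
envelope = frame_energies(signal, frame_len)
spectrum = modulation_spectrum(envelope, frame_rate=fs / frame_len)
peak_hz = max(spectrum, key=lambda pair: pair[1])[0]
```

Here `peak_hz` recovers the 4 Hz modulation rate of the toy signal. A multi-band version would apply the same envelope analysis to each band of a filterbank separately.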

  10. Experiments-Speech/Non-speech detection capabilities • The database consists of speech commands from 26 subjects, with added noise to simulate a variety of SNR values ranging from -5 to 15 dB. • Recorded with an array microphone. • Classification was performed without any additional post-processing. • Transitions between speech and non-speech were excluded. • Training data: 4 min of speech and 22 min of non-speech.
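
Adding noise to reach a target SNR, as in the database described above, amounts to scaling the noise before mixing. The following is a sketch under assumptions: the function names, the synthetic signals, and the use of average power over the whole signal are illustrative choices, not details from the paper.

```python
import math
import random

def power(samples):
    """Average power of a signal."""
    return sum(x * x for x in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it to speech."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

random.seed(0)
speech = [math.sin(2 * math.pi * 150 * n / 8000) for n in range(8000)]  # stand-in for speech
noise = [random.gauss(0.0, 1.0) for _ in range(8000)]                   # stand-in for recorded noise

mixed = mix_at_snr(speech, noise, snr_db=-5.0)

# Check the result: recover the scaled noise and measure the SNR that was actually achieved.
residual = [m - s for m, s in zip(mixed, speech)]
achieved_snr_db = 10.0 * math.log10(power(speech) / power(residual))
```

At -5 dB, the lowest value in the stated range, the scaled noise carries roughly three times the power of the speech.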

  11. Experiments-Speech/Non-speech detection capabilities • For comparison, results are included for Relative Spectral Entropy (RSE), Long-Term Signal Variability (LTSV), and statistical model-based VADs using MFCCs and MF as features for Gaussian mixture models (GMMs). • EER: Equal error rate • FAR: False alarm rate

  12. Experiments-Speech/Non-speech detection capabilities • Performance varied with the specific kind of noise.

  13. Experiments-Speech/Non-speech detection capabilities • EER and FAR for each noise type are usually obtained at different thresholds.
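
The two metrics defined in slide 11 both depend on the decision threshold. A minimal sketch of how FAR, miss rate, and EER can be computed from frame scores follows; the scores and labels are made-up toy data and the function names are hypothetical.

```python
def error_rates(scores, labels, threshold):
    """Return (false alarm rate, miss rate) when frames with score >= threshold are called speech."""
    false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    misses = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    n_nonspeech = sum(1 for y in labels if not y)
    n_speech = sum(1 for y in labels if y)
    return false_alarms / n_nonspeech, misses / n_speech

def equal_error_rate(scores, labels):
    """Sweep thresholds over the observed scores; return (EER estimate, threshold)
    at the point where the false alarm rate and the miss rate are closest."""
    best = None
    for threshold in sorted(set(scores)):
        far, miss = error_rates(scores, labels, threshold)
        gap = abs(far - miss)
        if best is None or gap < best[0]:
            best = (gap, (far + miss) / 2.0, threshold)
    return best[1], best[2]

# Toy data: five non-speech frames followed by five speech frames (True = speech).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8, 0.9, 0.95]
labels = [False, False, False, False, False, True, True, True, True, True]

eer, eer_threshold = equal_error_rate(scores, labels)
far_at_low_threshold, miss_at_low_threshold = error_rates(scores, labels, 0.35)
```

On this toy data the two error rates are equal (0.2 each) at a threshold of 0.6, whereas a threshold of 0.35 gives no misses but a false alarm rate of 0.4. Because each noise type has its own score distribution, the threshold that yields the EER generally differs from one noise type to the next.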

  14. Experiments-Effect on ASR performance • The SNR values ranged from 5 to 25 dB. • Corpus: four microphone channels, 10 hours long, 400 command words. • Commands are sparse; the forklift is mostly idle, waiting for commands or taking the time to execute commands. • ASR was performed using PocketSUMMIT with a vocabulary size of 57 words; out-of-domain (OOD) commands were modeled by a single GMM. • The ASR model was trained on over 3600 utterances of commands from 18 talkers.

  15. Experiments-Effect on ASR performance • Examined the influence of different VAD systems on the ASR results in the task of commanding an autonomous forklift in real world environments.

  16. Conclusion • Described the task of VAD on distant speech in low SNR environments for an autonomous robotic forklift. • Designed a two-stage approach for speech/non-speech classification. • The parallel SVM outperformed classification based on the whole MF spectrum. • Combining MF with a simple harmonicity measure helped reduce the false alarm rate by another 9% at low miss rates. • In ASR, the proposed VAD outperformed standard VADs and achieved a WER very close to that of hand-labeled endpoints.
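
For reference, word error rate (WER) is conventionally the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. A minimal sketch follows; the example commands are hypothetical and not drawn from the forklift corpus.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                 # deletion
                dp[i][j - 1] + 1,                 # insertion
                dp[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dp[-1][-1] / len(ref)

# Hypothetical command: one word dropped and one spurious word appended by the recognizer.
wer = word_error_rate("forklift pick up the pallet", "forklift pick the pallet now")
```

Here one deletion and one insertion against five reference words give a WER of 0.4. Spurious words recognized in non-speech segments that a VAD lets through count as insertions, which is how VAD false alarms raise WER.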
