Robust Entropy-based Endpoint Detection for Speech Recognition in Noisy Environments 張智星 Jang@cs.nthu.edu.tw http://www.cs.nthu.edu.tw/~jang
Reference • Jialin Shen, Jeihweih Hung, Linshan Lee, “Robust entropy-based endpoint detection for speech recognition in noisy environments”, International Conference on Spoken Language Processing, Sydney, 1998
Summary • An entropy-based algorithm for accurate and robust endpoint detection for speech recognition in noisy environments • Outperforms energy-based algorithms in both detection accuracy and recognition performance • Error reduction: 16%
Motivation • Energy-based endpoint detection becomes less reliable in the presence of non-stationary noise and sound artifacts such as lip smacks, heavy breathing and mouth clicks. • Spectral entropy is effective in distinguishing speech segments from non-speech parts.
Spectral Entropy • PDF: • Normalization • Spectral entropy:
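A minimal sketch of the computation, assuming the usual definition of spectral entropy: the power spectrum s(f_i) of a frame is normalized into a probability mass function p_i = s(f_i) / Σ_k s(f_k), and the entropy is H = -Σ_i p_i log p_i. The function names `power_spectrum` and `spectral_entropy` are illustrative, the naive DFT is used only to keep the example dependency-free, and any band limits or other constraints used in the referenced paper are not modeled here.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of a real-valued frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_entropy(spectrum):
    """Entropy (in nats) of the spectrum after normalizing it to sum to 1."""
    total = sum(spectrum)
    if total <= 0:
        return 0.0  # silent frame: no spectral distribution to measure
    probs = [s / total for s in spectrum]
    # Bins with zero probability contribute nothing (p log p -> 0).
    return sum(-p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    n = 64
    tone = [math.cos(2 * math.pi * 4 * t / n) for t in range(n)]  # one spectral peak
    impulse = [1.0] + [0.0] * (n - 1)                             # flat spectrum
    print("tone:   ", spectral_entropy(power_spectrum(tone)))
    print("impulse:", spectral_entropy(power_spectrum(impulse)))
```

A frame whose energy sits in a single bin gives an entropy near zero, while a flat spectrum gives the maximum value, the logarithm of the number of bins.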
Properties of Entropy • Figure: entropy plots for N=2 and N=3 (entropyPlot.m)
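The standard properties of entropy can be checked numerically: it is zero when all probability sits on one outcome and reaches its maximum, log N, for the uniform distribution over N outcomes. The sketch below is not a port of entropyPlot.m; the names `entropy` and `binary_entropy_curve` are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a probability vector."""
    return sum(-p * math.log(p) for p in probs if p > 0)


def binary_entropy_curve(steps=10):
    """Entropy of (p, 1 - p) sampled at p = 0, 1/steps, ..., 1 (the N=2 case)."""
    return [(k / steps, entropy([k / steps, 1 - k / steps]))
            for k in range(steps + 1)]


if __name__ == "__main__":
    for p, h in binary_entropy_curve():
        print(f"p = {p:.1f}  H = {h:.4f}")
    print("uniform, N=3:", entropy([1 / 3] * 3), "log 3:", math.log(3))
```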
Entropy Weighting • A set of weighting factors can be applied: • These weighting factors are statistically estimated from a large collection of speech signals.
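One way such weights could enter the computation is as a per-bin multiplier on each entropy term, H_w = -Σ_i w_i p_i log p_i. This form is an assumption made for illustration, and the function name `weighted_spectral_entropy` and the example weights are hypothetical; the statistically estimated weights are not given in these slides.

```python
import math


def weighted_spectral_entropy(spectrum, weights):
    """Entropy with a per-bin weight on each term (assumed form, see lead-in)."""
    if len(spectrum) != len(weights):
        raise ValueError("spectrum and weights must have the same length")
    total = sum(spectrum)
    if total <= 0:
        return 0.0
    return sum(-w * (s / total) * math.log(s / total)
               for s, w in zip(spectrum, weights) if s > 0)


if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]
    print(weighted_spectral_entropy(flat, [1, 1, 1, 1]))  # unit weights: plain entropy
    print(weighted_spectral_entropy(flat, [0, 0, 1, 1]))  # only the upper bins count
```

With all weights equal to 1 this reduces to the unweighted spectral entropy.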
Endpoint Detection • The spectral entropy values are summed over a window of 20 frames, and the result is smoothed by a median filter • Thresholds are then applied to detect the beginning and ending boundaries of the embedded speech segments • A short period of background noise is taken as the reference for the initial boundary detection • Short speech segments (< 100 ms) are rejected.
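The steps above can be sketched as follows. Every parameter beyond the 20-frame window and the 100 ms minimum duration is an assumption: the median filter order (5), the frame shift (10 ms), the number of leading frames used as the noise reference (10), and the decision rule itself (a frame is active when its smoothed value deviates from the noise reference by more than a caller-supplied `margin`). All function names are illustrative.

```python
import statistics


def _pad(values, left, right):
    """Replicate the edge values so that windows near the ends stay full."""
    return [values[0]] * left + list(values) + [values[-1]] * right


def moving_sum(values, window=20):
    """Sum over a centered window of `window` frames."""
    left = window // 2
    padded = _pad(values, left, window - 1 - left)
    return [sum(padded[i:i + window]) for i in range(len(values))]


def median_filter(values, order=5):
    """Median over a centered window of `order` frames (order assumed odd)."""
    half = order // 2
    padded = _pad(values, half, half)
    return [statistics.median(padded[i:i + order]) for i in range(len(values))]


def find_segments(active):
    """Return (start, end) frame indices, end exclusive, for each run of True."""
    segments, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments


def detect_endpoints(entropies, margin, noise_frames=10, window=20, order=5,
                     frame_shift_ms=10, min_duration_ms=100):
    """Detect speech segments from per-frame spectral entropy values."""
    smoothed = median_filter(moving_sum(entropies, window), order)
    # Assumption: the first `noise_frames` frames contain background noise only.
    reference = statistics.fmean(smoothed[:noise_frames])
    active = [abs(s - reference) > margin for s in smoothed]
    min_frames = min_duration_ms / frame_shift_ms
    # Reject segments shorter than the minimum duration.
    return [(b, e) for b, e in find_segments(active) if e - b >= min_frames]


if __name__ == "__main__":
    # Synthetic entropy track: noise, a 40-frame speech-like region, noise.
    track = [1.0] * 30 + [0.2] * 40 + [1.0] * 30
    print(detect_endpoints(track, margin=4.4))
```

Because the decision is made on a 20-frame sum, the detected boundaries are only accurate to within roughly half a window; a practical detector would refine them, for example with separate thresholds for the beginning and ending boundaries.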
Experiment Settings • Speech database • Isolated Mandarin Chinese digits spoken by 100 speakers (10 speakers for testing, the remaining 90 for training) • Speech features: 12th-order MFCCs and 12th-order delta MFCCs • Models • Continuous-density HMMs • 6 states per digit, 3 mixtures per state
Experiment Settings • Noise • NOISEX-92 noise-in-speech database • White noise, pink noise, Volvo noise (car noise), F-16 noise, machine gun noise • Sound artifacts • Breath noise, cough noise and mouse click noise.
Something Not Clear… • What are the sample rate and bit resolution? • What are the frame size and overlap? • What is the order of the median filter? • How is the “short period of background noise” used? • What threshold values of spectral entropy are used to determine the boundaries? • What are the values of d1 and d2?