Endpoint Detection (端點偵測) • Jyh-Shing Roger Jang (張智星) • http://mirlab.org/jang • MIR Lab, CSIE Dept., National Taiwan Univ., Taiwan
Intro. to Endpoint Detection • Endpoint detection (EPD, 端點偵測) • Goal: Determine the start and end of voice activity • Also known as voice activity detection (VAD) • Importance • Acts as a preprocessing step for many recognition tasks • Should require as little computing power as possible • Operation scenarios for speech recognition • Off-line for “push-to-talk” • On-line for “continuous listening”
Two Types of Approaches to EPD • Time-domain methods • Volume • ZCR (zero crossing rate) • HOD (high-order difference) • Frequency-domain methods • Variance of spectrum • Entropy of spectrum
Typical Frameworks for EPD • Thresholding • Simple thresholding • Compute a feature (e.g., volume) from each frame • Select a threshold vth • Any frame with a volume higher than vth is considered positive • Combined thresholding • Use two features (e.g., volume and ZCR) for more elaborate decision making • Classification • Use more than one feature • Perform binary classification • Negative: silence or noise • Positive: sound activity • Sequence alignment • Use hidden Markov models (HMM) for sequence alignment
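The simple-thresholding steps can be sketched as follows. This is a minimal illustration, not the code from the deck's examples: the function names, the non-overlapping framing, and the definition of volume as the sum of absolute sample values after removing the frame mean are all assumptions made here.

```python
def frame_volumes(signal, frame_size, hop):
    """Volume per frame (assumed definition): sum of absolute sample
    values after subtracting the frame mean."""
    volumes = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        mean = sum(frame) / frame_size
        volumes.append(sum(abs(x - mean) for x in frame))
    return volumes


def simple_threshold_epd(volumes, v_th):
    """Return (first, last) indices of frames whose volume exceeds v_th,
    or None when no frame is positive."""
    positive = [i for i, v in enumerate(volumes) if v > v_th]
    if not positive:
        return None
    return positive[0], positive[-1]


# Toy signal: silence, a loud alternating burst, then silence again.
signal = [0.0] * 40 + [1.0, -1.0] * 20 + [0.0] * 40
volumes = frame_volumes(signal, frame_size=10, hop=10)
endpoints = simple_threshold_epd(volumes, v_th=1.0)  # (4, 7)
```

Only the first and last positive frames are reported, so a brief dip below vth inside an utterance does not split it; a real system would typically also handle multiple segments.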
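Performance Evaluation for EPD • Types of errors • False rejection: positive → negative (a sound-activity frame is labeled as silence or noise) • False acceptance: negative → positive (a silence or noise frame is labeled as sound activity) • Performance evaluation • Start & end position accuracy • Frame-based accuracy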
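Frame-based evaluation can be illustrated with a short sketch that counts both error types from per-frame labels. The function name and the 0/1 label encoding (1 = sound activity) are assumptions for illustration.

```python
def frame_errors(truth, predicted):
    """Count false rejections and false acceptances and compute the
    frame-based accuracy from two equal-length lists of 0/1 labels."""
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(truth, predicted))
    false_rejection = sum(1 for t, p in pairs if t and not p)
    false_acceptance = sum(1 for t, p in pairs if not t and p)
    correct = len(pairs) - false_rejection - false_acceptance
    return {
        "false_rejection": false_rejection,
        "false_acceptance": false_acceptance,
        "accuracy": correct / len(pairs),
    }


truth = [0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0]
report = frame_errors(truth, predicted)
# One false acceptance (frame 1), one false rejection (frame 5), accuracy 0.75.
```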
EPD by Volume Only • The simplest method for EPD • Four intuitive ways to select vth • vth = vmax*α • vth = vmedian*β • vth = vmin*γ • vth = v1*δ (v1 is the volume of the first frame)
EPD by Volume Only (II) • Unfortunately… • Most of these thresholds fail one way or another. • Dataset-based fine-tuning of α, β, γ, δ is always advisable. • Under what situations do they fail? • Plosive sounds (a short burst inflates vmax) • Silence too long (vmedian falls within the silence) • Total-zero frames (vmin becomes 0) • Unstable frame 1 (v1 is unreliable)
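A small sketch makes one of these failure cases concrete. The function name, the toy volume values, and the four scaling factors are illustrative assumptions, not tuned values from the deck.

```python
import statistics


def candidate_thresholds(volumes, alpha, beta, gamma, delta):
    """Compute the four candidate volume thresholds from per-frame volumes."""
    return {
        "max": max(volumes) * alpha,
        "median": statistics.median(volumes) * beta,
        "min": min(volumes) * gamma,
        "first": volumes[0] * delta,
    }


# The first frame is a total-zero frame (e.g., digital silence).
volumes = [0.0, 0.2, 0.3, 5.0, 6.0, 5.5, 0.3, 0.2]
thresholds = candidate_thresholds(volumes, alpha=0.1, beta=2.0, gamma=10.0, delta=5.0)
```

Here the min-based and first-frame-based thresholds both collapse to zero, so every frame with any energy at all would be marked positive, whatever scaling factor is chosen.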
EPD by Volume Only (III) • A presumably better way to select vth • vlower = 3rd percentile of the frame volumes sorted in ascending order • vupper = 97th percentile of the frame volumes sorted in ascending order • vth = (vupper-vlower)*k+vlower • Why do we need to use percentiles? • To deal with plosive sounds • To deal with total-zero frames • Does it fail? Yes, still, in certain situations…
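The percentile-based rule can be sketched as below. The nearest-rank percentile helper, the function names, and the value k = 0.1 are assumptions for illustration; other percentile conventions give slightly different results.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (one of several conventions)."""
    index = round(pct / 100 * (len(sorted_values) - 1))
    return sorted_values[index]


def percentile_threshold(volumes, k, lower_pct=3, upper_pct=97):
    """vth = (vupper - vlower) * k + vlower, using percentiles of the volumes."""
    ordered = sorted(volumes)
    v_lower = percentile(ordered, lower_pct)
    v_upper = percentile(ordered, upper_pct)
    return (v_upper - v_lower) * k + v_lower


# 100 frames: one total-zero frame, background noise, speech, one plosive burst.
volumes = [0.0] + [1.0] * 59 + [10.0] * 39 + [100.0]
v_th = percentile_threshold(volumes, k=0.1)
```

With these toy values the percentiles land on the background level (1.0) and the speech level (10.0), so the single zero frame and the single burst do not influence the threshold, which comes out at about 1.9.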
Example: EPD by Volume • epdByVol01.m
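EPD by Volume and ZCR • Determine the initial endpoints using the high-volume threshold (tu) • Extend the endpoints outward in both directions to the low-volume threshold (tl) • Then extend the endpoints outward again to the zero-crossing-rate threshold (tzc)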
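One possible reading of these three steps is sketched below for a single utterance. It is not epdByVolZcr01.m: the function names, the sign-change definition of ZCR, and the choice to keep extending while the neighboring frame stays above each threshold are assumptions.

```python
def zero_crossing_rate(frame):
    """Number of sign changes between consecutive samples in a frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)


def epd_by_vol_zcr(volumes, zcrs, t_u, t_l, t_zc):
    """Return (start, end) frame indices of one utterance, or None."""
    # Step 1: initial endpoints from the high-volume threshold t_u.
    above = [i for i, v in enumerate(volumes) if v > t_u]
    if not above:
        return None
    start, end = above[0], above[-1]
    # Step 2: extend outward while the neighbor's volume stays above t_l.
    while start > 0 and volumes[start - 1] > t_l:
        start -= 1
    while end < len(volumes) - 1 and volumes[end + 1] > t_l:
        end += 1
    # Step 3: extend outward while the neighbor's ZCR stays above t_zc,
    # which picks up quiet but noisy-sounding unvoiced frames.
    while start > 0 and zcrs[start - 1] > t_zc:
        start -= 1
    while end < len(zcrs) - 1 and zcrs[end + 1] > t_zc:
        end += 1
    return start, end


volumes = [0.1, 0.2, 0.3, 2.0, 8.0, 9.0, 3.0, 0.2, 0.1]
zcrs = [2, 3, 30, 10, 8, 9, 12, 3, 2]
endpoints = epd_by_vol_zcr(volumes, zcrs, t_u=5.0, t_l=1.0, t_zc=20)  # (2, 6)
```

In this toy input, frame 2 has low volume but a high ZCR, as an unvoiced consonant might, and only the third step brings it inside the detected segment.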
Example: EPD by Volume and ZCR • epdByVolZcr01.m
EPD by Volume and HOD • How to detect unvoiced sounds reliably? • ZCR • High-order difference (HOD) of a frame s • Order-1 HOD = sum(abs(diff(s))) • Order-2 HOD = sum(abs(diff(diff(s)))) • Order-3 HOD = sum(abs(diff(diff(diff(s))))) • …
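The HOD expressions translate directly into a few lines of Python. The helper names are assumptions; `diff` mirrors the consecutive-difference operation used in the expressions above.

```python
def diff(values):
    """Differences between consecutive elements (one element shorter)."""
    return [b - a for a, b in zip(values, values[1:])]


def high_order_difference(frame, order):
    """Sum of absolute values of the order-th difference of a frame."""
    for _ in range(order):
        frame = diff(frame)
    return sum(abs(x) for x in frame)


fast = [0, 1, 0, -1, 0, 1, 0, -1]  # rapidly oscillating samples
slow = [0, 1, 2, 3, 4, 5, 6, 7]    # slowly varying ramp

fast_hod = [high_order_difference(fast, n) for n in (1, 2)]  # [7, 6]
slow_hod = [high_order_difference(slow, n) for n in (1, 2)]  # [7, 0]
```

Both toy frames have the same order-1 HOD, but the order-2 HOD stays large only for the rapidly oscillating one, which suggests why higher orders help separate noise-like unvoiced sounds from slowly varying content.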
Example: EPD by Vol. and HOD • highOrderDiff01.m
Example: EPD by Vol. and HOD (II) • epdByVolHod01.m
Example: EPD by Vol. and HOD (III) • A hard example: epdByVolHod02.m
EPD by Spectrum • epdBySpectrum01.m
How to Aggregate the Spectrum? • To aggregate the spectrum of a frame into a single feature for EPD • Entropy function • Ratio of geometric mean to arithmetic mean
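The second aggregation, geometric mean over arithmetic mean, can be sketched as below. The function name and the small `eps` added to avoid taking the logarithm of zero are assumptions; the input is assumed to be a list of non-negative spectral power values for one frame.

```python
import math


def geometric_over_arithmetic(power_spectrum, eps=1e-12):
    """Ratio of geometric mean to arithmetic mean of a power spectrum.
    eps (an assumption) keeps the logarithm finite for zero-valued bins."""
    values = [p + eps for p in power_spectrum]
    log_mean = sum(math.log(v) for v in values) / len(values)
    geometric_mean = math.exp(log_mean)
    arithmetic_mean = sum(values) / len(values)
    return geometric_mean / arithmetic_mean


flat = geometric_over_arithmetic([1.0, 1.0, 1.0, 1.0])   # close to 1
peaky = geometric_over_arithmetic([4.0, 0.0, 0.0, 0.0])  # close to 0
```

The ratio is near 1 when energy is spread evenly across frequency bins and near 0 when it is concentrated in a few bins, so one number summarizes how peaked a frame's spectrum is.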
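Spectral Entropy • PDF: $p_i = s(f_i) / \sum_{k=1}^{N} s(f_k)$, $i = 1, \dots, N$, where $s(f_i)$ is the spectral energy at frequency $f_i$ • Normalization: $\sum_{i=1}^{N} p_i = 1$ • Spectral entropy: $H = -\sum_{i=1}^{N} p_i \log p_i$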
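A minimal sketch of spectral entropy, assuming the standard definition: normalize the spectral energies so they sum to 1, then take the Shannon entropy. The function name, the natural logarithm, and the convention of returning 0 for a total-zero frame are assumptions.

```python
import math


def spectral_entropy(power_spectrum):
    """Shannon entropy (natural log) of a spectrum normalized to sum to 1."""
    total = sum(power_spectrum)
    if total == 0:
        return 0.0  # assumed convention for a total-zero frame
    probabilities = [p / total for p in power_spectrum]
    # Bins with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probabilities if p > 0)


uniform = spectral_entropy([1.0, 1.0, 1.0, 1.0])       # log(4), the maximum for N = 4
concentrated = spectral_entropy([4.0, 0.0, 0.0, 0.0])  # 0, the minimum
```

Entropy is largest for a flat spectrum and smallest when all energy sits in one bin, so frames with a strongly peaked spectrum yield low values.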
Properties of Entropy • entropyPlot.m (entropy plots for N=2 and N=3)
References • References for EPD • Jialin Shen, Jeihweih Hung, Linshan Lee, “Robust entropy-based endpoint detection for speech recognition in noisy environments”, International Conference on Spoken Language Processing, Sydney, 1998