Endpoint Detection (端點偵測)

Jyh-Shing Roger Jang (張智星)
http://mirlab.org/jang
MIR Lab, CSIE Dept., National Taiwan Univ., Taiwan

Intro to Endpoint Detection
• Endpoint detection (EPD, 端點偵測)
  • Goal: Determine the start and end of voice activity
  • Also known as voice activity detection (VAD)
• Importance
  • Acts as preprocessing for many recognition tasks
  • Should require as little computing power as possible
• Operation scenarios for speech recognition
  • Off-line for “push to talk”
  • On-line for “continuously listening”

Two Types of Approaches to EPD
• Time-domain methods
  • Volume
  • ZCR (zero crossing rate)
  • HOD (high-order difference)
• Frequency-domain methods
  • Variance of spectrum
  • Entropy of spectrum

Typical Frameworks for EPD
• Thresholding
  • Simple thresholding
    • Compute a feature (e.g., volume) from each frame
    • Select a threshold vth
    • Any frame with a volume higher than vth is considered positive
  • Combined thresholding
    • Use two features (e.g., volume and ZCR) for more complex decision making
• Classification
  • Take more than one feature
  • Perform binary classification
    • Negative → silence or noise
    • Positive → sound activity
• Sequence alignment
  • Use hidden Markov models (HMM) for sequence alignment
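The simple-thresholding steps can be sketched as follows. This is a minimal illustration, not the course's MATLAB code. It assumes that volume is the sum of absolute sample values in a frame after removing the frame mean, and that frames do not overlap. The names `frame_volumes` and `simple_threshold_epd`, the frame size, and the threshold value are all hypothetical.

```python
def frame_volumes(signal, frame_size):
    """Split the signal into non-overlapping frames and return one volume per frame.

    Assumption: volume = sum of absolute amplitudes after subtracting the
    frame mean. Other definitions (e.g., log energy) are also in use.
    """
    volumes = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        mean = sum(frame) / frame_size
        volumes.append(sum(abs(x - mean) for x in frame))
    return volumes


def simple_threshold_epd(volumes, vth):
    """Mark a frame as positive (True) when its volume is higher than vth."""
    return [v > vth for v in volumes]


# Toy signal: silence, a short burst of activity, then silence again.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
volumes = frame_volumes(signal, frame_size=4)
labels = simple_threshold_epd(volumes, vth=1.0)
print(volumes)
print(labels)
```

Only the two frames that cover the burst are marked positive. In practice the threshold would be derived from the recording rather than fixed by hand.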

Performance Evaluation for EPD
• Types of errors
  • False rejection: positive → negative
  • False acceptance: negative → positive
• Performance evaluation
  • Start & end position accuracy
  • Frame-based accuracy
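Frame-based evaluation can be sketched as a comparison of per-frame labels against a hand-labeled reference. The function name `frame_errors` and the toy label sequences are hypothetical. The sketch assumes both sequences are aligned frame by frame.

```python
def frame_errors(truth, predicted):
    """Count false rejections, false acceptances, and frame-based accuracy.

    truth and predicted are equal-length lists of booleans
    (True = sound activity, False = silence or noise).
    """
    if len(truth) != len(predicted) or not truth:
        raise ValueError("label sequences must be non-empty and equally long")
    # False rejection: a positive frame that was labeled negative.
    false_rejection = sum(1 for t, p in zip(truth, predicted) if t and not p)
    # False acceptance: a negative frame that was labeled positive.
    false_acceptance = sum(1 for t, p in zip(truth, predicted) if not t and p)
    accuracy = 1 - (false_rejection + false_acceptance) / len(truth)
    return false_rejection, false_acceptance, accuracy


truth = [False, False, True, True, True, False, False, False]
predicted = [False, True, True, True, False, False, False, False]
fr, fa, acc = frame_errors(truth, predicted)
print(fr, fa, acc)
```

Start and end position accuracy would instead compare the detected boundary indices with the reference boundaries, typically within some tolerance.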

EPD by Volume Only
• The simplest method for EPD
• Four intuitive ways to select vth
  • vth = vmax*α
  • vth = vmedian*β
  • vth = vmin*γ
  • vth = v1*δ (v1: volume of the first frame)

EPD by Volume Only (II)
• Unfortunately…
  • Most of the thresholds fail one way or another.
  • Dataset-based fine-tuning of α, β, γ, δ is always advisable.
• Under what situations do they fail?
  • Plosive sounds
  • Silence too long
  • Total-zero frame
  • Unstable first frame
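A small sketch makes one of these failure cases concrete. It computes the four candidate thresholds for a toy list of frame volumes whose first frame is all zeros. The function name `candidate_thresholds`, the coefficient values (called `alpha`, `beta`, `gamma`, `delta` in the code), and the volumes are hypothetical. The sketch also assumes that v1 denotes the volume of the first frame.

```python
import statistics


def candidate_thresholds(volumes, alpha, beta, gamma, delta):
    """Return the four volume thresholds, keyed by the rule that produced them."""
    return {
        "vmax": max(volumes) * alpha,
        "vmedian": statistics.median(volumes) * beta,
        "vmin": min(volumes) * gamma,
        # Assumption: v1 is the volume of the first frame.
        "v1": volumes[0] * delta,
    }


# Toy volumes: the first frame is a total-zero frame.
volumes = [0.0, 0.2, 0.3, 5.0, 6.0, 5.5, 0.3, 0.2]
th = candidate_thresholds(volumes, alpha=0.1, beta=2.0, gamma=10.0, delta=5.0)
for name, value in th.items():
    print(name, value)
```

With a total-zero frame present, the vmin-based and v1-based thresholds collapse to zero, so every frame with any non-zero volume would be marked as sound regardless of the coefficient.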

EPD by Volume Only (III)
• A presumably better way to select vth
  • vlower = 3rd percentile of the volumes sorted in ascending order
  • vupper = 97th percentile of the volumes sorted in ascending order
  • vth = (vupper-vlower)*k+vlower
• Why do we need to use percentiles?
  • To deal with plosive sounds
  • To deal with total-zero frames
• Does it fail? Yes, still, in certain situations…
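The percentile-based rule can be sketched as below. The function name `percentile_threshold`, the value of k, and the toy volumes are hypothetical. The way a percentile is mapped to an index in the sorted list is also an assumption, since several percentile conventions exist.

```python
def percentile_threshold(volumes, k, lower_pct=3, upper_pct=97):
    """Compute vth = (vupper - vlower) * k + vlower from volume percentiles."""
    ordered = sorted(volumes)
    n = len(ordered)
    # Assumed convention: round the fractional rank to the nearest index.
    v_lower = ordered[round((n - 1) * lower_pct / 100)]
    v_upper = ordered[round((n - 1) * upper_pct / 100)]
    return (v_upper - v_lower) * k + v_lower


# Toy volumes: one total-zero frame, 60 background frames,
# 38 speech frames, and one very loud plosive-like spike.
volumes = [0.0] + [1.0] * 60 + [10.0] * 38 + [100.0]
vth = percentile_threshold(volumes, k=0.1)
max_based = max(volumes) * 0.1
print(vth, max_based)
```

In this toy case the percentiles skip both the zero frame and the spike, so the threshold lands just above the background level. A vmax-based threshold with the same factor lands at about 10, which is the level of the speech frames themselves.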

Example: EPD by Volume • epdByVol01.m

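EPD by Volume and ZCR
• Use the high volume threshold (tu) as the criterion to determine the initial endpoints
• Extend the endpoints outward, before the start and after the end, to where the volume reaches the low volume threshold (tl)
• Then extend the endpoints further outward to where the zero crossing rate reaches its threshold (tzc)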
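The three steps can be sketched as follows. This is a simplified illustration that detects a single segment per recording and works on precomputed per-frame volumes and zero crossing rates. The function names `zero_crossing_rate` and `epd_by_vol_zcr`, the threshold values, and the toy data are hypothetical and are not taken from epdByVolZcr01.m.

```python
def zero_crossing_rate(frame):
    """Count sign changes between consecutive samples of a frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)


def epd_by_vol_zcr(volumes, zcrs, tu, tl, tzc):
    """Return (start, end) frame indices of one sound segment, or None."""
    # Step 1: initial endpoints from frames whose volume exceeds tu.
    above = [i for i, v in enumerate(volumes) if v > tu]
    if not above:
        return None
    start, end = above[0], above[-1]
    last = len(volumes) - 1
    # Step 2: extend outward while the neighboring frame is still above tl.
    while start > 0 and volumes[start - 1] > tl:
        start -= 1
    while end < last and volumes[end + 1] > tl:
        end += 1
    # Step 3: extend outward while the neighboring frame's ZCR is above tzc.
    while start > 0 and zcrs[start - 1] > tzc:
        start -= 1
    while end < last and zcrs[end + 1] > tzc:
        end += 1
    return start, end


volumes = [0.1, 0.2, 0.8, 3.0, 4.0, 0.9, 0.2, 0.1]
zcrs = [1, 9, 3, 2, 2, 3, 8, 1]
segment = epd_by_vol_zcr(volumes, zcrs, tu=2.0, tl=0.5, tzc=5)
print(segment)
```

In the toy data, frames 1 and 6 are quiet but have a high zero crossing rate, as an unvoiced consonant might, and only the third step pulls them into the segment.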

Example: EPD by Volume and ZCR • epdByVolZcr01.m

EPD by Volume and HOD
• How to detect unvoiced sounds reliably?
  • ZCR
  • High-order difference (HOD)
    • Order-1 HOD = sum(abs(diff(s)))
    • Order-2 HOD = sum(abs(diff(diff(s))))
    • Order-3 HOD = sum(abs(diff(diff(diff(s)))))
    • …
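The HOD expressions translate directly into a short function. The sketch below mirrors `sum(abs(diff(...)))` in plain Python. The helper names `diff` and `high_order_difference` and the toy frames are hypothetical.

```python
def diff(samples):
    """First-order difference: each sample minus the previous one."""
    return [b - a for a, b in zip(samples, samples[1:])]


def high_order_difference(samples, order):
    """Sum of absolute values after differencing the frame `order` times."""
    if order < 1:
        raise ValueError("order must be at least 1")
    d = list(samples)
    for _ in range(order):
        d = diff(d)
    return sum(abs(x) for x in d)


oscillating = [0, 1, 0, -1, 0, 1, 0, -1]  # fast alternation
ramp = [0, 1, 2, 3, 4, 5, 6, 7]           # slow, steady change
for order in (1, 2, 3):
    print(order,
          high_order_difference(oscillating, order),
          high_order_difference(ramp, order))
```

Both toy frames have the same order-1 HOD, but from order 2 onward the slowly varying ramp drops to zero while the rapidly oscillating frame stays large.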

Example: EPD by Vol. and HOD • highOrderDiff01.m

Example: EPD by Vol. and HOD (II) • epdByVolHod01.m

Example: EPD by Vol. and HOD (III) • A hard example: epdByVolHod02.m

EPD by Spectrum • epdBySpectrum01.m

How to Aggregate Spectrum?
• To aggregate spectrum for EPD
  • Entropy function
  • Geometric mean over arithmetic mean
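The ratio of geometric mean to arithmetic mean can be sketched as below. The function name `geometric_over_arithmetic`, the small floor `eps` that avoids taking the logarithm of zero, and the toy spectra are hypothetical. The input is assumed to be a non-negative magnitude or power spectrum of one frame.

```python
import math


def geometric_over_arithmetic(spectrum, eps=1e-12):
    """Ratio of geometric mean to arithmetic mean of a non-negative spectrum."""
    # Assumption: clamp tiny values so that log() is always defined.
    values = [max(x, eps) for x in spectrum]
    log_mean = sum(math.log(x) for x in values) / len(values)
    arithmetic_mean = sum(values) / len(values)
    return math.exp(log_mean) / arithmetic_mean


flat = [1.0, 1.0, 1.0, 1.0]       # energy spread evenly over the bins
peaked = [4.0, 0.01, 0.01, 0.01]  # energy concentrated in one bin
ratio_flat = geometric_over_arithmetic(flat)
ratio_peaked = geometric_over_arithmetic(peaked)
print(ratio_flat, ratio_peaked)
```

The ratio is close to 1 for a flat spectrum and approaches 0 as the energy concentrates in a few bins, so it reduces the whole spectrum of a frame to one number that can be thresholded.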

Spectral Entropy • PDF: • Normalization • Spectral entropy:
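A common definition of spectral entropy, assumed here, first normalizes the spectrum of a frame into a probability distribution, p_i = s(f_i) / Σ_k s(f_k), so that the p_i sum to 1, and then computes H = -Σ_i p_i log p_i. The sketch below follows that definition. The function name `spectral_entropy`, the choice of the natural logarithm, and the handling of an all-zero frame are hypothetical choices.

```python
import math


def spectral_entropy(spectrum):
    """Entropy of a non-negative spectrum after normalizing it to sum to 1."""
    total = sum(spectrum)
    if total <= 0:
        # Assumption: an all-zero frame is given zero entropy.
        return 0.0
    probs = [x / total for x in spectrum]
    # Bins with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [2.0, 2.0, 2.0, 2.0]
single_peak = [0.0, 5.0, 0.0, 0.0]
h_uniform = spectral_entropy(uniform)
h_peak = spectral_entropy(single_peak)
print(h_uniform, h_peak)
```

Entropy is largest, log N for N bins, when the energy is spread evenly, and it is zero when all energy falls into a single bin.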

Properties of Entropy • entropyPlot.m (entropy plots for N=2 and N=3)

References
• References for EPD
  • Jialin Shen, Jeihweih Hung, Linshan Lee, “Robust entropy-based endpoint detection for speech recognition in noisy environments”, International Conference on Spoken Language Processing, Sydney, 1998.
