Robust Endpoint Detection and Energy Normalization for Real-Time Speech and Speaker Recognition • Qi Li, Senior Member, IEEE, Jinsong Zheng, Augustine Tsai, and Qiru Zhou, Member, IEEE • Presented by Chen Hung_Bin
Outline • Introduction to endpoint detection • Endpoint detection methods • Endpoint detection (filter) • State transition • Experiment
Introduction • The detection of the presence of speech embedded in various types of nonspeech events and background noise is called endpoint detection, speech detection, or speech activity detection. • This paper addresses endpoint detection with sequential and batch-mode processes to support real-time recognition. • Sequential: used for automatic speech recognition (ASR). • Batch-mode: utterances are usually as short as a few seconds, so the delay in response is usually small.
Introduction • Endpoint detection methods include: • energy threshold • pitch detection • spectrum analysis • cepstral analysis • zero-crossing rate • periodicity measure • chi-square test • entropy • hybrid detection
Introduction • energy
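The energy feature on this slide can be illustrated with a minimal sketch of a short-term log-energy contour. The function name, the 240-sample frame, the 80-sample hop (30 ms and 10 ms at 8 kHz), and the energy floor are assumptions made for illustration; the slides do not state them.

```python
import math

def log_energy_contour(samples, frame_len=240, hop=80, floor=1e-10):
    """Return one log-energy value (in dB) per frame.

    frame_len, hop, and floor are illustrative assumptions, not values
    taken from the slides.
    """
    contour = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        # The floor avoids log10(0) on all-zero (digital silence) frames.
        contour.append(10.0 * math.log10(max(energy, floor)))
    return contour
```

An energy-threshold detector would compare each value of this contour with a fixed level, which is why it is sensitive to the background noise level.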
Introduction • A Mandarin digit “eight.” • spectrum
Introduction • zero-crossing rate
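As a rough illustration of the zero-crossing rate, the sketch below counts sign changes between adjacent samples in one frame. The function name and the convention that zero counts as non-negative are assumptions.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    A sample equal to zero is treated as non-negative (an assumption).
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)
```

Noise-like and unvoiced frames tend to give a high rate, while voiced speech tends to give a low one.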
Introduction • The chi-square test given by • The hypothesis test can thus be written as
Introduction • entropy
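One common way to use entropy for speech detection is the entropy of the normalized power spectrum of a frame. The sketch below is an assumption about what the slide refers to, not the formula from the slide (which did not survive). It uses a naive DFT so that it needs only the standard library.

```python
import cmath
import math

def spectral_entropy(frame):
    """Entropy (in nats) of the normalized power spectrum of one frame."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        # Naive DFT bin k; fine for short illustrative frames.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0
    entropy = 0.0
    for p in power:
        q = p / total
        if q > 0.0:
            entropy -= q * math.log(q)
    return entropy
```

A flat spectrum gives the largest entropy and a spectrum concentrated in a few bins gives a small one, so structured speech frames tend to score lower than broadband noise.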
Introduction • Endpoint detection is crucial to both accuracy and speed for several reasons: • It is hard to model noise and silence accurately in changing environments. • If silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech. • Cepstral mean subtraction (CMS), a popular algorithm for robust speech recognition, needs accurate endpoints to compute the mean of speech frames precisely in order to improve recognition accuracy.
Introduction • This study points out: • The more accurately we can detect endpoints, the better we can do on real-time energy normalization. • Requirements: • accurate location of detected endpoints; • robust detection at various noise levels; • low computational complexity; • fast response time; • simple implementation.
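The dependence of CMS on endpoints can be sketched as follows: the mean is computed only over frames marked as speech and then subtracted from every frame. The function name and the `is_speech` mask are hypothetical; in practice the mask would come from the endpoint detector.

```python
def cepstral_mean_subtraction(cepstra, is_speech):
    """Subtract the mean of the speech frames from every frame.

    cepstra:   list of equal-length cepstral vectors, one per frame
    is_speech: list of booleans, True for frames between detected endpoints
    """
    speech = [c for c, keep in zip(cepstra, is_speech) if keep]
    if not speech:
        # No speech detected: return an unchanged copy.
        return [list(c) for c in cepstra]
    dim = len(speech[0])
    mean = [sum(c[d] for c in speech) / len(speech) for d in range(dim)]
    return [[c[d] - mean[d] for d in range(dim)] for c in cepstra]
```

If silence frames leak into the mask, the mean is biased toward the noise, which is the inaccuracy the slide warns about.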
Endpoint Detection (Filter) • First, we need a detector (filter) that meets the following general requirements: • 1) invariant outputs at various background energy levels; • 2) capability of detecting both beginning and ending points; • 3) short time delay or look-ahead; • 4) limited response level; • 5) maximum output signal-to-noise ratio (SNR) at endpoints; • 6) accurate location of detected endpoints; • 7) maximum suppression of false detection.
Filter for Both Beginning- and Ending-Edge Detection • Less than 25 points • Chosen filter size: • W = 13 • s = 0.5385 • A = 0.2208 • Let H(i) = h(i-13); then the filter has 25 points in total with a 24-frame look-ahead, since both H(1) and H(25) are zeros. • Count 30
Filter for Both Beginning- and Ending-Edge Detection • Filter sizes chosen in the paper: • Figure: shape of the optimal filter for beginning-edge detection, plotted as h(t), with W = 7 and s = 1. • Figure: shape of the optimal filter for ending-edge detection, plotted as h(t), with W = 35 and s = 0.2.
Batch-mode Endpoint Detection • Figure: output of the beginning-edge filter (solid line) and ending-edge filter (dashed line). • Lines E, F, G, and H indicate the locations of two pairs of beginning and ending points.
State Transition Diagram • A three-state transition diagram is used to make the final decisions. • The states are silence, in-speech, and leaving-speech. • Figure: state transition diagram for endpoint decision. • Example (8 kHz sampling rate): (a) energy contour of digit “4”; (b) filter outputs and state transitions.
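To show how an edge-detection filter acts on an energy contour, here is a minimal sketch. The taps are a stand-in (an odd-symmetric, derivative-of-Gaussian shape), not the optimal filter from the paper, whose formula is not reproduced on these slides. The half-width of 13 matches the W = 13 mentioned in the slides; `sigma`, the function names, and the edge padding are assumptions.

```python
import math

def stand_in_edge_filter(half_width=13, sigma=4.0):
    """Odd-symmetric taps h(-W..W); a stand-in, NOT the paper's optimal filter."""
    taps = [i * math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-half_width, half_width + 1)]
    scale = sum(abs(t) for t in taps)
    return [t / scale for t in taps]

def filter_energy(contour, taps):
    """F(t) = sum_i h(i) * g(t + i), repeating the edge frames as padding."""
    half_width = len(taps) // 2
    last = len(contour) - 1
    output = []
    for t in range(len(contour)):
        acc = 0.0
        for k, h in enumerate(taps):
            idx = min(max(t + k - half_width, 0), last)
            acc += h * contour[idx]
        output.append(acc)
    return output
```

Because the taps are odd-symmetric they sum to zero, so a constant shift in the background energy level leaves the output unchanged. A rising edge gives a positive peak and a falling edge gives a negative one, so a single filter can flag both beginning and ending points.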
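The three-state decision logic can be sketched as a small state machine driven by the filter output. The transition rules and the parameters `upper`, `lower`, and `gap` below are simplifying assumptions for illustration; the slides name the states but do not give the thresholds or the exact rules.

```python
SILENCE, IN_SPEECH, LEAVING_SPEECH = "silence", "in-speech", "leaving-speech"

def detect_endpoints(filter_out, upper=5.0, lower=-5.0, gap=30):
    """Return (begin, end) frame pairs; thresholds are hypothetical."""
    state = SILENCE
    begin = end = None
    count = 0
    segments = []
    for t, value in enumerate(filter_out):
        if state == SILENCE:
            if value >= upper:            # rising edge: speech begins
                state, begin = IN_SPEECH, t
        elif state == IN_SPEECH:
            if value <= lower:            # falling edge: tentative end
                state, end, count = LEAVING_SPEECH, t, 0
        else:                             # LEAVING_SPEECH
            if value >= upper:            # speech resumed: it was a pause
                state = IN_SPEECH
            else:
                count += 1
                if count >= gap:          # long enough: confirm the end
                    segments.append((begin, end))
                    state = SILENCE
    # Close a segment still open when the input ends.
    if state == IN_SPEECH:
        segments.append((begin, len(filter_out) - 1))
    elif state == LEAVING_SPEECH:
        segments.append((begin, end))
    return segments
```

The leaving-speech state lets short pauses inside an utterance be absorbed rather than splitting it into separate segments.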
Real-Time Energy Normalization • The purpose of energy normalization is to normalize the utterance energy g(t) such that the largest energy value is close to zero.
Real-Time Energy Normalization • Example: (a) Energy contours of “4-327-631-Z214” from the original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (b) Filter outputs for the 5 dB (dashed line) and 20 dB (solid line) SNR cases. (c) Detected endpoints and normalized energy for the 20 dB SNR case and (d) for the 5 dB SNR case.
Database Evaluation • The proposed algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases. • Baseline Endpoint Detection: • A six-state transition diagram is used. • The states are initializing, silence, rising, energy, fell-rising, and fell. • In total, eight counters and 24 hard-limit thresholds are used for the state-transition decisions.
Database Evaluation • Noisy Database Evaluation: • In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to 8 kHz. • Car and other background noises were artificially added to the original database at SNR levels of 5, 10, 15, and 20 dB. • The original database has 39 utterances and 1738 digits in total. • LPC features and the short-term energy were used, with hidden Markov models (HMMs) for recognition.
Database Evaluation • Comparisons on real-time connected digit recognition: (a) utterance in DB5: “1 Z 4 O 5 8 2”; (b) baseline, recognized as “1 Z 4 O 5 8”; (c) proposed, recognized as “1 Z 4 O 5 8 2”; (d) filter output.
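The normalization goal stated here, a largest energy value close to zero, can be sketched in two ways. The batch version subtracts the true maximum of the utterance. The running version is a causal approximation that subtracts the largest value seen so far, starting from an initial estimate. Both function names and the `initial_peak` parameter are assumptions, and the running version is not claimed to be the paper's update rule.

```python
def normalize_batch(contour):
    """Subtract the utterance maximum so the largest value becomes zero."""
    peak = max(contour)
    return [g - peak for g in contour]

def normalize_running(contour, initial_peak):
    """Causal variant: subtract the largest energy seen so far.

    initial_peak is a hypothetical starting estimate of the maximum.
    """
    peak = initial_peak
    normalized = []
    for g in contour:
        peak = max(peak, g)
        normalized.append(g - peak)
    return normalized
```

In the running version, frames processed before the true peak arrives are normalized against an estimate that is too low, which is why a good early estimate of the maximum matters for real-time use.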
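Adding noise at a target SNR, as done to build the noisy database, can be sketched as follows. The function name is hypothetical, and the SNR is computed here over the whole utterance, which is an assumption; the slides do not say how the SNR was measured.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio is snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Choose gain so that p_speech / (gain**2 * p_noise) == 10 ** (snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling this with `snr_db` set to 5, 10, 15, and 20 would produce the four noise conditions listed on the slide.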
Database Evaluation • Telephone Database Evaluation: • The proposed algorithm was further evaluated in 11 databases collected from the telephone networks with 8 kHz sampling rates in various acoustic environments. • DB1 to DB5 contain digits, alphabet and word strings. • DB6 to DB11 contain pure digit strings. • In the proposed system, we set the parameters as
Database Evaluation • [Results figure: panels for digits, alphabet and word strings, and for pure digit strings]
CONCLUSIONS • Since the entire algorithm only uses a 1-D energy feature, it has low complexity and is very fast in computation.