Histogram-based Quantization for Distributed / Robust Speech Recognition Chia-yu Wan, Lin-shan Lee College of EECS, National Taiwan University, R. O. C. 2007/08/16
Outline • Introduction • Histogram-based Quantization (HQ) • Joint Uncertainty Decoding (JUD) • Three-stage Error Concealment (EC) • Conclusion
Problems of Distance-based VQ • Conventional distance-based VQ (e.g. SVQ) has been widely used in DSR • Dynamic environmental noise and codebook mismatch jointly degrade the performance of SVQ • Noise moves clean speech into another partition cell (X to Y), so quantization increases the difference between clean and noisy features • Mismatch between the fixed VQ codebook and the test data increases distortion • Histogram-based Quantization (HQ) is proposed to solve these problems
Histogram-based Quantization (HQ) • Decision boundaries yi (i = 1, …, N) are dynamically defined by the empirical CDF C(y) • Representative values zi (i = 1, …, N) are fixed, obtained by transforming through a standard Gaussian
Histogram-based Quantization (HQ) • The actual decision boundaries (on the horizontal scale) for xt are dynamically defined by the inverse transformation of C(y)
Histogram-based Quantization (HQ) • With a new histogram C’(y’), the decision boundaries are automatically adjusted accordingly • Decision boundaries follow the local statistics, so there is no codebook mismatch problem
Histogram-based Quantization (HQ) • Based on the CDF on the vertical scale and the histogram, HQ is less sensitive to noise on the horizontal scale • Disturbances are automatically absorbed within an HQ block • Dynamic nature of HQ: a hidden (fixed) codebook on the vertical scale is transformed by the dynamic C(y), so the boundaries {yi} are dynamic on the horizontal scale
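The mechanism above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the fixed "hidden codebook" is taken as the CDF midpoints of N equal-probability intervals under a standard Gaussian, and the dynamic boundaries come implicitly from the empirical CDF of each block (block size and number of levels are assumed for illustration).

```python
from statistics import NormalDist

def hq_quantize(block, n_levels=16):
    """Histogram-based Quantization sketch.

    Representative values z_i are fixed: the CDF midpoints of n_levels
    equal-probability intervals under a standard Gaussian (the 'hidden
    codebook' on the vertical scale).  Decision boundaries on the
    horizontal scale are implied dynamically by the empirical CDF C(y)
    of the current block, so no trained codebook is needed.
    """
    nd = NormalDist()
    # Fixed hidden codebook on the vertical (CDF) scale
    z = [nd.inv_cdf((2 * i + 1) / (2 * n_levels)) for i in range(n_levels)]
    n = len(block)
    # Empirical CDF value of each sample = (rank + 0.5) / n
    order = sorted(range(n), key=lambda k: block[k])
    idx = [0] * n
    for rank, k in enumerate(order):
        cdf = (rank + 0.5) / n
        idx[k] = min(int(cdf * n_levels), n_levels - 1)
    # Return codeword indices and their fixed representative values
    return idx, [z[i] for i in idx]
```

Because only the ranks of the samples within a block matter, any disturbance that does not change a sample's rank enough to cross a cell boundary is absorbed, which is the robustness property argued above.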
Discussions on the robustness of Histogram-based Quantization (HQ) • Distributed speech recognition: SVQ vs. HQ • Robust speech recognition: HEQ vs. HQ
Comparison of Distance-based VQ and Histogram-based Quantization (HQ) • HQ solves the major problems of conventional distance-based VQ • A fixed codebook cannot represent noisy speech well; HQ is dynamically adjusted to the local statistics, so there is no codebook mismatch • Distance-based quantization increases the difference between clean and noisy speech; HQ is inherently robust, with noise disturbances automatically absorbed by C(y)
HEQ (Histogram Equalization) vs. HQ (Histogram-based Quantization) • HEQ performs a point-to-point transformation: point-based order statistics are more easily disturbed • HQ performs a block-based transformation: disturbances are automatically absorbed within a block, and with a proper choice of block size the block uncertainty can be compensated by GMM-based uncertainty decoding • Averaged normalized distance d between clean and corrupted speech features, based on the AURORA 2 database: HQ gives a smaller d for all SNR conditions, i.e. it is less influenced by noise disturbances
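For contrast, HEQ's point-to-point mapping can be sketched as below (an assumed minimal form: each sample is mapped through the block's empirical CDF and then the inverse CDF of a standard Gaussian reference). A single outlier shifts the ranks, and hence the transformed values, of its neighbors directly, whereas HQ's quantization cells absorb small rank perturbations.

```python
from statistics import NormalDist

def heq_pointwise(block):
    """Histogram Equalization sketch: map every sample point-to-point
    through the empirical CDF of the block, then through the inverse
    CDF of a standard Gaussian reference distribution."""
    nd = NormalDist()
    n = len(block)
    order = sorted(range(n), key=lambda k: block[k])
    out = [0.0] * n
    for rank, k in enumerate(order):
        # (rank + 0.5) / n is the empirical CDF value of sample k
        out[k] = nd.inv_cdf((rank + 0.5) / n)
    return out
```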
Further analysis: bit rates vs. SNR (clean-condition training and multi-condition training)
HQ-JUD • For robust and/or distributed speech recognition • Robust speech recognition: HQ is used as the front-end feature transformation, JUD as the enhancement approach at the back-end recognizer • Distributed Speech Recognition (DSR): HQ is applied at the client for data compression, JUD at the server
Joint Uncertainty Decoding (1/4) – Uncertainty Observation Decoding • w: observation, o: uncorrupted features • The HMM becomes less discriminative for features with higher uncertainty • The variance is increased more for more uncertain features
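The effect can be sketched as a variance inflation in the HMM's Gaussian likelihood. This is a minimal one-dimensional illustration under the usual uncertainty-decoding assumption that the feature uncertainty adds to the model variance:

```python
import math

def uncertain_log_likelihood(w, mean, var, uncertainty):
    """Uncertainty decoding sketch: the observation's uncertainty is
    added to the HMM Gaussian's variance, flattening the likelihood
    and making the model less discriminative for uncertain features."""
    total_var = var + uncertainty
    return -0.5 * (math.log(2 * math.pi * total_var)
                   + (w - mean) ** 2 / total_var)
```

With higher uncertainty, the log-likelihood gap between competing Gaussians shrinks, so an unreliable feature contributes less to the recognition decision.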
Joint Uncertainty Decoding (2/4) – Uncertainty for quantization errors • The codeword is the observation w; the samples in the partition cell are the possible uncorrupted features o • p(o) is the pdf of the samples within the partition cell, and its variance gives the quantization uncertainty • The variances are increased for loosely quantized cells (the more uncertain regions)
Joint Uncertainty Decoding (3/4) – Uncertainty for environmental noise • Histogram shift • The variances are increased for HQ features with a larger histogram shift
Joint Uncertainty Decoding (4/4) • Jointly consider the uncertainty caused by both environmental noise and quantization errors • One of the two usually dominates: quantization errors at high SNR (disturbances are absorbed within the HQ block), environmental noise at low SNR (noisy features are moved to other partition cells)
HQ-JUD for distributed speech recognition • Different types of noise, averaged over all SNR values • HQ performed better than HEQ-SVQ for all types of noise • HEQSVQ-UD was slightly worse than HEQ for set C • HQ-JUD consistently improved the performance of HQ • HQ-JUD consistently performed better than HEQSVQ-UD
HQ-JUD for distributed speech recognition • Different SNR conditions, averaged over all noise types • HQ-JUD significantly improved the performance of SVQ-UD • HQ-JUD consistently performed better than HEQSVQ-UD
Stage 1: error detection • Frame-level error detection: the received frame-pairs are first checked with CRC • Subvector-level error detection: the erroneous frame-pairs are then checked by the HQ consistency check • The quantized HQ codewords represent the order-statistics information of the original parameters, and quantization does not change the order statistics • Re-performing HQ on a received subvector codeword should therefore map it back into the same partition cell
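The consistency check can be sketched as below. This is an assumed minimal form: the decoded representative values are re-quantized with the same empirical-CDF rule, and any codeword that does not reproduce itself is flagged. The codebook construction (Gaussian CDF midpoints) follows the HQ sketch earlier and is an illustrative assumption.

```python
from statistics import NormalDist

def hq_consistency_check(received_indices, n_levels=16):
    """Sketch of the HQ consistency check (assumed form).

    HQ codewords carry the order statistics of the original block, and
    quantization preserves those order statistics.  Re-performing HQ on
    the decoded representative values must therefore reproduce the
    received codewords; any mismatch flags a likely bit error in that
    subvector.
    """
    nd = NormalDist()
    # Fixed representative values (hidden codebook on the CDF scale)
    z = [nd.inv_cdf((2 * i + 1) / (2 * n_levels)) for i in range(n_levels)]
    decoded = [z[i] for i in received_indices]
    n = len(decoded)
    # Re-perform HQ: empirical-CDF quantization of the decoded block
    order = sorted(range(n), key=lambda k: decoded[k])
    requant = [0] * n
    for rank, k in enumerate(order):
        requant[k] = min(int((rank + 0.5) / n * n_levels), n_levels - 1)
    # True where the subvector passes the consistency check
    return [a == b for a, b in zip(requant, received_indices)]
```

A block of codewords produced by genuine HQ passes the check, while a bit error that moves one codeword disturbs the rank structure and is detected.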
Stage 1: error detection • Noise seriously affects the SVQ data consistency check: precision degrades from 66% under clean conditions down to 12% at 0 dB • The HQ-based consistency check is much more stable across all SNR values: both recall and precision rates are higher
Stage 2: reconstruction • Based on the maximum a posteriori (MAP) criterion: consider the probability of every possible codeword St(i) at time t, given the current and previous received subvector codewords Rt and Rt-1 • Prior speech-source statistics: an HQ codeword bigram model • Channel transition probability: based on the BER estimated in stage 1 • Reliability of the received subvectors: the relative reliability between the prior speech source and the wireless channel
Stage 2: reconstruction • Channel transition probability P(Rt | St(i)) • Significantly differentiated (for different codewords i, with different Hamming distance d) when Rt is more reliable (smaller BER) • More emphasis is put on the prior speech source when Rt is less reliable • The BER is estimated as the number of inconsistent subvectors in the present frame divided by the total number of bits in the frame
Stage 2: reconstruction • Prior source information P(St(i) | Rt-1) • Based on the codeword bigram trained from clean training data in AURORA 2 • HQ can estimate the lost subvectors more precisely than SVQ, as measured by conditional entropy
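The MAP reconstruction can be sketched as below, combining a bit-level channel model P(Rt | St(i)) = BER^d (1 − BER)^(B−d), with d the Hamming distance over B bits, and the codeword-bigram prior P(St(i) | Rt−1). The bit-level form of the channel model and the flat candidate list are assumptions for illustration:

```python
def map_reconstruct(received_bits, candidate_bits, bigram_prior, ber):
    """MAP reconstruction sketch: pick the codeword i maximizing
    P(R_t | S_t(i)) * P(S_t(i) | R_{t-1}).  When the estimated BER is
    small the channel term dominates; when it is large the decision
    leans on the speech-source prior."""
    n_bits = len(received_bits)
    best_i, best_p = None, -1.0
    for i, cand in enumerate(candidate_bits):
        d = sum(a != b for a, b in zip(received_bits, cand))  # Hamming distance
        p_channel = (ber ** d) * ((1.0 - ber) ** (n_bits - d))
        p = p_channel * bigram_prior[i]
        if p > best_p:
            best_i, best_p = i, p
    return best_i
```

This exhibits the reliability trade-off described above: at a small estimated BER the received codeword wins even against a strong prior, while at BER ≈ 0.5 the channel term is uninformative and the prior decides.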
Stage 3: compensation in Viterbi decoding • The distribution P(St(i) | Rt, Rt-1) characterizes the uncertainty of the estimated features • Assuming this distribution is Gaussian, its variance is used in uncertainty decoding • This makes the HMMs less discriminative for estimated subvectors with higher uncertainty
HQ-based DSR system with transmission errors • Features corrupted by noise are more susceptible to transmission errors • For SVQ, performance drops from 98% to 87% under clean conditions, and from 60% to 36% at 10 dB SNR
HQ-based DSR system with transmission errors • The improvements that HQ offered over HEQ-SVQ when transmission errors were present are consistent and significant at all SNR values • HQ is robust against both environmental noise and transmission errors
Analysis of the recognition-accuracy degradation caused by transmission errors • Comparison of SVQ, HEQ-SVQ and HQ in terms of the percentage of words that were correctly recognized without transmission errors but incorrectly recognized after transmission
HQ-Based DSR with Wireless Channels and Error Concealment (g: GPRS, r: ETSI repetition, c: three-stage EC) • The ETSI repetition technique actually degraded the performance of HEQ-SVQ • Whole feature vectors, including the correct subvectors, are replaced by inaccurate estimations
HQ-Based DSR with Wireless Channels and Error Concealment (g: GPRS, r: ETSI repetition, c: three-stage EC) • The three-stage EC improved the performance significantly in all cases • Robust not only against transmission errors but against environmental noise as well
Conclusions • Histogram-based Quantization (HQ) is proposed: a novel approach for robust and/or distributed speech recognition (DSR) • Robust against environmental noise (for all noise types and all SNR conditions) and transmission errors • For future personalized and context-aware DSR environments, HQ can be adapted to network and terminal capabilities, with recognition performance optimized for the environmental conditions