Histogram-based Quantization for Distributed / Robust Speech Recognition Chia-yu Wan, Lin-shan Lee College of EECS, National Taiwan University, R. O. C. 2007/08/16
Outline • Introduction • Histogram-based Quantization (HQ) • Joint Uncertainty Decoding (JUD) • Three-stage Error Concealment (EC) • Conclusion
Problems of Distance-based VQ • Conventional distance-based VQ (e.g. SVQ) has been widely used in DSR • Dynamic environmental noise and codebook mismatch jointly degrade the performance of SVQ • Noise moves clean speech into another partition cell (X to Y), so quantization increases the difference between clean and noisy features • Mismatch between the fixed VQ codebook and the test data increases distortion • Histogram-based Quantization (HQ) is proposed to solve these problems
Histogram-based Quantization (HQ) • Decision boundaries yi (i = 1, …, N) are dynamically defined by the empirical CDF C(y) • Representative values zi (i = 1, …, N) are fixed, obtained by transforming through a standard Gaussian
Histogram-based Quantization (HQ) • The actual decision boundaries (on the horizontal scale) for xt are dynamically defined by the inverse transformation of C(y)
Histogram-based Quantization (HQ) • With a new histogram C’(y’), the decision boundaries are automatically adjusted accordingly • Decision boundaries follow the local statistics, so there is no codebook mismatch problem
Histogram-based Quantization (HQ) • Based on the CDF on the vertical scale and the histogram, HQ is less sensitive to noise on the horizontal scale • Disturbances are automatically absorbed within an HQ block • Dynamic nature of HQ: a hidden (fixed) codebook on the vertical scale is transformed by the dynamic C(y), so the boundaries {yi} are dynamic on the horizontal scale
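The mechanism above can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the fixed "hidden codebook" is taken as the CDF midpoints of N equal-probability intervals under a standard Gaussian, and the dynamic boundaries come implicitly from the empirical CDF of each block (block size and number of levels are assumed for illustration).

```python
from statistics import NormalDist

def hq_quantize(block, n_levels=16):
    """Histogram-based Quantization sketch.

    Representative values z_i are fixed: the CDF midpoints of n_levels
    equal-probability intervals under a standard Gaussian (the 'hidden
    codebook' on the vertical scale).  Decision boundaries on the
    horizontal scale are implied dynamically by the empirical CDF C(y)
    of the current block, so no trained codebook is needed.
    """
    nd = NormalDist()
    # Fixed hidden codebook on the vertical (CDF) scale
    z = [nd.inv_cdf((2 * i + 1) / (2 * n_levels)) for i in range(n_levels)]
    n = len(block)
    # Empirical CDF value of each sample = (rank + 0.5) / n
    order = sorted(range(n), key=lambda k: block[k])
    idx = [0] * n
    for rank, k in enumerate(order):
        cdf = (rank + 0.5) / n
        idx[k] = min(int(cdf * n_levels), n_levels - 1)
    # Return codeword indices and their fixed representative values
    return idx, [z[i] for i in idx]
```

Because only the ranks of the samples within a block matter, any disturbance that does not change a sample's rank enough to cross a cell boundary is absorbed, which is the robustness property argued above.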
Discussions on the robustness of Histogram-based Quantization (HQ) • Distributed speech recognition: SVQ vs. HQ • Robust speech recognition: HEQ vs. HQ
Comparison of Distance-based VQ and Histogram-based Quantization (HQ) • HQ solves the major problems of conventional distance-based VQ • A fixed codebook cannot represent noisy speech well; HQ is dynamically adjusted to the local statistics, so there is no codebook mismatch • Distance-based quantization increases the difference between clean and noisy speech; HQ is inherently robust, with noise disturbances automatically absorbed by C(y)
HEQ (Histogram Equalization) vs. HQ (Histogram-based Quantization) • HEQ performs a point-to-point transformation: point-based order statistics are more easily disturbed • HQ performs a block-based transformation: disturbances are automatically absorbed within a block, and with a proper choice of block size the block uncertainty can be compensated by GMM-based uncertainty decoding • Averaged normalized distance d between clean and corrupted speech features, based on the AURORA 2 database: HQ gives a smaller d for all SNR conditions, i.e. it is less influenced by noise disturbances
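For contrast, HEQ's point-to-point mapping can be sketched as below (an assumed minimal form: each sample is mapped through the block's empirical CDF and then the inverse CDF of a standard Gaussian reference). A single outlier shifts the ranks, and hence the transformed values, of its neighbors directly, whereas HQ's quantization cells absorb small rank perturbations.

```python
from statistics import NormalDist

def heq_pointwise(block):
    """Histogram Equalization sketch: map every sample point-to-point
    through the empirical CDF of the block, then through the inverse
    CDF of a standard Gaussian reference distribution."""
    nd = NormalDist()
    n = len(block)
    order = sorted(range(n), key=lambda k: block[k])
    out = [0.0] * n
    for rank, k in enumerate(order):
        # (rank + 0.5) / n is the empirical CDF value of sample k
        out[k] = nd.inv_cdf((rank + 0.5) / n)
    return out
```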
Further analysis: bit rates vs. SNR (clean-condition training and multi-condition training)
HQ-JUD • For robust and/or distributed speech recognition • Robust speech recognition: HQ is used as the front-end feature transformation, JUD as the enhancement approach at the back-end recognizer • Distributed Speech Recognition (DSR): HQ is applied at the client for data compression, JUD at the server
Joint Uncertainty Decoding (1/4) – Uncertainty Observation Decoding • w: observation, o: uncorrupted features • The HMM becomes less discriminative for features with higher uncertainty • The variance is increased more for more uncertain features
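The effect can be sketched as a variance inflation in the HMM's Gaussian likelihood. This is a minimal one-dimensional illustration under the usual uncertainty-decoding assumption that the feature uncertainty adds to the model variance:

```python
import math

def uncertain_log_likelihood(w, mean, var, uncertainty):
    """Uncertainty decoding sketch: the observation's uncertainty is
    added to the HMM Gaussian's variance, flattening the likelihood
    and making the model less discriminative for uncertain features."""
    total_var = var + uncertainty
    return -0.5 * (math.log(2 * math.pi * total_var)
                   + (w - mean) ** 2 / total_var)
```

With higher uncertainty, the log-likelihood gap between competing Gaussians shrinks, so an unreliable feature contributes less to the recognition decision.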
Joint Uncertainty Decoding (2/4) – Uncertainty for quantization errors • The codeword is the observation w; the samples in the partition cell are the possible uncorrupted features o • p(o) is the pdf of the samples within the partition cell, and its variance gives the quantization uncertainty • The variances are increased for loosely quantized cells (the more uncertain regions)
Joint Uncertainty Decoding (3/4) – Uncertainty for environmental noise • Histogram shift • The variances are increased for HQ features with a larger histogram shift
Joint Uncertainty Decoding (4/4) • Jointly consider the uncertainty caused by both environmental noise and quantization errors • One of the two usually dominates: quantization errors at high SNR (disturbances are absorbed within the HQ block), environmental noise at low SNR (noisy features are moved to other partition cells)
HQ-JUD for distributed speech recognition • Different types of noise, averaged over all SNR values • HQ performed better than HEQ-SVQ for all types of noise • HEQSVQ-UD was slightly worse than HEQ for set C • HQ-JUD consistently improved the performance of HQ • HQ-JUD consistently performed better than HEQSVQ-UD
HQ-JUD for distributed speech recognition • Different SNR conditions, averaged over all noise types • HQ-JUD significantly improved the performance of SVQ-UD • HQ-JUD consistently performed better than HEQSVQ-UD
Stage 1: error detection • Frame-level error detection: the received frame-pairs are first checked with CRC • Subvector-level error detection: the erroneous frame-pairs are then checked by the HQ consistency check • The quantized HQ codewords represent the order-statistics information of the original parameters, and quantization does not change the order statistics • Re-performing HQ on a received subvector codeword should therefore map it back into the same partition cell
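The consistency check can be sketched as below. This is an assumed minimal form: the decoded representative values are re-quantized with the same empirical-CDF rule, and any codeword that does not reproduce itself is flagged. The codebook construction (Gaussian CDF midpoints) follows the HQ sketch earlier and is an illustrative assumption.

```python
from statistics import NormalDist

def hq_consistency_check(received_indices, n_levels=16):
    """Sketch of the HQ consistency check (assumed form).

    HQ codewords carry the order statistics of the original block, and
    quantization preserves those order statistics.  Re-performing HQ on
    the decoded representative values must therefore reproduce the
    received codewords; any mismatch flags a likely bit error in that
    subvector.
    """
    nd = NormalDist()
    # Fixed representative values (hidden codebook on the CDF scale)
    z = [nd.inv_cdf((2 * i + 1) / (2 * n_levels)) for i in range(n_levels)]
    decoded = [z[i] for i in received_indices]
    n = len(decoded)
    # Re-perform HQ: empirical-CDF quantization of the decoded block
    order = sorted(range(n), key=lambda k: decoded[k])
    requant = [0] * n
    for rank, k in enumerate(order):
        requant[k] = min(int((rank + 0.5) / n * n_levels), n_levels - 1)
    # True where the subvector passes the consistency check
    return [a == b for a, b in zip(requant, received_indices)]
```

A block of codewords produced by genuine HQ passes the check, while a bit error that moves one codeword disturbs the rank structure and is detected.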
Stage 1: error detection • Noise seriously affects the SVQ data consistency check: precision degrades from 66% under clean conditions down to 12% at 0 dB • The HQ-based consistency check is much more stable across all SNR values: both recall and precision rates are higher
Stage 2: reconstruction • Based on the maximum a posteriori (MAP) criterion: consider the probability of every possible codeword St(i) at time t, given the current and previous received subvector codewords Rt and Rt-1 • Prior speech-source statistics: an HQ codeword bigram model • Channel transition probability: based on the BER estimated in stage 1 • Reliability of the received subvectors: the relative reliability between the prior speech source and the wireless channel
Stage 2: reconstruction • Channel transition probability P(Rt | St(i)) • Significantly differentiated (for different codewords i, with different Hamming distance d) when Rt is more reliable (smaller BER) • More emphasis is put on the prior speech source when Rt is less reliable • The BER is estimated as the number of inconsistent subvectors in the present frame divided by the total number of bits in the frame
Stage 2: reconstruction • Prior source information P(St(i) | Rt-1) • Based on the codeword bigram trained from clean training data in AURORA 2 • HQ can estimate the lost subvectors more precisely than SVQ, as measured by conditional entropy
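The MAP reconstruction can be sketched as below, combining a bit-level channel model P(Rt | St(i)) = BER^d (1 − BER)^(B−d), with d the Hamming distance over B bits, and the codeword-bigram prior P(St(i) | Rt−1). The bit-level form of the channel model and the flat candidate list are assumptions for illustration:

```python
def map_reconstruct(received_bits, candidate_bits, bigram_prior, ber):
    """MAP reconstruction sketch: pick the codeword i maximizing
    P(R_t | S_t(i)) * P(S_t(i) | R_{t-1}).  When the estimated BER is
    small the channel term dominates; when it is large the decision
    leans on the speech-source prior."""
    n_bits = len(received_bits)
    best_i, best_p = None, -1.0
    for i, cand in enumerate(candidate_bits):
        d = sum(a != b for a, b in zip(received_bits, cand))  # Hamming distance
        p_channel = (ber ** d) * ((1.0 - ber) ** (n_bits - d))
        p = p_channel * bigram_prior[i]
        if p > best_p:
            best_i, best_p = i, p
    return best_i
```

This exhibits the reliability trade-off described above: at a small estimated BER the received codeword wins even against a strong prior, while at BER ≈ 0.5 the channel term is uninformative and the prior decides.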
Stage 3: compensation in Viterbi decoding • The distribution P(St(i) | Rt, Rt-1) characterizes the uncertainty of the estimated features • Assuming this distribution is Gaussian, its variance is used in uncertainty decoding • This makes the HMMs less discriminative for estimated subvectors with higher uncertainty
HQ-based DSR system with transmission errors • Features corrupted by noise are more susceptible to transmission errors • For SVQ, performance drops from 98% to 87% under clean conditions, and from 60% to 36% at 10 dB SNR
HQ-based DSR system with transmission errors • The improvements that HQ offered over HEQ-SVQ when transmission errors were present are consistent and significant at all SNR values • HQ is robust against both environmental noise and transmission errors
Analysis of the recognition-accuracy degradation caused by transmission errors • Comparison of SVQ, HEQ-SVQ and HQ in terms of the percentage of words that were correctly recognized without transmission errors but incorrectly recognized after transmission
HQ-Based DSR with Wireless Channels and Error Concealment (g: GPRS, r: ETSI repetition, c: three-stage EC) • The ETSI repetition technique actually degraded the performance of HEQ-SVQ • Whole feature vectors, including the correct subvectors, are replaced by inaccurate estimations
HQ-Based DSR with Wireless Channels and Error Concealment (g: GPRS, r: ETSI repetition, c: three-stage EC) • The three-stage EC improved the performance significantly in all cases • Robust not only against transmission errors but against environmental noise as well
Conclusions • Histogram-based Quantization (HQ) is proposed: a novel approach for robust and/or distributed speech recognition (DSR) • Robust against environmental noise (for all noise types and all SNR conditions) and transmission errors • For future personalized and context-aware DSR environments, HQ can be adapted to network and terminal capabilities, with recognition performance optimized for the environmental conditions