Advanced Microphone Array and ASR Integration Professor: Yuan-Fu Liao National Taipei University of Technology
Overview
• Introduction
• Microphone Array and ASR Integration
  • Noise - Phase Error Filtering
    • Maximum Likelihood-based Integration
    • Minimum Classification Error-like Integration
  • Reverberation - Subband Filtering-and-Sum
    • Maximum Likelihood-based Integration
    • Minimum Classification Error-based Integration
• Summary
Traditional Beamforming + ASR • Pipeline: first enhance the speech with a beamformer, then feed the enhanced signal into the recognizer
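The enhancement stage of this pipeline is often a delay-and-sum beamformer. The sketch below is a minimal illustration, not the setup used in the talk: the function name `delay_and_sum`, the integer sample delays, and the zero padding at the edges are all assumptions made for the example.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align each channel by its (assumed known) integer delay and average.

    signals: array of shape (num_mics, num_samples)
    delays:  per-channel delay in samples relative to the reference channel
    """
    signals = np.asarray(signals, dtype=float)
    num_mics, num_samples = signals.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        d = int(delays[m])
        aligned = np.zeros(num_samples)
        if d >= 0:
            # Channel arrives late: advance it by d samples (zero-padded tail).
            aligned[:num_samples - d] = signals[m, d:]
        else:
            # Channel arrives early: delay it by -d samples (zero-padded head).
            aligned[-d:] = signals[m, :num_samples + d]
        out += aligned
    return out / num_mics
```

In the traditional pipeline the output of such a function would be passed to the recognizer unchanged, with no information flowing back from the recognizer to the array.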
Bridge the Gap between Array and Speech Recognizer • Take advantage of the available a priori knowledge, i.e., the underlying recognition model • Feed the output of the recognizer directly back to the microphone array
References
• Noise - dual-microphone phase error filtering
  • G. Shi, P. Aarabi, and H. Jiang, "Phase-Based Dual-Microphone Speech Enhancement Using A Prior Speech Model," IEEE Trans. Audio Speech Lang. Process., vol. 15, pp. 109-118, 2007.
  • C. Kim, K. Kumar, B. Raj, and R. M. Stern, "Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain," in INTERSPEECH 2009, pp. 2495-2498, 2009.
  • H.-C. Liao, Y.-F. Liao, and C.-H. Lee, "Maximum Confidence Measure Based Interaural Phase Difference Estimation for Noise Masking in Dual-Microphone Robust Speech Recognition," in INTERSPEECH 2011.
• Reverberation - subband filtering-and-sum
  • M. L. Seltzer, B. Raj, and R. M. Stern, "Likelihood-maximizing beamforming for robust hands-free speech recognition," IEEE Trans. Speech and Audio Processing, vol. 12, no. 5, pp. 489-498, Sep. 2004.
  • M. L. Seltzer and R. M. Stern, "Subband likelihood-maximizing beamforming for speech recognition in reverberant environments," IEEE Trans. Speech and Audio Processing, vol. 14, no. 6, pp. 2109-2121, Nov. 2006.
  • Y.-F. Liao and I.-Y. Xu, "Subband minimum classification error beamforming for speech recognition in reverberant environments," in ICASSP 2010.
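Signal Modeling (ITD)
• Sampling rate: 8000 Hz
• Microphone spacing (mic1 to mic2): 0.05 m
• A sound source at angle Φ reaches the two microphones with a path-length difference of 0.05 × cos(Φ) m, observed as the interaural time delay Δt
[Figure: geometry of the sound source, mic1 and mic2.]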
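The geometry on this slide can be turned into numbers with a short calculation. This is a sketch under stated assumptions: a far-field source, the angle Φ measured from the microphone axis, and a speed of sound of 343 m/s (the slide does not give one). The function names are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed, not stated on the slide
MIC_SPACING = 0.05      # m, from the slide
SAMPLE_RATE = 8000      # Hz, from the slide

def itd_seconds(angle_deg, spacing=MIC_SPACING, c=SPEED_OF_SOUND):
    """Interaural time delay: path-length difference spacing*cos(angle) over c."""
    return spacing * math.cos(math.radians(angle_deg)) / c

def itd_samples(angle_deg, fs=SAMPLE_RATE):
    """The same delay expressed in (fractional) samples at rate fs."""
    return itd_seconds(angle_deg) * fs
```

Under these assumptions the largest possible delay (Φ = 0°) is only slightly more than one sample at 8000 Hz, which is one reason to estimate the delay from phase differences in the frequency domain rather than from sample shifts.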
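Binary Masking
[Figure: the micL and micR signals are transformed by FFT; time-frequency bins with ITD < τ are attributed to the speaker and kept, and bins with ITD > τ are attributed to interference and removed (masking).]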
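A single-frame version of this masking rule might look like the sketch below. It is an illustration rather than the authors' implementation: the per-bin ITD is estimated from the inter-channel phase difference, the comparison uses the absolute value of the ITD, and phase wrapping at high frequencies is ignored. The names `itd_binary_mask` and `apply_mask` are made up for the example.

```python
import numpy as np

def itd_binary_mask(frame_l, frame_r, fs, tau):
    """Return (mask, itd) for one frame; mask is True where |ITD| < tau.

    Positive ITD means the right channel lags the left one.
    """
    spec_l = np.fft.rfft(frame_l)
    spec_r = np.fft.rfft(frame_r)
    freqs = np.fft.rfftfreq(len(frame_l), d=1.0 / fs)
    # Phase of L times conj(R) is the inter-channel phase difference.
    phase_diff = np.angle(spec_l * np.conj(spec_r))
    itd = np.zeros_like(phase_diff)
    nonzero = freqs > 0  # the DC bin carries no delay information
    itd[nonzero] = phase_diff[nonzero] / (2.0 * np.pi * freqs[nonzero])
    return np.abs(itd) < tau, itd

def apply_mask(frame, mask):
    """Zero the rejected bins and return the time-domain frame."""
    return np.fft.irfft(np.fft.rfft(frame) * mask, n=len(frame))
```

The threshold τ decides how wide a range of arrival angles is accepted as "speaker", so its value directly trades interference suppression against speech distortion.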
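Optimal τ estimation
[Block diagram, labels translated: left microphone signal; right microphone signal; short-time Fourier transform; ITD computation module; threshold adjustment module (threshold input); feature vector computation module; speech recognition (one stage) with voice command models, model N+1; X-score computation module (X-score output); maximum X-score? (yes / no); at least one; automatic; output recognition result / threshold.]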
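The diagram appears to describe a loop in which the threshold is adjusted and the value giving the maximum X-score is kept. A minimal sketch of that idea is a search over candidate thresholds. The slide does not define the X-score, so `score_fn` below is a hypothetical callable that would mask the signals with a given τ, run the recognizer, and return its score; a plain grid search is also an assumption.

```python
def select_threshold(candidates, score_fn):
    """Return (best_tau, best_score) over the candidate thresholds.

    score_fn(tau) is a hypothetical stand-in for: mask with tau,
    recognize the masked signal, and compute the X-score.
    """
    best_tau, best_score = None, float("-inf")
    for tau in candidates:
        score = score_fn(tau)
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score
```

Because the score comes from the recognizer itself, the chosen τ is tuned to recognition quality rather than to a signal-level criterion.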
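Testing Database
• Recording setup for the re-recorded dual-microphone audio
  • Anechoic chamber: 5 × 4 × 3 m³
  • Microphone position: center of the anechoic chamber
  • Distance between the two microphones: 5 cm
  • Microphone height: 1 m
  • Distance from the target source to the center of the microphone pair: 30 cm
  • Babble noise source angle: 30° and 60°
• Test corpus
  • 50 commands (e.g. "forward", "backward", ...)
  • 11 speakers (6 males and 5 females)
  • 547 utterances in total
  • Noise added artificially
  • SNR: 0, 6, 12, 18 dB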
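Adding noise artificially at a target SNR comes down to scaling the noise before mixing. The sketch below shows one common way to do it, assuming SNR is defined on average power over the whole utterance and that the noise recording is at least as long as the speech; the slide does not say how the SNR was measured.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then mix.

    Returns (mixture, gain), where gain is the factor applied to the noise.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain
```

Calling it once per level, for example `mix_at_snr(speech, babble, snr)` for `snr` in `(0, 6, 12, 18)`, would produce the four test conditions listed above.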
Recognition Model
• Training data
  • MAT2000 DB4
• Features
  • 25 ms frames without overlap
  • 13 dimensions (8 cepstra, 4 delta cepstra, delta C0)
• Recognition model
  • 100 2-state RCD initials + 38 2-state CI finals
  • 2 Gaussian mixture components per state
Performance of online τ estimation
[Figure: two result panels, for babble noise at 30° and at 60°; axis unit: dB.]
Reverberation - Subband Filtering-and-Sum
• Introduction
• Maximum Likelihood-based Integration
• Minimum Classification Error-based Integration
Introduction: Reverberant Model
• Noise-free model in the time domain
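The slide's equation did not survive extraction. The standard noise-free time-domain model treats the reverberant signal as the clean speech convolved with a room impulse response, which the sketch below illustrates. The synthetic impulse response (exponentially decaying noise reaching -60 dB at T60) is an assumption for demonstration only, not a measured room response.

```python
import numpy as np

def reverberate(clean, impulse_response):
    """Noise-free reverberation: convolve and keep the original length."""
    return np.convolve(clean, impulse_response)[:len(clean)]

def toy_impulse_response(fs, t60, seed=0):
    """Illustrative impulse response whose envelope falls by 60 dB at t60."""
    n = int(round(fs * t60))
    t = np.arange(n) / fs
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n) * 10.0 ** (-3.0 * t / t60)
    h[0] = 1.0  # direct path
    return h
```

A longer T60 gives a longer impulse response, so each output sample mixes in more of the past signal; this is the smearing that the following slides show in the time and frequency domains.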
Speech Reverberation - Time Domain
[Figure: time-domain waveforms.]
Speech Reverberation - Frequency Domain
[Figure: two panels, clean speech and noisy speech.]
Basic idea of LiMaBeam
Iterative procedure, utterance-based:
• Do beamforming
• Decode the utterance
• Given the most likely HMM state sequence, optimise the beamformer parameters for this sequence
• Stop when the likelihood has converged
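The four steps above form an alternating loop, which can be sketched as a small driver function. Everything concrete here is an assumption: `beamform`, `decode`, and `optimise` are hypothetical callables standing in for the real array processing, HMM decoding, and parameter update, and the stopping rule (likelihood gain below `tol`) is one simple choice among several.

```python
def iterate_until_converged(params, beamform, decode, optimise,
                            tol=1e-4, max_iter=50):
    """Alternate beamforming, decoding, and parameter optimisation.

    beamform(params)        -> enhanced utterance
    decode(enhanced)        -> (state_sequence, log_likelihood)
    optimise(params, states) -> updated params
    Returns (params, log_likelihood from the last decode).
    """
    prev_loglik = float("-inf")
    for _ in range(max_iter):
        states, loglik = decode(beamform(params))
        if loglik - prev_loglik < tol:
            break  # likelihood has stopped improving
        prev_loglik = loglik
        params = optimise(params, states)
    return params, loglik
```

The key design point is that the state sequence is held fixed while the beamformer parameters are updated, then re-estimated with the new parameters on the next pass.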
Subband Likelihood-Maximizing Beamforming
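The filter-and-sum operation in the subband domain can be sketched as follows. This is an illustration under assumptions, not the formulation from the slides: each microphone and subband gets a short complex filter across frames, the taps are applied conjugated, and frames before the start of the utterance are treated as zero. The array layout and the name `subband_filter_and_sum` are choices made for the example.

```python
import numpy as np

def subband_filter_and_sum(X, H):
    """Subband filter-and-sum.

    X: subband signals, shape (num_mics, num_frames, num_subbands)
    H: filter taps,     shape (num_mics, num_taps,   num_subbands)
    Returns Y of shape (num_frames, num_subbands) with
    Y[n, k] = sum over m, p of conj(H[m, p, k]) * X[m, n - p, k].
    """
    num_mics, num_frames, num_subbands = X.shape
    num_taps = H.shape[1]
    Y = np.zeros((num_frames, num_subbands), dtype=complex)
    for p in range(num_taps):
        # Shift every channel by p frames, zero-padding the start.
        delayed = np.zeros_like(X)
        delayed[:, p:, :] = X[:, :num_frames - p, :]
        Y += np.sum(np.conj(H[:, p, :])[:, None, :] * delayed, axis=0)
    return Y
```

With a single tap of value 1/M per microphone this reduces to averaging the channels; the likelihood-maximizing approach would instead choose the taps in each subband so that the recognizer's likelihood is maximized.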
Formulation
TCC300 Reverberation Experiment
• Experimental setting
  • Microphone array with 7 microphones, 5.66 cm between adjacent microphones
  • Speaker 2 m away from the array
  • Room reverberation time T60 = 0.3 to 1.3 s
  • TCC300 database, 29 speakers, each with 5 calibration and 10 test utterances
  • Evaluation by free-syllable decoding, scored by syllable error rate (no language model)
• Experimental results
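The slide does not define how the syllable error rate is computed. A common convention, assumed here, is the edit distance between the reference and recognized syllable sequences (substitutions, deletions, and insertions) divided by the number of reference syllables. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of syllables."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def syllable_error_rate(ref, hyp):
    """Edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition the rate can exceed 1.0 when the recognizer inserts many extra syllables, which matters for free-syllable decoding without a language model.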
Typical Spectrum Examples
[Figure: four panels: clean speech, noisy speech, delay-and-sum output, MCE beamformer output.]
Summary
• Take advantage of the available a priori knowledge, i.e., the underlying recognition model
• Feed the output of the recognizer directly back to the microphone array
• An error-rate criterion (MCE) works better than a likelihood criterion