Kenichi Kumatani, Disney Research, Pittsburgh; Bhiksha Raj, Carnegie Mellon University

Microphone Array Post-filter based on Spatially-Correlated Noise Measurements for Distant Speech Recognition
Kenichi Kumatani, Disney Research, Pittsburgh
Bhiksha Raj, Carnegie Mellon University
Rita Singh, Carnegie Mellon University
John McDonough, Carnegie Mellon University

Organization of Presentation
• Our Goal: Distant Speech Recognition (DSR)
• Backgrounds
• Conventional Post-filtering Methods
• Motivations
• Our Post-filtering Method
• DSR Experiments on Real Array Data
• Conclusions

Our Goal ~ Distant Speech Recognition (DSR) System
Goal: Replace the close-talking microphone with far-field sensors to make human-machine interfaces more interactive.
Overview of our DSR System (block diagram): Microphone array (distant speech) → Speaker tracking (speaker's position) → Beamforming → Post-filtering (enhanced speech) → Speech recognition (recognition result)
Merits of this Approach:
• By using the geometry of the microphone array and the speaker's position, our system has the following merits:
  • stable performance in real environments and
  • straightforward extension to the use of other information sources.
Avoid being blind!

Backgrounds of this Work
Backgrounds:
• Beamforming alone does not provide the optimal solution in the minimum mean square error (MMSE) sense.
• Post-filtering can further improve speech recognition performance.
Basic Block Chart: Multi-channel input vector X → Time delay compensation → Beamforming → Post-filtering H, with the post-filter estimated from the multi-channel input (Post-filter Estimation).
Key issue:
• Estimate the power spectral densities (PSD) of the target and noise signals to build the Wiener filter.

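Conventional Post-filter Design Method 1
Zelinski Post-filter:
• Zelinski assumed that
  • the target and noise signals are uncorrelated,
  • the noise signals are uncorrelated between different channels, and
  • the noise PSD is the same in all channels.
• Then the auto- and cross-spectral densities between two channels $i$ and $j$ simplify, because the three cross terms $\phi_{s n_j}$, $\phi_{n_i s}$ and $\phi_{n_i n_j}$ are zero, to
  $\phi_{x_i x_i} = \phi_{ss} + \phi_{nn}, \qquad \phi_{x_i x_j} = \phi_{ss} \quad (i \neq j)$.
• Substituting these into the Wiener filter formulation gives the Zelinski post-filter for $M$ channels:
  $h_{\mathrm{Z}} = \dfrac{\frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \Re\{\hat{\phi}_{x_i x_j}\}}{\frac{1}{M} \sum_{i=1}^{M} \hat{\phi}_{x_i x_i}}$.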
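The Zelinski estimate, the averaged real part of the cross-spectral densities divided by the averaged auto-spectral densities, can be sketched for a single frequency bin as follows. The function name, the data layout and the clipping of the gain to [0, 1] are assumptions made for illustration and are not taken from the slides.

```python
def zelinski_gain(auto_psd, cross_psd):
    """Zelinski post-filter gain for one frequency bin.

    auto_psd:  list of M real auto-spectral densities, one per channel.
    cross_psd: dict mapping (i, j), i < j, to the complex cross-spectral
               density of channels i and j after time-delay compensation.
    """
    m = len(auto_psd)
    num_pairs = m * (m - 1) // 2
    # Averaged real cross-spectra: estimate of the target PSD.
    target = sum(cross_psd[(i, j)].real
                 for i in range(m) for j in range(i + 1, m)) / num_pairs
    # Averaged auto-spectra: estimate of target PSD plus noise PSD.
    total = sum(auto_psd) / m
    # Clipping to [0, 1] is a practical safeguard (assumption, not from the slides).
    return min(max(target / total, 0.0), 1.0)


# Toy data that satisfies the Zelinski assumptions exactly:
# target PSD 3.0, noise PSD 1.0, noise uncorrelated across channels.
auto = [4.0, 4.0, 4.0]
cross = {(0, 1): 3.0 + 0j, (0, 2): 3.0 + 0j, (1, 2): 3.0 + 0j}
gain = zelinski_gain(auto, cross)  # equals the ideal Wiener gain 3 / (3 + 1)
```

When the noise is spatially correlated, the cross-spectral densities also contain noise power and this estimate becomes biased.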

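Conventional Post-filter Design Method 2
Issues of the Zelinski Post-filter:
• In many situations, the noise signals are spatially correlated.
McCowan Post-filter:
• McCowan and Bourlard introduced the coherence of the diffuse noise field, an indicator of the similarity of the signals at different positions:
  $\Gamma_{ij}(f) = \dfrac{\sin(2\pi f d_{ij}/c)}{2\pi f d_{ij}/c}$,
  where $d_{ij}$ is the distance between sensors $i$ and $j$ and $c$ is the speed of sound. They compute the auto- and cross-spectral densities as
  $\phi_{x_i x_i} = \phi_{ss} + \phi_{nn}, \qquad \phi_{x_i x_j} = \phi_{ss} + \Gamma_{ij}\,\phi_{nn}$.
  The cross-spectral density is different from that of the Zelinski method.
• Then the McCowan post-filter can be written as
  $h_{\mathrm{M}} = \dfrac{\frac{2}{M(M-1)} \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \hat{\phi}_{ss}^{(ij)}}{\frac{1}{M} \sum_{i=1}^{M} \hat{\phi}_{x_i x_i}}$,
  where
  $\hat{\phi}_{ss}^{(ij)} = \dfrac{\Re\{\hat{\phi}_{x_i x_j}\} - \frac{1}{2}\Re\{\Gamma_{ij}\}\,(\hat{\phi}_{x_i x_i} + \hat{\phi}_{x_j x_j})}{1 - \Re\{\Gamma_{ij}\}}$
  is a PSD estimate of the target signal for each sensor pair.
Lefkimmiatis Post-filter:
• Lefkimmiatis et al. model the diffuse noise field more accurately by applying the coherence to the denominator of the McCowan post-filter.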
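As a minimal sketch of the diffuse-field idea, the code below computes the coherence of a spherically isotropic noise field and the per-pair target PSD estimate used by the McCowan post-filter, for one frequency bin. The function names, the speed of sound of 343 m/s and the toy values are assumptions for illustration.

```python
import math


def diffuse_coherence(freq_hz, dist_m, c=343.0):
    """Coherence of a diffuse noise field: sin(x) / x with x = 2*pi*f*d/c."""
    x = 2.0 * math.pi * freq_hz * dist_m / c
    return 1.0 if x == 0.0 else math.sin(x) / x


def mccowan_target_psd(phi_ii, phi_jj, phi_ij, gamma):
    """Target PSD estimate for one sensor pair under the diffuse-field model.

    phi_ii, phi_jj: real auto-spectral densities of channels i and j.
    phi_ij:         complex cross-spectral density of the pair.
    gamma:          real diffuse-field coherence of the pair (must differ from 1).
    """
    return (phi_ij.real - 0.5 * gamma * (phi_ii + phi_jj)) / (1.0 - gamma)


# Toy data generated from the diffuse model itself (target 3.0, noise 1.0),
# so the estimate should recover the target PSD.
gamma = diffuse_coherence(500.0, 0.1)
phi_ss, phi_nn = 3.0, 1.0
phi_ii = phi_jj = phi_ss + phi_nn
phi_ij = complex(phi_ss + gamma * phi_nn, 0.0)
estimate = mccowan_target_psd(phi_ii, phi_jj, phi_ij, gamma)
```

At low frequencies or small spacings the coherence approaches 1 and the division becomes ill-conditioned, so practical code usually caps `gamma` below 1; that cap is omitted here for brevity.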

Motivation of our Method
Common Problem of Conventional Methods:
• A static noise field model does not match every situation.
Example of Noise Coherence in a Car:
• Figures show the magnitude-squared coherence observed in a car in two states: engine idling, and driving at a speed of 65 mph.
• It is clear that the actual noise field is neither uncorrelated nor diffuse.
Our Motivation: measure the most dominant noise signal instead of relying on static noise field assumptions.
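The magnitude-squared coherence (MSC) between two channels can be estimated from subband samples by averaging over frames. The sketch below shows the standard estimator for a single frequency bin; the function name and the toy signals are assumptions, and no car data from the slides is reproduced.

```python
def magnitude_squared_coherence(x_frames, y_frames):
    """MSC of two channels in one frequency bin.

    x_frames, y_frames: equally long lists of complex subband samples,
    one sample per frame. Averaging over frames is essential: with a
    single frame the estimate is always 1.
    """
    k = len(x_frames)
    phi_xy = sum(x * y.conjugate() for x, y in zip(x_frames, y_frames)) / k
    phi_xx = sum(abs(x) ** 2 for x in x_frames) / k
    phi_yy = sum(abs(y) ** 2 for y in y_frames) / k
    return abs(phi_xy) ** 2 / (phi_xx * phi_yy)


# Fully coherent case: channel y is a scaled, phase-shifted copy of x.
x = [1 + 1j, 2 - 1j, -1 + 0.5j, 0.5 - 2j]
y = [0.5j * sample for sample in x]
msc_coherent = magnitude_squared_coherence(x, y)

# Incoherent case: the cross terms cancel when averaged over frames.
u = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]
v = [1 + 0j, -1 + 0j, 1 + 0j, -1 + 0j]
msc_incoherent = magnitude_squared_coherence(u, v)
```

An uncorrelated noise field would give values near 0 for every sensor pair, and a diffuse field would follow the squared sinc curve, so plotting this quantity against frequency is one way to check either assumption.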

Our Strategy: How can we measure a noise signal?
• Estimate the speaker's position,
• build a beamformer and steer a beam toward the target source,
• find where the most dominant interfering source is, and
• build another beamformer to measure the noise signal.
Block diagram: the microphones feed Beamformer 1 (for the target speech), which outputs the enhanced speech, and Beamformer 2 (noise extractor, steered toward the noise source), which outputs the separated noise; the post-filter uses both for further noise removal.

Our Post-filter System
• We build a maximum negentropy beamformer for the target source and
• a null-steering beamformer for extracting the noise signal.
Block diagram: the input $X$ feeds the maximum negentropy beamformer $(w_{SD} - B w_a)^H$ for the target source and the null-steering beamformer $w_{null}^H$ for the noise source; both outputs feed the post-filter estimation, which yields the post-filter $H_p$.

Our Post-filter System: Maximum Negentropy (MN) Beamformer (Speech emphasizer)
Maximum Negentropy Criterion:
• The distribution of clean speech is non-Gaussian and
• that of noisy and reverberant speech becomes closer to Gaussian.
• Negentropy is an indicator of how far the distribution of a signal is from Gaussian.
Maximum Negentropy Beamformer:
• Build a super-directive beamformer for the quiescent vector $w_{SD}$.
• Compute the blocking matrix $B$ to maintain the distortionless constraint for the look direction, $B^H w_{SD} = 0$.
• Find the active weight vector which provides the maximum negentropy $J$ of the output:
  $w_a = \arg\max_{w_a} J(Y_{SDMN}), \qquad Y_{SDMN} = (w_{SD} - B w_a)^H X$.
Advantage:
• We can enhance a structured-information signal coming from the direction of interest without signal cancellation and distortion.
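The quiescent-vector and blocking-matrix structure can be illustrated with a small sketch. It deliberately simplifies two things, both assumptions: a delay-and-sum vector stands in for the super-directive quiescent vector, and the active weights are arbitrary numbers rather than the result of negentropy maximization. The array geometry and all names are hypothetical. The point is only that the look-direction gain stays at 1 whatever the active weights are.

```python
import cmath
import math


def manifold(num_mics, spacing_m, angle_rad, freq_hz, c=343.0):
    """Far-field array manifold vector of a uniform linear array."""
    return [cmath.exp(-2j * math.pi * freq_hz * m * spacing_m
                      * math.cos(angle_rad) / c) for m in range(num_mics)]


def blocking_matrix(v):
    """Columns b_k with b_k^H v = 0 (valid because every |v_m| = 1)."""
    cols = []
    for k in range(len(v) - 1):
        col = [0j] * len(v)
        col[k] = v[k]
        col[k + 1] = -v[k + 1]
        cols.append(col)
    return cols


def inner(w, x):
    """Hermitian inner product w^H x."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))


def gsc_weights(w_q, cols, w_a):
    """Overall weights w = w_q - B w_a, with B given by its columns."""
    return [w_q[m] - sum(cols[k][m] * w_a[k] for k in range(len(cols)))
            for m in range(len(w_q))]


v = manifold(4, 0.05, math.radians(60.0), 1000.0)
w_q = [vm / len(v) for vm in v]          # delay-and-sum stand-in for w_SD
B = blocking_matrix(v)
w_a = [0.3 - 0.1j, -0.2 + 0.4j, 0.1 + 0.2j]  # arbitrary, not optimized
w = gsc_weights(w_q, B, w_a)
look_gain = inner(w, v)                  # stays at 1 for any w_a
```

Because the constraint is enforced by construction, the search for the active weights can be unconstrained; the negentropy objective and its optimizer are left out of this sketch.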

Our Post-filter System: Null-Steering Beamformer (Noise extractor)
Null-steering Beamformer (Noise Extractor):
• Place a null on the direction of interest (DOI) while maintaining unity gain for the direction of the noise source.
• Given the array manifold vectors $v$ for the target source and $v_N$ for the noise source, we obtain the beamformer's weights by solving the linear equation
  $[\,v \;\; v_N\,]^H w_{null} = [\,0 \;\; 1\,]^T$.
Advantage:
• We can extract the noise signal alone by eliminating the target signal arriving directly from the source position.
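With more than two microphones the two constraints leave the weights underdetermined, so some rule is needed to pick one solution. The sketch below takes the minimum-norm solution, which is an assumption and not necessarily the choice made by the authors; the array geometry, source angles and function names are likewise hypothetical.

```python
import cmath
import math


def manifold(num_mics, spacing_m, angle_rad, freq_hz, c=343.0):
    """Far-field array manifold vector of a uniform linear array."""
    return [cmath.exp(-2j * math.pi * freq_hz * m * spacing_m
                      * math.cos(angle_rad) / c) for m in range(num_mics)]


def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(ai.conjugate() * bi for ai, bi in zip(a, b))


def null_steering_weights(v, v_n):
    """Minimum-norm w with v^H w = 0 (null) and v_n^H w = 1 (unity gain).

    Closed form of w = C (C^H C)^{-1} g for C = [v v_n], g = [0, 1]^T.
    """
    a = inner(v, v).real
    d = inner(v_n, v_n).real
    b = inner(v, v_n)
    det = a * d - abs(b) ** 2
    if det < 1e-12:
        raise ValueError("target and noise directions are not separable")
    return [(-b * vi + a * ni) / det for vi, ni in zip(v, v_n)]


v_target = manifold(4, 0.05, math.radians(60.0), 1000.0)
v_noise = manifold(4, 0.05, math.radians(120.0), 1000.0)
w_null = null_steering_weights(v_target, v_noise)
target_response = inner(w_null, v_target)  # w^H v, close to 0
noise_response = inner(w_null, v_noise)    # w^H v_N, close to 1
```

The null only removes the direct-path target signal, so reflections of the target can still leak into the extracted noise.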

Our Post-filter System
Our post-filter design:
• Now that we have an estimate of the target signal, $Y_{SDMN} = (w_{SD} - B w_a)^H X$, and
• a noise observation, $Y_{null} = w_{null}^H X$,
• we can design the post-filter from these two signals.
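One plausible way to turn the two beamformer outputs into a post-filter is a Wiener-type gain built from recursively smoothed power estimates. The sketch below is an assumption for illustration only and is not the authors' formula; the smoothing factor `alpha` and the function name are hypothetical.

```python
def postfilter_gains(target_frames, noise_frames, alpha=0.6):
    """Wiener-type gain per frame for one frequency bin.

    target_frames: complex outputs of the target beamformer (Y_SDMN).
    noise_frames:  complex outputs of the null-steering beamformer (Y_null).
    alpha:         forgetting factor of the recursive PSD smoothing (assumed).
    """
    phi_ss = phi_nn = 0.0
    gains = []
    for k, (y_t, y_n) in enumerate(zip(target_frames, noise_frames)):
        p_t, p_n = abs(y_t) ** 2, abs(y_n) ** 2
        if k == 0:
            phi_ss, phi_nn = p_t, p_n  # initialize with the first frame
        else:
            phi_ss = alpha * phi_ss + (1.0 - alpha) * p_t
            phi_nn = alpha * phi_nn + (1.0 - alpha) * p_n
        denom = phi_ss + phi_nn
        # Pass the signal unchanged when both estimates are zero.
        gains.append(phi_ss / denom if denom > 0.0 else 1.0)
    return gains


# Stationary toy input: target power 4, noise power 1 in every frame.
gains = postfilter_gains([2 + 0j] * 3, [1 + 0j] * 3)
```

Because both PSD estimates are updated every frame from measured signals, the gain follows changes in the noise without any fixed coherence model.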

Distant Speech Recognition Experiments

Speech Recognition Results: Word Error Rates in Different Conditions
(Figure: word error rate for each condition.)

Conclusions
• We used actual noise measurements for the microphone array post-filter.
• It turned out that the noise fields in car conditions are neither uncorrelated nor spherically isotropic (diffuse).
• It has been demonstrated that our post-filter method can provide the best recognition performance among the popular post-filter methods.
• This is because our method can update the noise PSD adaptively without any static noise coherence assumption.

Thank you

Speech Samples (65-Wind): Single Distant Channel; Post-filtered Speech; Extracted Noise Signal

Actual Speech Distribution ~ Super-Gaussian
Figure: Distributions of clean speech compared with super-Gaussian distributions.
*The histograms are computed from the real part of actual subband samples.
• The distribution of speech is not Gaussian but super-Gaussian.
• It has "spiky" and "heavy-tailed" characteristics.
How about maximizing a degree of super-Gaussianity?

Why do we need non-Gaussianity measures?
The reasoning rests on two points:
• The distribution of a sum of independent random variables (r.v.s) approaches Gaussian in the limit as more components are added.
• Information-bearing signals have a structure which makes them predictable.
If we want the original independent components which bear information, we have to look for a signal that is not Gaussian.
Figures: Distributions of clean and noise-corrupted speech; distributions of clean and reverberated speech.
• The distributions of noise-corrupted and reverberated speech are closer to Gaussian than that of clean speech.
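The first point can be checked with a small simulation: summing more independent heavy-tailed components pushes the distribution toward Gaussian. The sketch uses Laplace-distributed components and the normalized excess kurtosis as the non-Gaussianity measure; both are assumptions chosen for illustration, as are the sample size and seed.

```python
import random


def excess_kurtosis(samples):
    """Normalized excess kurtosis: 0 for Gaussian, positive for heavy tails."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 ** 2) - 3.0


def laplace_sample(rng):
    """One draw from a zero-mean Laplace distribution with unit scale."""
    magnitude = rng.expovariate(1.0)
    return magnitude if rng.random() < 0.5 else -magnitude


def mixture_kurtosis(num_components, num_samples=20000, seed=0):
    """Excess kurtosis of a sum of independent Laplace components."""
    rng = random.Random(seed)
    sums = [sum(laplace_sample(rng) for _ in range(num_components))
            for _ in range(num_samples)]
    return excess_kurtosis(sums)


kurt_single = mixture_kurtosis(1)  # one source: clearly super-Gaussian
kurt_mixed = mixture_kurtosis(8)   # eight sources added: much closer to 0
```

In theory the excess kurtosis of a sum of n such components is 3/n, so the mixed value shrinks toward the Gaussian value of zero as components are added.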

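Negentropy Criterion for super-Gaussianity
Definition of entropy:
• The entropy of an r.v. $Y$ with pdf $p_Y$ is defined as
  $H(Y) = -\int p_Y(y) \log p_Y(y)\, dy = -E\{\log p_Y(Y)\}$.
• Entropy indicates the degree of uncertainty of information.
Definition of negentropy:
• Negentropy is defined as the difference between the entropy of a Gaussian r.v. and that of a super-Gaussian r.v.:
  $J(Y) = H(Y_{gauss}) - H(Y)$,
  where $H(Y_{gauss})$ is the entropy of a Gaussian r.v. with the same variance as $Y$ and $H(Y)$ is the entropy of the super-Gaussian r.v.
• Negentropy indicates how far the distribution of the r.v. is from Gaussian; a higher value means a larger distance.
• Negentropy is generally more robust than the other criterion (kurtosis).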
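As a worked example of the definition, the negentropy of a Laplace r.v. can be computed in closed form from the textbook differential entropies of the Gaussian and Laplace distributions. The Laplace pdf is used here only as a convenient real-valued super-Gaussian example; it is an assumption and not necessarily the pdf model used by the authors.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy of a real Gaussian r.v. (in nats)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def laplace_entropy(variance):
    """Differential entropy of a Laplace r.v. with the given variance.

    A Laplace pdf with scale b has variance 2 * b**2 and entropy 1 + ln(2b).
    """
    scale = math.sqrt(variance / 2.0)
    return 1.0 + math.log(2.0 * scale)


def laplace_negentropy(variance):
    """J(Y) = H(Y_gauss) - H(Y) for a Laplace r.v. Y."""
    return gaussian_entropy(variance) - laplace_entropy(variance)


negentropy = laplace_negentropy(1.0)  # about 0.0724 nats
```

The result is positive because the Gaussian has the largest entropy among all distributions of a given variance, and it does not depend on the variance, so negentropy measures shape rather than scale.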

Analysis of the MN Beamforming Algorithm
Simulated environment by the image method (figure: target source, its image and the reflection path; labels 30°, 4 m, 70.9°).
• Signal cancellation will occur because of the strong reflection.
Figure: beam patterns at 650 Hz and 1600 Hz.
• Observe that MN beamforming can enhance the target signal by strengthening the reflection, which suggests it does not suffer from signal cancellation.

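Measures for non-Gaussianity
• Negentropy
• Empirical kurtosis
Definition of kurtosis:
The kurtosis of an r.v. $Y$ is defined as
  $\mathrm{kurt}(Y) = E\{|Y|^4\} - \beta\,(E\{|Y|^2\})^2$,
where $\beta$ is a positive value.
• Super-Gaussian pdfs: positive kurtosis,
• Sub-Gaussian pdfs: negative kurtosis,
• The Gaussian pdf: zero kurtosis.
Kurtosis can measure the degree of non-Gaussianity.
Empirical approximation of kurtosis:
  $\widehat{\mathrm{kurt}}(Y) = \dfrac{1}{K} \sum_{k=0}^{K-1} |Y(k)|^4 - \beta \left( \dfrac{1}{K} \sum_{k=0}^{K-1} |Y(k)|^2 \right)^2$,
where $K$ is the number of frames.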
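The empirical kurtosis is just two sample averages, as the sketch below shows. The default `beta=3.0` is the value for real-valued zero-mean data and is an assumption; the slide states only that the constant is positive. The function name and toy sequences are hypothetical.

```python
def empirical_kurtosis(samples, beta=3.0):
    """Fourth moment minus beta times the squared second moment.

    samples: real or complex subband samples Y(k), one per frame.
    beta:    positive constant (3.0 assumed here, for real-valued data).
    """
    k = len(samples)
    fourth_moment = sum(abs(y) ** 4 for y in samples) / k
    second_moment = sum(abs(y) ** 2 for y in samples) / k
    return fourth_moment - beta * second_moment ** 2


# Spiky sequence: mostly zeros with one large value (super-Gaussian-like).
spiky = empirical_kurtosis([0.0, 0.0, 0.0, 2.0])
# Flat sequence: constant magnitude (sub-Gaussian-like).
flat = empirical_kurtosis([1.0, -1.0, 1.0, -1.0])
```

Both sequences have the same average power, yet the spiky one yields a positive value and the flat one a negative value. The fourth power also makes the estimate sensitive to a few large samples.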
