Spatial Covariance Models For Under-Determined Reverberant Audio Source Separation

N. Duong, E. Vincent and R. Gribonval
METISS project team, IRISA/INRIA, France
Oct. 2009

Content
• Under-determined source separation
• Spatial covariance models
• Model parameter estimation
• Experimental evaluation
• Conclusion

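Under-determined source separation
• Use $I$ recorded mixture signals to separate $J$ sources, where $I < J$.
• Convolutive mixing model: let $\mathbf{h}_j(\tau)$ denote the vector of mixing filters from source $j$ to the microphone array. The contribution $\mathbf{c}_j(t)$ of source $s_j(t)$ to all microphones and the vector of mixture signals $\mathbf{x}(t)$ are computed as:
$$\mathbf{c}_j(t) = \sum_{\tau} \mathbf{h}_j(\tau)\, s_j(t-\tau), \qquad \mathbf{x}(t) = \sum_{j=1}^{J} \mathbf{c}_j(t)$$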
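The convolutive mixing model can be illustrated with a minimal NumPy sketch. The function names `source_image` and `mix`, the random signals and the filter length are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np

def source_image(s, h):
    """Spatial image of one source: c[i, t] = sum_tau h[i, tau] * s[t - tau].

    s: (T,) source signal
    h: (I, L) mixing filters from the source to each of the I microphones
    """
    T = len(s)
    # Full convolution per channel, truncated to the source length.
    return np.stack([np.convolve(s, h_i)[:T] for h_i in h])

def mix(sources, filters):
    """Mixture x(t) = sum_j c_j(t) for sources (J, T) and filters (J, I, L)."""
    return sum(source_image(s, h) for s, h in zip(sources, filters))

rng = np.random.default_rng(0)
J, I, T, L = 3, 2, 1000, 16            # hypothetical sizes: 3 sources, 2 microphones
sources = rng.standard_normal((J, T))
filters = rng.standard_normal((J, I, L))
x = mix(sources, filters)              # (I, T) mixture with I < J (under-determined)
```

With three sources and two microphones the mixture has fewer channels than sources, which is the under-determined case considered here.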

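BSS approaches
• Short-time Fourier transform (STFT)
• Sparsity assumption: only FEW sources are active at each time-frequency point
- Binary masking: only ONE source is active at each time-frequency point
- L1-norm minimization: the source STFT coefficients are chosen to minimize the sum of their magnitudes while reproducing the mixture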

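Beamforming model
• The steering vector is denoted as $\mathbf{h}_j^{\mathrm{an}}(f)$ and approximated from the distance $r_{ij}$ between each source $j$ and each microphone $i$ [T. Gustafsson et al.], i.e. in a stereo mixture:
$$\mathbf{h}_j^{\mathrm{an}}(f) = \begin{bmatrix} \kappa_{1j}\, e^{-2i\pi f \tau_{1j}} \\ \kappa_{2j}\, e^{-2i\pi f \tau_{2j}} \end{bmatrix}$$
where $\tau_{ij} = r_{ij}/c$ is the propagation delay, $c$ is the speed of sound and $\kappa_{ij}$ is the distance-dependent attenuation.
• Gaussian model of the source images:
$$\mathbf{R}_{\mathbf{c}_j}(n,f) = v_j(n,f)\, \mathbf{R}_j(f), \qquad \mathbf{R}_j(f) = \mathbf{h}_j^{\mathrm{an}}(f)\, \mathbf{h}_j^{\mathrm{an}}(f)^H$$
- $\mathbf{R}_{\mathbf{c}_j}(n,f)$: covariance matrix of the source images
- $\mathbf{R}_j(f)$: spatial covariance matrix (rank 1) modeling the mixing process
- $v_j(n,f)$: source variance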

Spatial covariance models
• Purpose of the paper: explore extensions of the Gaussian framework that better account for reverberation
• We evaluate the potential separation performance by estimating the spatial model parameters from training data
• Source separation by Wiener filtering
• Models for the spatial covariance matrix:
- Rank-1 convolutive model
- Rank-1 anechoic model
- Full-rank direct+diffuse model
- Full-rank unconstrained model
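Separation by Wiener filtering can be sketched as follows for a single time-frequency point. This assumes the usual multichannel Wiener filter under the Gaussian model, where each source image has covariance v_j R_j and the mixture covariance is their sum; the function name `wiener_separate` and the random test matrices are illustrative assumptions.

```python
import numpy as np

def wiener_separate(x, v, R):
    """Estimate the source images at one time-frequency point.

    x: (I,) complex mixture STFT vector
    v: (J,) nonnegative source variances
    R: (J, I, I) spatial covariance matrices
    Returns a (J, I) array whose row j is v_j R_j (sum_k v_k R_k)^{-1} x.
    """
    Rc = v[:, None, None] * R            # covariance of each source image
    Rx = Rc.sum(axis=0)                  # covariance of the mixture
    return Rc @ np.linalg.solve(Rx, x)   # apply each Wiener gain to x

rng = np.random.default_rng(0)
J, I = 3, 2                              # hypothetical: 3 sources, stereo mixture
A = rng.standard_normal((J, I, I)) + 1j * rng.standard_normal((J, I, I))
R = A @ A.conj().transpose(0, 2, 1)      # random full-rank Hermitian matrices
v = rng.uniform(0.5, 2.0, size=J)
x = rng.standard_normal(I) + 1j * rng.standard_normal(I)
c_hat = wiener_separate(x, v, R)
```

By construction the estimated source images add up to the mixture, so the filter redistributes the observed signal among the sources rather than discarding any of it.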

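Rank-1 models
• Rank-1 anechoic model:
$$\mathbf{R}_j(f) = \mathbf{h}_j^{\mathrm{an}}(f)\, \mathbf{h}_j^{\mathrm{an}}(f)^H$$
where $\mathbf{h}_j^{\mathrm{an}}(f)$ is the steering vector specified in the beamforming approach
• Rank-1 convolutive model:
$$\mathbf{R}_j(f) = \mathbf{h}_j(f)\, \mathbf{h}_j(f)^H$$
where $\mathbf{h}_j(f)$ is the Fourier transform of the mixing filters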
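A minimal sketch of the rank-1 anechoic model follows. It assumes a free-field steering vector whose entries have a delay r/c and a spherical-spreading gain 1/(sqrt(4 pi) r); that gain, the speed of sound and the example distances are assumptions for illustration and may differ from the exact expression used on the slide.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def steering_vector(f, r):
    """Anechoic steering vector at frequency f (Hz).

    r: source-to-microphone distances in metres, one per microphone.
    """
    r = np.asarray(r, dtype=float)
    kappa = 1.0 / (np.sqrt(4.0 * np.pi) * r)   # assumed spherical-spreading gain
    tau = r / C_SOUND                          # propagation delays in seconds
    return kappa * np.exp(-2j * np.pi * f * tau)

def rank1_covariance(h):
    """Rank-1 spatial covariance matrix h h^H."""
    return np.outer(h, h.conj())

h_an = steering_vector(1000.0, [1.20, 1.25])   # hypothetical stereo geometry
R_an = rank1_covariance(h_an)
```

The rank-1 convolutive model has the same outer-product form, with the steering vector replaced by the Fourier transform of the measured mixing filters.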

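Full-rank direct+diffuse model
• Assuming that the direct part and the reverberant part are uncorrelated and that the reverberant part is diffuse:
$$\mathbf{R}_j(f) = \mathbf{h}_j^{\mathrm{an}}(f)\, \mathbf{h}_j^{\mathrm{an}}(f)^H + \sigma_{\mathrm{rev}}^2\, \boldsymbol{\Psi}(f)$$
• where the reverberant power $\sigma_{\mathrm{rev}}^2$ and the diffuse coherence matrix $\boldsymbol{\Psi}(f)$ can be specified from statistical room acoustics, i.e. they depend on the microphone distance $d$, the wall area $A$ and the wall reflection coefficient $\beta$
- In the rectangular room: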
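The direct+diffuse construction can be sketched for a stereo pair as below. The sketch assumes the classical diffuse-field coherence for omnidirectional microphones, sin(2 pi f d / c) / (2 pi f d / c), and takes the reverberant power as a plain input parameter; the names `diffuse_coherence`, `direct_plus_diffuse` and all numeric values are hypothetical.

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def diffuse_coherence(f, d):
    """2x2 coherence matrix of a diffuse field for microphone spacing d (m)."""
    # np.sinc(x) = sin(pi x) / (pi x), so this equals sin(2 pi f d / c) / (2 pi f d / c).
    omega = np.sinc(2.0 * f * d / C_SOUND)
    return np.array([[1.0, omega], [omega, 1.0]])

def direct_plus_diffuse(h_an, f, d, sigma2_rev):
    """Full-rank model: direct-path outer product plus scaled diffuse coherence."""
    return np.outer(h_an, h_an.conj()) + sigma2_rev * diffuse_coherence(f, d)

h_an = np.array([0.20, 0.19 * np.exp(-0.9j)])  # hypothetical steering vector
R_dd = direct_plus_diffuse(h_an, f=1000.0, d=0.20, sigma2_rev=0.05)
```

Unlike the rank-1 models, the resulting matrix is full rank whenever the reverberant power is positive and the coherence magnitude is below one.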

Full-rank unconstrained model
- A more general model than the previous ones, in which the coefficients of $\mathbf{R}_j(f)$ are not related a priori
- Allows more flexible modeling of the mixing process, since in practice the reverberant part is rarely diffuse and is correlated with the direct part
- Expected to improve the separation performance on real-world convolutive mixtures

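Model parameter estimation
• We investigate the potential separation performance achievable via each model in two contexts:
• Semi-blind context: the spatial covariance matrices $\mathbf{R}_j(f)$ are estimated from the true source images, but the source variances $v_j(n,f)$ are blindly estimated from the mixture in the ML sense:
$$\hat{v}_j(n,f) = \arg\min_{v_j(n,f) \ge 0} \sum_{n,f} D_{KL}\big(\hat{\mathbf{R}}_{\mathbf{x}}(n,f) \,\big|\, \mathbf{R}_{\mathbf{x}}(n,f)\big), \qquad \mathbf{R}_{\mathbf{x}}(n,f) = \sum_{j=1}^{J} v_j(n,f)\, \mathbf{R}_j(f)$$
where $D_{KL}$ is the Kullback-Leibler (KL) divergence between the empirical covariance matrices $\hat{\mathbf{R}}_{\mathbf{x}}(n,f)$ and the model-based matrices $\mathbf{R}_{\mathbf{x}}(n,f)$.
• Oracle context: both $\mathbf{R}_j(f)$ and $v_j(n,f)$ are estimated from the true source images.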
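The KL criterion used in the semi-blind context can be sketched for one time-frequency point. The sketch assumes the standard divergence between two zero-mean Gaussians, 0.5 * (tr(R^{-1} R_hat) - log det(R^{-1} R_hat) - I); the constant factor does not change the minimizer. The function names and random matrices are illustrative assumptions, and no optimizer over the variances is shown.

```python
import numpy as np

def mixture_covariance(v, R):
    """Model-based mixture covariance sum_j v_j R_j for v (J,) and R (J, I, I)."""
    return np.tensordot(v, R, axes=1)

def kl_gaussian(R_hat, R):
    """KL divergence between zero-mean Gaussians with covariances R_hat and R."""
    I = R.shape[0]
    M = np.linalg.solve(R, R_hat)              # R^{-1} R_hat
    _, logdet = np.linalg.slogdet(M)           # log |det M|, real-valued
    return 0.5 * (np.trace(M).real - logdet - I)

rng = np.random.default_rng(1)
J, I = 3, 2
A = rng.standard_normal((J, I, I)) + 1j * rng.standard_normal((J, I, I))
R = A @ A.conj().transpose(0, 2, 1)            # hypothetical spatial covariances
v_true = np.array([1.0, 0.5, 2.0])
R_hat = mixture_covariance(v_true, R)          # stands in for the empirical covariance

d_same = kl_gaussian(R_hat, mixture_covariance(v_true, R))
d_other = kl_gaussian(R_hat, mixture_covariance(2.0 * v_true, R))
```

The divergence vanishes when the model variances reproduce the empirical covariance and grows as they move away from it, which is what makes it usable as a fitting criterion.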

Experiment
• Purpose:
- Compare the source separation performance of the model-based algorithms
- Criteria: SDR, SIR, SAR
• Experimental setup:
- Room dimensions: 4.45 x 3.35 x 2.5 m
- Source and microphone height: 1.4 m
- Microphone distance: d = 20 cm or 5 cm
- Source-to-microphone distance: 120 cm or 50 cm
- Speech length: 5 seconds
- Sampling rate: 16 kHz
- Sine window of length 1024 samples for the STFT
[Figure: room layout showing sources s1, s2, s3 at distance r from microphones m1 and m2, with dimension labels 1.8 m and 1.5 m]
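The analysis settings above (16 kHz, 5 s signals, sine window of length 1024) can be sketched as a simple STFT. The 50% overlap, the half-sample offset in the window definition and the function name `stft` are assumptions; the slides do not state them.

```python
import numpy as np

FS = 16000          # sampling rate in Hz
N = 1024            # window length in samples
HOP = N // 2        # assumed 50% overlap

# Sine window (half-sample offset assumed).
window = np.sin(np.pi * (np.arange(N) + 0.5) / N)

def stft(x):
    """STFT of a 1-D signal; returns an array of shape (n_frames, N // 2 + 1)."""
    n_frames = 1 + (len(x) - N) // HOP
    frames = np.stack([x[k * HOP:k * HOP + N] * window for k in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

x = np.random.default_rng(0).standard_normal(5 * FS)   # 5 s of placeholder noise
X = stft(x)
```

With this overlap the squared sine windows of adjacent frames sum to one, so applying the same window at synthesis gives perfect reconstruction in the overlapped region.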

Experimental result

Conclusion
• Summary:
- Proposed to model the convolutive mixing process by full-rank spatial covariance matrices
- Experimental results confirm that full-rank spatial covariance matrices better account for reverberation and potentially improve separation performance compared to rank-1 matrices
• Work in progress:
- Validation of the proposed algorithms on real-world recordings with small source movements (demo session)
- Blind context: learning the model parameters from the recorded mixture (submitted to ICASSP 2010)
• Future work:
- Consider the separation of diffuse and semi-diffuse sources

Thanks for your attention! See you again in the demo session tonight. Your comments…?
