This talk discusses audio source separation techniques such as beamforming, ICA, and CASA. It covers frequency domain frameworks, using geometric information, and examples of ICA as a beamformer. The talk also explores computational tools for audio source separation including CASA and Blind Source Separation (BSS). It reviews ICA and its use in audio source separation, as well as different approaches to addressing the permutation problem. The talk concludes with a discussion on beamforming for source separation and its application in real room environments.
Audio Source Separation and ICA
by Mike Davies & Nikolaos Mitianoudis
Digital Signal Processing Lab, Queen Mary, University of London
Outline of Talk
• Introduction
  • Audio Source Separation: Beamforming, ICA & CASA
• ICA for source separation
  • dealing with convolutive mixtures
• A Frequency Domain Framework
  • unmixing in the frequency domain
  • source modelling & the permutation problem
• Beamforming for Source Separation
  • Using geometric information
• A Reverberant BSS Example
  • ICA as a beamformer
  • Real reverberant transfer functions
  • Using beamforming with ICA
  • Moving sources
Computational tools for audio source separation
• Computational Auditory Scene Analysis (CASA)
  • Typically extracts one source from a single channel of audio using heuristic psychological grouping rules (pattern matching).
• Blind Source Separation (BSS, aka ICA)
  • Uses spatial diversity based on source independence. Extensions include convolutive mixing and overcomplete mixtures.
• Beamforming
  • Uses spatial diversity based on the known geometry of the microphone array and the directions of arrival (DOA) of the source signals.
Review of ICA
• The ICA model: x(t) = A s(t)
• Aim: estimate s(t) from x(t) (mixing matrix A unknown).
• If no. of sources = no. of observations, we can estimate s(t) by estimating W = A^{-1} to give s = Wx.
• A is identifiable (up to scaling and permutation of the sources) if we assume the sources are statistically independent, p(s) = ∏_i p_i(s_i), and non-Gaussian.
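The model above can be illustrated with a minimal NumPy sketch. The mixing matrix `A`, the Laplacian sources and the sample count are made-up values for illustration; in a real BSS problem `A` is unknown and `W` has to be estimated rather than computed by inversion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent, non-Gaussian (Laplacian) sources, 1000 samples each.
s = rng.laplace(size=(2, 1000))

# Hypothetical mixing matrix (unknown in a real BSS problem).
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

x = A @ s              # observations: x(t) = A s(t)
W = np.linalg.inv(A)   # ideal unmixing matrix: W = A^{-1}
s_hat = W @ x          # recovered sources: s = W x

max_error = np.max(np.abs(s_hat - s))
```

With the true inverse the sources are recovered to numerical precision; ICA aims for the same result using only the independence assumption.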
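ICA for Audio Source Separation
• Audio observations are linear convolutions of the sources (plus additive noise): x_i(t) = Σ_j Σ_k a_ij(k) s_j(t − k) + n_i(t)
• The unmixing filter uses an FIR approximation (complete case): s_i(t) = Σ_j Σ_k w_ij(k) x_j(t − k)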
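A minimal sketch of noise-free convolutive mixing, x_i(t) = Σ_j Σ_k a_ij(k) s_j(t − k), is given below. The filter taps (a direct path plus one delayed cross-talk path per microphone) are hypothetical values chosen only for illustration, and the function name `convolutive_mix` is not from the talk.

```python
import numpy as np

def convolutive_mix(sources, filters):
    """Mix sources through FIR filters.

    sources: array of shape (n_sources, T)
    filters: array of shape (n_mics, n_sources, K), filters[i, j] = a_ij(k)
    Returns observations of shape (n_mics, T + K - 1).
    """
    n_mics, n_sources, K = filters.shape
    T = sources.shape[1]
    x = np.zeros((n_mics, T + K - 1))
    for i in range(n_mics):
        for j in range(n_sources):
            # Full linear convolution of source j with the path j -> mic i.
            x[i] += np.convolve(sources[j], filters[i, j])
    return x

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 200))

# Hypothetical room filters: direct paths plus delayed cross-talk.
a = np.zeros((2, 2, 8))
a[0, 0, 0] = 1.0
a[0, 1, 3] = 0.5   # source 2 reaches mic 1 three samples late
a[1, 0, 2] = 0.4   # source 1 reaches mic 2 two samples late
a[1, 1, 0] = 1.0

x = convolutive_mix(s, a)
```

The FIR unmixing stage has the same form, so the same function can be applied to `x` with a set of unmixing filters w_ij(k) in place of `a`.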
Frequency (subband) filtering
[Figure: block diagram of subband unmixing. Inputs x1, x2, x3; L-point STFT blocks; per-bin unmixing matrices W1 … Wn mapping X(ω) to S(ω); L-point ISTFT; outputs s1, s2, s3.]
• The unmixing filtering can be efficiently performed within a subband framework.
• This does not necessarily imply a frequency domain model for the sources.
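The subband filtering step amounts to one matrix multiplication per frequency bin, S(ω, t) = W(ω) X(ω, t). The sketch below assumes the STFT frames have already been computed and uses made-up spectra and mixing matrices; the array layout and the name `unmix_subbands` are illustrative choices.

```python
import numpy as np

def unmix_subbands(X, W):
    """Apply one unmixing matrix per frequency bin.

    X: complex STFT of the observations, shape (n_mics, n_bins, n_frames)
    W: unmixing matrices, shape (n_bins, n_sources, n_mics)
    Returns source estimates of shape (n_sources, n_bins, n_frames).
    """
    return np.einsum('fsm,mft->sft', W, X)

# Synthetic check with made-up source spectra and per-bin mixing matrices.
rng = np.random.default_rng(0)
n_src, n_bins, n_frames = 2, 5, 20
S = (rng.normal(size=(n_src, n_bins, n_frames))
     + 1j * rng.normal(size=(n_src, n_bins, n_frames)))
A = (3.0 * np.eye(n_src)
     + 0.5 * rng.normal(size=(n_bins, n_src, n_src))
     + 0.5j * rng.normal(size=(n_bins, n_src, n_src)))

X = np.einsum('fms,sft->mft', A, S)           # X(w, t) = A(w) S(w, t)
S_hat = unmix_subbands(X, np.linalg.inv(A))   # ideal W(w) = A(w)^{-1}
subband_error = np.max(np.abs(S_hat - S))
```

Because each bin is unmixed independently, nothing in this step ties the row order of `W` in one bin to the row order in the next.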
ML Natural Gradient Algorithm
• Various authors have suggested a simple gradient-based algorithm for ICA: ΔW = η (I − φ(s) s^T) W
• This can be viewed as a Maximum Likelihood estimator with φ(s) = −∂ log p(s) / ∂s.
• φ(s) often takes a tanh-like shape (super-Gaussian prior).
• For convolutive mixing this can be adapted to update the unmixing filters, with φ applied to the time-domain source estimates (time domain source model).
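A minimal batch sketch of the natural gradient rule for instantaneous mixtures, W ← W + η (I − E[φ(u) u^T]) W with φ = tanh, is given below. The learning rate, iteration count, mixing matrix and Laplacian test sources are assumptions made for illustration, not values from the talk.

```python
import numpy as np

def natural_gradient_ica(x, n_iter=1000, lr=0.1):
    """Batch natural-gradient ICA with a tanh score function.

    x: observations of shape (n_channels, T)
    Returns the estimated unmixing matrix W.
    """
    n, T = x.shape
    W = np.eye(n)
    for _ in range(n_iter):
        u = W @ x                 # current source estimates
        phi = np.tanh(u)          # score function for a super-Gaussian prior
        # Natural gradient step: (I - E[phi(u) u^T]) W
        W += lr * (np.eye(n) - (phi @ u.T) / T) @ W
    return W

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 5000))            # super-Gaussian test sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                 # hypothetical mixing matrix
W = natural_gradient_ica(A @ s)

# Global system: close to a scaled permutation matrix when separation works.
G = W @ A
```

Inspecting `G` rather than `W` makes the scaling and permutation ambiguity of ICA visible: each row should have a single dominant entry.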
Frequency (subband) filtering: time domain modelling
[Figure: the subband unmixing block diagram (x1, x2, x3; L-point STFT; W1 … Wn mapping X(ω) to S(ω); L-point ISTFT; s1, s2, s3) with a source model block and an additional STFT.]
• Time domain modelling, e.g. Lee et al. 1997.
Frequency Domain Source Model
• An alternative strategy is to model the sources in the frequency domain (e.g. Smaragdis 1997).
• Advantages:
  • Computational efficiency
  • Sparser statistics (→ better estimates)
Frequency (subband) filtering: frequency domain modelling
[Figure: the subband unmixing block diagram (x1, x2, x3; L-point STFT; W1 … Wn mapping X(ω) to S(ω); L-point ISTFT; s1, s2, s3).]
• Frequency domain modelling (e.g. Smaragdis 1997).
• Disadvantage: the Permutation Problem.
Solutions to Permutation Problem
Source Modelling Solutions
• Time Domain: no permutation problem (Lee et al. 1997).
• Time-Frequency: couples adaptive filters
  • using signal envelopes (Ikeda et al. 1999) or
  • TF generative models (Mitianoudis & Davies 2001).
• Permutation problem can persist with gradient learning (Davies 2002).
Channel Modelling Solutions
• Constrained Unmixing Filters: couples adaptive filters
  • Heuristic (Smaragdis 1997)
  • Constrained filter model (Parra & Spence 1998)
  • Solutions tend to get trapped in local minima (Ikram & Morgan 2000)
• Directivity patterns to resolve permutation (Kurita et al. 2000)
  • Problems at high frequencies and with high reverberation
Permutation Problem Example
[Figure: separation results for the Mitianoudis & Davies algorithm and the Smaragdis algorithm.]
Two speech signals mixed with a single echo of about 5 ms.
Beamforming for Source Separation
• A traditional approach to microphone array processing is to use beamforming.
• Microphone outputs are combined to amplify signals from a desired direction while suppressing signals from other directions.
• Hence ICA is a blind beamformer!
• Note that beamformer directivity patterns are frequency dependent.
[Figure: narrowband beamformer directivity pattern against direction of arrival, showing the main lobe and the nulls.]
ICA as a Beamformer
[Figure: array geometry with sensor spacing d and angle θ; the null direction is marked.]
• FD-ICA is essentially an FD beamformer: it places nulls toward the other sources, so as to separate one source at a time.
• ICA employs statistical information only.
• Beamforming employs geometrical information, i.e. Directions Of Arrival (DOA).
• One can perform permutation alignment for FD-ICA using DOA, i.e. align the directivity patterns.
Ideal Directivity Patterns
• Single-delay transfer function (~ anechoic room).
• Ideal situation for permutation alignment.
[Figure: ideal directivity pattern; annotations: multiple ripples around c/d Hz; a null around 25°.]
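The ideal single-delay case can be sketched for a two-microphone array as follows. The far-field plane-wave model, the speed of sound, the spacing d = 1 m and the phase sign convention are all assumptions made for illustration; `null_row` builds an unmixing row that places an exact null at a chosen angle.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def phase(freq, theta_deg, d):
    """Inter-microphone phase for a plane wave arriving from theta."""
    return 2 * np.pi * freq * d * np.sin(np.deg2rad(theta_deg)) / C

def null_row(freq, theta0_deg, d):
    """Two-microphone unmixing row with an exact null toward theta0."""
    return np.array([1.0, -np.exp(1j * phase(freq, theta0_deg, d))])

def directivity(w, freq, d, thetas_deg):
    """Magnitude response of row w to plane waves from thetas_deg."""
    return np.abs(w[0] + w[1] * np.exp(-1j * phase(freq, thetas_deg, d)))

def null_directions(w, freq, d, threshold=0.1):
    """Angles of local minima of the pattern that fall below threshold."""
    thetas = np.linspace(-90.0, 90.0, 1801)
    p = directivity(w, freq, d, thetas)
    is_min = (p[1:-1] < p[:-2]) & (p[1:-1] < p[2:]) & (p[1:-1] < threshold)
    return thetas[1:-1][is_min]

d = 1.0  # hypothetical sensor spacing in metres, so c/d = 343 Hz
low = null_directions(null_row(150.0, 25.0, d), 150.0, d)
high = null_directions(null_row(600.0, 25.0, d), 600.0, d)
```

Under these assumptions `low` holds a single null near 25°, while `high` holds several, which is the ambiguity that complicates alignment at higher frequencies.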
A real room experiment
We recorded a 2 microphone - 2 speaker setup in a real lecture room, to explore the application of beamforming to BSS.
[Figure: room layout; labelled dimensions: ~7.5 m, 1.5 m, ~6 m, 2 m, 1 m.]
Real Directivity Patterns
[Figure: directivity pattern for source 1, estimated and aligned by Likelihood Ratio (amplitude-only criterion); DOA around 22°.]
Observations
• More smeared than the single-delay case.
• A main DOA is still apparent.
Questions
• How can we accurately estimate the DOA from a directivity pattern?
• How can we align the permutation to form a consistent beam pattern?
• Can we approximate with a single delay?
DOA estimation ambiguity
• Multiple nulls appear above c/d Hz, which makes the DOA difficult to estimate.
• Saruwatari et al. used null statistics along all frequencies to estimate DOA.
• Ikram and Morgan used only lower frequencies to estimate DOA.
• Estimate the directivity pattern averaged along frequency, for several frequency bands.
• The average directivity pattern between 0 and 2 kHz can give a consistent DOA.
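The band-averaging idea can be sketched as follows: average the directivity patterns of one unmixing row over the bins up to 2 kHz and take the angle of the minimum. The two-microphone plane-wave model, the spacing d = 0.1 m, the 3° null drift used to mimic reverberation and the function names are assumptions, not the authors' implementation.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def directivity(w, freq, d, thetas_deg):
    """Magnitude response of a two-microphone row w to plane waves."""
    phi = 2 * np.pi * freq * d * np.sin(np.deg2rad(thetas_deg)) / C
    return np.abs(w[0] + w[1] * np.exp(-1j * phi))

def estimate_doa(rows, freqs, d, f_max=2000.0):
    """Angle minimising the directivity pattern averaged over 0..f_max."""
    thetas = np.linspace(-90.0, 90.0, 361)
    band = freqs <= f_max
    patterns = [directivity(w, f, d, thetas)
                for w, f in zip(rows[band], freqs[band])]
    mean_pattern = np.mean(patterns, axis=0)
    return thetas[np.argmin(mean_pattern)]

# Synthetic rows: one null per bin, drifting around 22 degrees.
rng = np.random.default_rng(0)
d = 0.1                                   # hypothetical spacing in metres
freqs = np.arange(100.0, 4001.0, 100.0)
null_angles = 22.0 + rng.normal(0.0, 3.0, size=freqs.size)
phi0 = 2 * np.pi * freqs * d * np.sin(np.deg2rad(null_angles)) / C
rows = np.stack([np.ones_like(phi0, dtype=complex), -np.exp(1j * phi0)], axis=1)

doa = estimate_doa(rows, freqs, d)
```

Averaging before searching for the minimum avoids committing to a null position in any single bin, where the pattern may be shallow or smeared.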
DOA estimation ambiguity (cont)
• The exact low-frequency range depends on d.
• For small d, multiple nulls appear only at higher frequencies, but the recorded signals will be more similar, giving low separation quality.
• Sensor spacing choice is a trade-off between separation quality and beampattern clarity.
• For more accurate DOA estimation, one can use extra sensors and subspace methods such as MUSIC (Parra and Alvino 2002).
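As a quick arithmetic check of how the usable low-frequency range moves with the sensor spacing, the sketch below evaluates c/d, the frequency quoted on the slides for the onset of multiple nulls. The speed of sound and the example spacings are assumed values.

```python
C = 343.0  # assumed speed of sound in m/s

def multiple_null_onset_hz(d):
    """Frequency c/d in Hz for a sensor spacing of d metres."""
    return C / d

for d in (0.05, 0.2, 1.0):
    print(f"d = {d:4.2f} m  ->  c/d = {multiple_null_onset_hz(d):7.1f} Hz")
```

Halving the spacing doubles c/d, which widens the unambiguous band at the cost of more similar microphone signals.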
Permutation alignment using DOA
• Basic problem: the nulls drift slightly around the DOA, due to reverberation.
• Solution: look for a null in a "neighbourhood" around the DOA.
  • Not accurate enough.
  • The neighbourhood has to be defined.
  • Classification is really difficult at mid-to-higher frequencies.
• Remedy: use beamforming (phase information) at lower-to-mid frequencies and LR (amplitude information) at mid-to-higher frequencies.
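The neighbourhood search can be sketched for the 2 x 2 case as follows. For each bin, the depth of each row's pattern is measured within ±10° of each DOA, and the rows are swapped if that gives deeper nulls in the expected places. The plane-wave model, the spacing, the DOAs, the neighbourhood width and the function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def directivity(w, freq, d, thetas_deg):
    """Magnitude response of a two-microphone row w to plane waves."""
    phi = 2 * np.pi * freq * d * np.sin(np.deg2rad(thetas_deg)) / C
    return np.abs(w[0] + w[1] * np.exp(-1j * phi))

def null_depth(w, freq, d, doa_deg, width_deg):
    """Smallest response of row w within +/- width_deg of doa_deg."""
    thetas = np.linspace(doa_deg - width_deg, doa_deg + width_deg, 41)
    return directivity(w, freq, d, np.clip(thetas, -90.0, 90.0)).min()

def align_with_doa(W, freqs, d, doas_deg, width_deg=10.0):
    """Order rows so that row 0 nulls source 2 and row 1 nulls source 1."""
    aligned = W.copy()
    for k, f in enumerate(freqs):
        depth = np.array([[null_depth(W[k, r], f, d, doa, width_deg)
                           for doa in doas_deg] for r in range(2)])
        keep = depth[0, 1] + depth[1, 0]   # current row order
        swap = depth[0, 0] + depth[1, 1]   # rows exchanged
        if swap < keep:
            aligned[k] = W[k, ::-1]
    return aligned

def null_row(freq, theta_deg, d):
    """Row with an exact null toward theta_deg (ideal single delay)."""
    phi = 2 * np.pi * freq * d * np.sin(np.deg2rad(theta_deg)) / C
    return np.array([1.0, -np.exp(1j * phi)])

# Synthetic test: ideal rows, randomly permuted in about half of the bins.
d, doas = 0.1, (-30.0, 22.0)               # hypothetical spacing and DOAs
freqs = np.arange(100.0, 2001.0, 100.0)
W_true = np.array([[null_row(f, doas[1], d), null_row(f, doas[0], d)]
                   for f in freqs])
rng = np.random.default_rng(0)
swapped = rng.random(freqs.size) < 0.5
W_perm = W_true.copy()
W_perm[swapped] = W_perm[swapped][:, ::-1]
W_aligned = align_with_doa(W_perm, freqs, d, doas)
```

The synthetic test stays below c/d, where each row has a single null; with multiple nulls or strong null drift the two scores get close, which is where the slides switch to the amplitude-based LR criterion.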
Permutation alignment using DOA: sound samples
[Audio: mixtures; separated using LR; separated using BF.]
Sensitivity analysis
• Effects of a misplaced beamformer: we repeated the recordings with source 2 moved by 50 cm.
• Beamformer's sensitivity to movement: we unmixed the 50 cm recordings and compared the beampatterns.
• We observed the following:
Sensitivity analysis (cont)
A moving source will not greatly affect our beamformer at lower frequencies; the effect is mainly at higher frequencies.
[Figure: beampatterns at 160 Hz and at 750 Hz.]
Sensitivity analysis (cont) Distortion introduced due to movement. We used the original filters to unmix the 50cm case. Distortion is a function of frequency.
Sensitivity analysis (cont) Distortion introduced due to movement. The source that moved can still be separated, but is a bit more echoic due to incorrect mapping. The source that didn’t move won’t be separated due to incorrect beamforming. It will still be mapped back correctly.
Conclusions
• Beamforming is a useful tool for permutation alignment.
• It is a semi-blind method, since it exploits the known array configuration.
• Problems arise when aligning at higher frequencies.
• Phase information seems more important at lower frequencies, and amplitude information at higher frequencies (Lord Rayleigh's Law of Hearing).
• Distortion introduced due to movement is a function of frequency.