A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds Paris Smaragdis, Madhusudana Shashanka, Bhiksha Raj NIPS 2009
Introduction • Problem: single channel signal separation • Separating out the signals of individual sources from a mixed recording • General approach • Derive a generalizable model that captures the salient features of each source • Separation is achieved by extracting components from the mixed signal that conform to the characterization of the individual sources
Physical Intuition • Recover the sources by reweighting the frequency subbands of a single mixture recording
Latent Variable Model • Given the magnitude spectrogram of a single source, each spectral frame is modeled as a histogram of repeated draws from a multinomial distribution over the frequency bins • At a given time frame t, Pt(f) represents the probability of drawing frequency f • The model assumes that Pt(f) is composed of bases indexed by a latent variable z: Pt(f) = Σz Pt(z) P(f|z)
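A minimal sketch of this generative view (the sizes and array names are illustrative, not from the paper): each frame distribution is a convex combination of spectral bases, and an observed spectral frame is a histogram of multinomial draws from it.

```python
import numpy as np

rng = np.random.default_rng(0)

F, Z = 64, 8                            # frequency bins, number of bases (illustrative)
P_f_given_z = rng.random((F, Z))
P_f_given_z /= P_f_given_z.sum(axis=0)  # each basis is a distribution over f
P_z = rng.dirichlet(np.ones(Z))         # mixture weights Pt(z) for one frame t

P_f = P_f_given_z @ P_z                 # Pt(f) = sum_z Pt(z) P(f|z)

# A spectral frame is a histogram of N repeated multinomial draws from Pt(f)
N = 1000
frame = rng.multinomial(N, P_f)
print(frame.shape, frame.sum())         # (64,) 1000
```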
Latent Variable Model (Contd.) • Now let the matrix VF×T with entries vft represent the magnitude spectrogram of the mixture sound, and let vt represent time frame t (the t-th column vector of V) • First, assume we already have a trained model in the form of basis vectors Ps(f|z) • These bases represent a dictionary of spectra that best describe each source
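For concreteness, a magnitude spectrogram such as V can be computed with a short-time Fourier transform; the sample rate, window length, and stand-in signal below are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft

fs = 16000                                # sample rate (illustrative)
x = np.random.randn(fs)                   # one second of a stand-in signal
f, t, Zxx = stft(x, fs=fs, nperseg=1024)  # complex STFT
V = np.abs(Zxx)                           # magnitude spectrogram, F x T
print(V.shape)
```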
Source separation • Decompose a new mixture of these known sources in terms of the contributions of the dictionaries of each source: Pt(f) = Σs Pt(s) Σz Pt(z|s) Ps(f|z) • Use the EM algorithm to estimate the mixture weights Pt(z|s) and the source priors Pt(s) • The reconstruction of the contribution of source s in the mixture is given by reweighting the mixture with the relative contribution of that source's bases: vft(s) = vft · Pt(s) Σz Pt(z|s) Ps(f|z) / Σs′ Pt(s′) Σz Pt(z|s′) Ps′(f|z)
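A compact sketch of this estimation loop, assuming the bases of all sources are stacked as columns of a single matrix W (the function names, iteration cap, and `cols` indexing are mine, not the paper's):

```python
import numpy as np

def fit_weights(V, W, n_iter=200, eps=1e-12):
    """EM for the per-frame weights with the bases W held fixed.
    V: (F, T) mixture magnitude spectrogram; W: (F, K) column-normalized bases.
    Returns H: (K, T), each column a distribution over dictionary entries."""
    K, T = W.shape[1], V.shape[1]
    Vn = V / (V.sum(axis=0, keepdims=True) + eps)  # frames as distributions over f
    H = np.full((K, T), 1.0 / K)
    for _ in range(n_iter):
        R = Vn / (W @ H + eps)   # E-step: data/model ratio
        H *= W.T @ R             # M-step: reweight by expected counts
        H /= H.sum(axis=0, keepdims=True) + eps
    return H

def reconstruct(V, W, H, cols, eps=1e-12):
    """Reweight the mixture by the relative contribution of the bases in `cols`
    (the dictionary entries belonging to one source)."""
    return V * (W[:, cols] @ H[cols, :]) / (W @ H + eps)
```

`reconstruct` is the masking form of the reconstruction formula on this slide: the mixture spectrogram is scaled, per time-frequency cell, by the fraction of the model explained by one source's bases.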
Contribution of this paper • Use the training data directly as the dictionary • The authors argue that, given a sufficiently large collection of data from a source, the best possible characterization of that data is the data itself (e.g., non-parametric density estimation with Parzen windows) • Side-steps the need for a separate model training step • The large dictionary provides a better description of the sources than less expressive learned basis models • Source estimates are guaranteed to lie on the source manifold, whereas trained approaches can produce arbitrary outputs that are not necessarily plausible source estimates
Using Training Data as the Dictionary • Use each frame of the spectrograms of the training sequences as the bases Ps(f|z) • Let W(s) be the training spectrogram from source s; the latent variable z for source s then takes T(s) values, and the z-th basis is the (normalized) z-th column vector of W(s) • With this model one would ideally want to use one dictionary element per source at any point in time • This ensures the output lies on the source manifold • It is similar to a nearest neighbor model (but an explicit search is computationally very expensive) • Instead, the authors propose using sparsity
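Under this non-parametric scheme, "training" reduces to normalizing the columns of each training spectrogram; a sketch (the names below are mine, with the dictionary playing the role of the paper's W(s)):

```python
import numpy as np

def make_dictionary(V_train, eps=1e-12):
    """Each frame of the training spectrogram becomes one basis Ps(f|z):
    the z-th basis is the z-th column, normalized to a distribution over f."""
    return V_train / (V_train.sum(axis=0, keepdims=True) + eps)

# Stacking the per-source dictionaries; z then takes T(s) values for source s:
# W = np.hstack([make_dictionary(V_s) for V_s in training_spectrograms])
```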
Entropic prior • Given a probability distribution θ, the entropic prior is defined as P(θ) ∝ e^(−αH(θ)), where H(θ) = −Σi θi log θi is the entropy • α is a weighting factor that determines the level of sparsity • A sparse representation has low entropy (since only a few elements are "active") • Imposing this prior during MAP estimation drives the entropy down during estimation, which results in a sparse representation of θ
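A small numeric illustration of why low entropy corresponds to sparsity (the value α = 5 is arbitrary): the sparse distribution receives a far larger prior weight e^(−αH(θ)) than the uniform one.

```python
import numpy as np

def entropy(theta, eps=1e-12):
    theta = np.asarray(theta, dtype=float)
    return -np.sum(theta * np.log(theta + eps))

alpha = 5.0
candidates = {"uniform": np.full(4, 0.25),
              "sparse":  np.array([0.97, 0.01, 0.01, 0.01])}
for name, th in candidates.items():
    H = entropy(th)
    print(f"{name}: H = {H:.3f}, prior weight exp(-alpha*H) = {np.exp(-alpha * H):.4f}")
```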
Sparse approximation • We would like to minimize the entropies of both the source dependent mixture weights Pt(z|s) and the source priors Pt(s) at every frame • However, by the chain rule for entropy, H(Pt(z,s)) = H(Pt(z|s)) + H(Pt(s)) • Thus reducing the entropy of the joint distribution Pt(z,s) is equivalent to reducing both the conditional entropy of the source dependent mixture weights and the entropy of the source priors
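The identity invoked here is the chain rule for entropy, H(z, s) = H(z|s) + H(s); a quick numeric check on a random joint distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
P_zs = rng.random((10, 3))   # joint over z (10 values) and s (3 sources)
P_zs /= P_zs.sum()

P_s = P_zs.sum(axis=0)                            # marginal P(s)
H_joint = -np.sum(P_zs * np.log(P_zs))
H_s = -np.sum(P_s * np.log(P_s))
H_z_given_s = -np.sum(P_zs * np.log(P_zs / P_s))  # H(z|s) = -sum P(z,s) log P(z|s)
print(np.isclose(H_joint, H_z_given_s + H_s))     # True
```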
Sparse approximation • The model written in terms of this joint parameter is given by Pt(f) = Σs Σz Pt(z,s) Ps(f|z) • To impose sparsity we apply the entropic prior P(Pt(z,s)) ∝ e^(−αH(Pt(z,s))) • Apply EM to estimate Pt(z,s) • The reconstructed source is given by vft(s) = vft · Σz Pt(z,s) Ps(f|z) / Σs′ Σz Pt(z,s′) Ps′(f|z)
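With the prior in place, only the M-step of EM changes: it must maximize Σk ωk log θk − αH(θ) over the simplex, where ω holds the expected counts from the E-step. Brand (1999) solves the stationarity condition ωk/θk + α(log θk + 1) = λ in closed form via the Lambert W function; the sketch below instead uses a plain fixed-point iteration with bisection on λ, which is my simplification for illustration, not the paper's implementation.

```python
import numpy as np

def entropic_m_step(omega, alpha, n_iter=50, eps=1e-12):
    """Approximate MAP update of theta under the entropic prior exp(-alpha*H(theta)),
    given expected counts omega. Iterates theta_k = omega_k / (lam - alpha*(log theta_k + 1)),
    with lam chosen by bisection so that theta sums to 1."""
    omega = np.asarray(omega, dtype=float) + eps
    theta = omega / omega.sum()                  # maximum-likelihood initialization
    for _ in range(n_iter):
        c = alpha * (np.log(theta) + 1.0)
        lo, hi = c.max() + eps, c.max() + omega.sum() / eps
        for _ in range(100):                     # bisection on the multiplier lam
            lam = 0.5 * (lo + hi)
            if np.sum(omega / (lam - c)) > 1.0:
                lo = lam
            else:
                hi = lam
        theta = omega / (lam - c)
        theta /= theta.sum()
    return theta

omega = np.array([4.0, 3.0, 2.0, 1.0])
print(entropic_m_step(omega, alpha=0.0))   # ~ omega / omega.sum()
print(entropic_m_step(omega, alpha=20.0))  # visibly sparser weights
```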
Comments • The use of sparsity ensures that the output is a plausible speech signal, free of artifacts such as distortion and musical noise • An unfortunate side effect is the need for a very large dictionary • However, a significant reduction in dictionary size can be achieved by using an energy threshold to select only the loudest frames of the training spectrogram as bases • Outperforms trained basis models of the same size
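One way to realize that dictionary reduction, sketched here with a top-fraction rank rather than an absolute energy threshold (the 10% retention rate is an arbitrary example, not the paper's setting):

```python
import numpy as np

def prune_dictionary(V_train, keep_fraction=0.10):
    """Keep only the loudest training frames as bases: rank frames by total
    magnitude and retain the top fraction, then normalize to Ps(f|z)."""
    energy = V_train.sum(axis=0)                   # per-frame energy proxy
    k = max(1, int(keep_fraction * V_train.shape[1]))
    loudest = np.argsort(energy)[-k:]              # indices of the k loudest frames
    W = V_train[:, loudest]
    return W / W.sum(axis=0, keepdims=True)
```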