Data-Adaptive Source Separation for Audio Spatialization M. Tech. project presentation by Pradeep Gaddipati 08307029 Supervisors: Prof. Preeti Rao and Prof. V. Rajbabu
Outline • Problem statement • Audio spatialization • Source separation • Data-adaptive TFR • Concentration measure (sparsity) • Reconstruction of signal from TFR • Performance evaluation • Data-adaptive TFR for sinusoid detection • Conclusions and future work
Problem statement • Spatial audio – surround sound • commonly used in movies, gaming, etc. • creates a suspension of disbelief • effective when the playback device is located at a considerable distance from the listener • Mobile phones • headphones – used for playback • spatial audio – ineffective over headphones • lacks body-reflection cues – leads to in-the-head localization • content cannot be re-recorded – hence the need for audio spatialization
Audio spatialization • Audio spatialization – a spatial rendering technique for conversion of the available audio into desired listening configuration • Analysis – separating individual sources • Re-synthesis – re-creating the desired listener-end configuration
Source separation Figure: three sources mixed into stereo mixtures (Source 1, Source 2, Source 3) • Source separation – obtaining estimates of the underlying sources from a set of sensor observations • Processing chain: • Time-frequency transform • Source analysis – estimation of mixing parameters • Source synthesis – estimation of sources • Inverse time-frequency transform
Mixing model • Anechoic mixing model • mixtures, xi • sources, sj • Under-determined (M < N) • M = number of mixtures • N = number of sources • Mixing parameters • attenuation parameters, aij • delay parameters, δij Figure: Anechoic mixing model – audio is observed at the microphones with differing intensities and arrival times (because of propagation delays) but with no reverberation. Source: P. O. Grady, B. Pearlmutter and S. Rickard, “Survey of sparse and non-sparse methods in source separation,” International Journal of Imaging Systems and Technology, 2005.
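As a concrete illustration of the anechoic model above, the following minimal NumPy sketch builds mixtures x_i[n] = Σ_j a_ij s_j[n − δ_ij]. The function name and the use of integer-sample delays are assumptions made for illustration, not part of the original work.

```python
import numpy as np

def anechoic_mix(sources, a, delta):
    """Anechoic mixing: x_i[n] = sum_j a_ij * s_j[n - delta_ij] (delays in samples).

    sources : (N, T) array of source signals s_j
    a       : (M, N) attenuation parameters a_ij
    delta   : (M, N) non-negative integer delays delta_ij
    """
    N, T = sources.shape
    M = a.shape[0]
    x = np.zeros((M, T))
    for i in range(M):
        for j in range(N):
            d = int(delta[i, j])
            # attenuated, delayed copy of source j as seen at sensor i
            x[i, d:] += a[i, j] * sources[j, :T - d]
    return x
```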
Source analysis (estimation of mixing parameters) • Time-frequency representation of mixtures • Requirement for source separation [1] • W-disjoint orthogonality
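For reference, the W-disjoint orthogonality requirement of [1] can be written as the condition below, where S_j(τ, ω) denotes the time-frequency transform of source s_j (notation assumed from the cited paper).

```latex
% At most one source is active in any time-frequency bin:
S_j(\tau,\omega)\, S_k(\tau,\omega) = 0
\qquad \forall\, j \neq k,\ \forall\, (\tau,\omega)
```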
Source analysis (estimation of mixing parameters) • For every time-frequency bin • estimate the mixing parameters [1] • Create a 2-dimensional histogram • peaks indicate the mixing parameters
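A simplified sketch of this per-bin estimation is given below, based on the inter-channel ratio used in [1]; the symmetric-attenuation mapping and histogram smoothing of the actual method are omitted, and the function and variable names are assumptions.

```python
import numpy as np

def mixing_parameter_histogram(X1, X2, omega, n_bins=50):
    """Per-bin attenuation/delay estimates and their 2-D histogram.

    X1, X2 : STFTs of the two mixture channels (frequency x time)
    omega  : angular frequency of each STFT row, in rad/sample
    """
    ratio = X2 / (X1 + 1e-12)                            # inter-channel ratio per T-F bin
    alpha = np.abs(ratio)                                # attenuation estimate
    delta = -np.angle(ratio) / (omega[:, None] + 1e-12)  # delay estimate
    # weight each bin by its energy so reliable bins dominate the histogram
    w = (np.abs(X1) * np.abs(X2)).ravel()
    hist, a_edges, d_edges = np.histogram2d(alpha.ravel(), delta.ravel(),
                                            bins=n_bins, weights=w)
    return hist, a_edges, d_edges                        # histogram peaks give (a_j, delta_j)
```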
Source synthesis (estimation of sources) Figure: time-frequency masks applied to the mixture yield the estimated sources (Source 1, Source 2, Source 3)
Source synthesis (estimation of sources) • Source estimation techniques • degenerate unmixing technique (DUET) [1] • lq-basis pursuit (LQBP) [2] • delay and scale subtraction scoring (DASSS) [3]
Source synthesis (DUET) • Every time-frequency bin of the mixture is assigned to one of the sources, based on a distance measure between that bin and each source’s estimated mixing parameters
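A minimal sketch of this assignment step, using the likelihood-style distance |a_j e^{−iωδ_j} X1 − X2|² / (1 + a_j²) associated with DUET [1]; the exact distance used in the project is not stated on the slide, so treat this form as an assumption.

```python
import numpy as np

def duet_masks(X1, X2, a, delta, omega):
    """Binary masks assigning each T-F bin to the closest source (DUET-style).

    a, delta : length-N arrays of estimated attenuation / delay per source
    omega    : angular frequency of each STFT row, in rad/sample
    """
    w = omega[:, None]
    # distance of every T-F bin to each source's anechoic model
    dist = np.stack([
        np.abs(a[j] * np.exp(-1j * w * delta[j]) * X1 - X2) ** 2 / (1.0 + a[j] ** 2)
        for j in range(len(a))
    ])
    best = np.argmin(dist, axis=0)                 # winning source index per bin
    return [(best == j) for j in range(len(a))]    # apply mask j to the mixture STFT
```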
Source synthesis (LQBP) • Relaxes the assumption of WDO – assumes at most ‘M’ sources present at each T-F bin • M = no. of mixtures, N = no. of sources, (M < N) • lq measure decides which ‘M’ sources are present
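The per-bin selection can be sketched as follows: for every T-F bin, each combination of M out of N sources is solved for exactly, and the combination whose coefficients have the smallest l_q quasi-norm (q < 1) is kept. This is a simplified reading of [2]; the names and the value of q are illustrative.

```python
import numpy as np
from itertools import combinations

def lqbp_bin(x, A, q=0.5):
    """Resolve one T-F bin assuming at most M of the N sources are active.

    x : (M,) mixture STFT values at this bin
    A : (M, N) complex mixing matrix (a_ij * exp(-1j*omega*delta_ij)) at this bin
    """
    M, N = A.shape
    best_cost, best_c = np.inf, np.zeros(N, dtype=complex)
    for idx in combinations(range(N), M):
        try:
            c = np.linalg.solve(A[:, idx], x)      # exact M-source explanation of x
        except np.linalg.LinAlgError:
            continue
        cost = np.sum(np.abs(c) ** q)              # l_q measure, q < 1 favours sparsity
        if cost < best_cost:
            best_cost = cost
            best_c = np.zeros(N, dtype=complex)
            best_c[list(idx)] = c
    return best_c                                  # zeros for the inactive sources
```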
Source synthesis (DASSS) • Identifies which bins have only one dominant source • uses DUET for those bins • assumes at most ‘M’ sources are present in the rest of the bins • an error threshold decides which ‘M’ sources are present
Inverse time-frequency transform Figure: original and estimated sources (Source 1, Source 2, Source 3) recovered from the stereo mixtures
Scope for improvement • Requirement for source separation • W-disjoint orthogonality (WDO) amongst the sources • The sparser the TFR of the mixtures [4] • the less the overlap amongst the sources (i.e. the higher the WDO) • the easier their separation
Data-adaptive TFR • For music/speech signals • different components (harmonics/transients/modulations) occur at different time-instants • the best window differs for different components • this suggests using a data-dependent, time-varying window function to achieve high sparsity [6] • To obtain a sparser TFR of the mixture • use different analysis window lengths at different time-instants, choosing the one which gives maximum sparsity (see the sketch below)
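A minimal sketch of this adaptation, assuming a fixed 10 ms hop, Hamming windows, and kurtosis of the magnitude spectrum as the concentration measure (choices taken from the figure captions in this deck; the framing details and function name are assumptions):

```python
import numpy as np
from scipy.signal import get_window

def choose_windows(x, fs, win_ms=(30, 60, 90), hop_ms=10):
    """For each analysis instant, keep the window length giving the sparsest spectrum."""
    hop = int(fs * hop_ms / 1000)
    lengths = [int(fs * w / 1000) for w in win_ms]
    max_len = max(lengths)
    n_frames = (len(x) - max_len) // hop
    choices = []
    for m in range(n_frames):
        centre = m * hop + max_len // 2
        best_len, best_kurt = lengths[0], -np.inf
        for L in lengths:
            frame = x[centre - L // 2 : centre - L // 2 + L]
            spec = np.abs(np.fft.rfft(frame * get_window('hamming', L)))
            # kurtosis of the spectral magnitudes as the concentration measure
            kurt = np.mean(spec ** 4) / (np.mean(spec ** 2) ** 2 + 1e-12)
            if kurt > best_kurt:
                best_kurt, best_len = kurt, L
        choices.append(best_len)
    return choices   # per-frame window lengths defining the data-adaptive TFR
```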
Data-adaptive TFR Figure: data-adaptive time-frequency representation of a singing voice; window function = Hamming; window sizes = 30, 60 and 90 ms; hop size = 10 ms; concentration measure = kurtosis
Sparsity measure (concentration measure) • What is sparsity? • a small number of coefficients contains a large proportion of the energy • Common sparsity measures [5] • Kurtosis • Gini Index • Which sparsity measure to use for adaptation? • the one which shows the same trend as WDO as a function of analysis window size
WDO and sparsity (some formulae) • W-disjoint orthogonality [4] • Kurtosis • Gini Index
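For reference, the standard definitions from the cited references can be stated as below, where M_j denotes the T-F mask for source j and Y_j the sum of its interferers; these definitions are assumed from [4] and [5] rather than copied from this deck.

```latex
% WDO measure of a mask M_j for source S_j, with interferers Y_j = \sum_{k \neq j} S_k  [4]
\mathrm{PSR}_j = \frac{\lVert M_j S_j \rVert^2}{\lVert S_j \rVert^2}, \quad
\mathrm{SIR}_j = \frac{\lVert M_j S_j \rVert^2}{\lVert M_j Y_j \rVert^2}, \quad
\mathrm{WDO}_j = \mathrm{PSR}_j - \frac{\mathrm{PSR}_j}{\mathrm{SIR}_j}

% Kurtosis of the T-F coefficients c (larger = more peaked, i.e. sparser)
\kappa = \frac{\mathbb{E}\!\left[|c|^{4}\right]}{\left(\mathbb{E}\!\left[|c|^{2}\right]\right)^{2}}

% Gini index for coefficients sorted as |c_{(1)}| \le \dots \le |c_{(K)}|  [5]
G = 1 - 2\sum_{k=1}^{K}\frac{|c_{(k)}|}{\lVert c \rVert_{1}}\,\frac{K-k+\tfrac{1}{2}}{K}
```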
Dataset description • Dataset : BSS oracle • Sampling frequency : 22050 Hz • 10 sets each of music and speech signals • One set : 3 signals • Duration : 11 seconds
WDO and sparsity • WDO vs. window size • obtain the TFR of the sources in a set • obtain source-masks based on the magnitudes of the TFRs in each of the T-F bins • using the source-masks and the TFRs of the sources, obtain the WDO measure • NOTE: in case of the data-adaptive TFR, obtain the TFR of the sources using the window sequence obtained from the adaptation of the mixture • Sparsity vs. window size • obtain the TFR of one of the channels of the mixture • calculate the frame-wise sparsity of this TFR
WDO and sparsity (observations) • The highest sparsity (kurtosis/Gini Index) is obtained when the data-adaptive TFR is used • The highest WDO is obtained using the data-adaptive TFR (with kurtosis as the adaptation criterion) • Kurtosis is observed to follow a similar trend to that of WDO
Inverse data-adaptive TFR • Constraint (introduced by source separation) • the TFR should be invertible • Solution • select analysis windows such that they satisfy the constant overlap-add (COLA) criterion [7] • Techniques • transition window • modified (extended) window
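As an illustration, the COLA condition for a given window/hop pair can be checked with SciPy; the window sizes and hop below are the ones used in the experiments, and in general an arbitrary per-frame choice of window length at a fixed hop will not satisfy COLA, which is what motivates the transition-window and modified-window techniques.

```python
from scipy.signal import check_COLA, get_window

fs = 22050
hop = int(0.010 * fs)                        # 10 ms hop, as used in the experiments
for win_ms in (30, 60, 90):
    nperseg = int(win_ms / 1000 * fs)
    window = get_window('hamming', nperseg)
    # True only if windows spaced `hop` samples apart sum to a constant
    ok = check_COLA(window, nperseg, nperseg - hop)
    print(f"{win_ms} ms Hamming window, 10 ms hop: COLA = {ok}")
```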
Problems with reconstruction • Transition-window technique • adaptation is carried out only on alternate frames • the WDO obtained amongst the underlying sources is lower • Modified (extended) window technique • the extended window has larger side-lobes than a normal Hamming window • signal energy spreads into neighboring bins • the WDO measure decreases
Dataset description • Dataset – BSS oracle • Mixtures per set (72 = 24 x 3) • attenuation parameters (24 = 4P3) • {100, 300, 600, 800} • Delay parameters • {(0, 0, 0), (0, 1, 2), (0, 2, 1)} • A total of 720 (72 x 10) mixtures (test cases) for each of the music and speech groups
Performance (source estimation) • Estimate the source-masks using one of the source estimation techniques (DUET or LQBP) • Using the set of estimated source-masks and the TFRs of the original sources, calculate the WDO measure of each source-mask (a sketch follows below) • The WDO measure indicates how well the mask • preserves the source of interest • suppresses the interfering sources
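A minimal sketch of this evaluation, assuming the WDO definition from [4] written out on the formulae slide above (function and argument names are illustrative):

```python
import numpy as np

def wdo_of_mask(mask, S_target, S_interf):
    """WDO measure of a binary T-F mask: PSR - PSR/SIR.

    mask     : binary mask estimated for the source of interest
    S_target : STFT of the source of interest
    S_interf : STFT of the sum of the interfering sources
    """
    preserved    = np.sum(np.abs(mask * S_target) ** 2)
    interference = np.sum(np.abs(mask * S_interf) ** 2)
    psr = preserved / (np.sum(np.abs(S_target) ** 2) + 1e-12)   # preserved-signal ratio
    sir = preserved / (interference + 1e-12)                    # signal-to-interference ratio
    return psr - psr / sir    # near 1: target preserved and interferers suppressed
```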
Data-adaptive TFR (for sinusoid detection) Figure: data-adaptive time-frequency representation of a singing voice; window function = Hamming; window sizes = 20, 40 and 60 ms; hop size = 10 ms; concentration measure = kurtosis; frequency range = 1000 to 3000 Hz
Conclusions • Mixing model – anechoic • Kurtosis can be used as the adaptation criterion for the data-adaptive TFR • The data-adaptive TFR provides a higher WDO measure amongst the underlying sources than a fixed-window STFT • Better estimates of the mixing parameters and the sources are obtained using the data-adaptive TFR • The performance of DUET is better than that of LQBP
Future work • Testing of the DASSS source estimation technique • Reconstruction of the signal from the TFR • Need to consider a more realistic mixing model to account for reverberation effects, such as an echoic mixing model
Acknowledgments I would like to thank Nokia, India for providing financial support and technical inputs for the work reported here
References • A. Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing n sources from 2 mixtures,” IEEE Conference on Acoustics, Speech and Signal Processing, 2000 • R. Saab, O. Yilmaz, M. J. Mckeown and R. Abugharbieh, “Underdetermined anechoic blind source separation via lq basis pursuit with q<1,” IEEE Transactions on Signal Processing, 2007 • A. S. Master, “Bayesian two source modelling for separation of N sources from stereo signal,” IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 281-284, 2004
References • S. Rickard, “Sparse sources are separated sources,” European Signal Processing Conference, 2006 • N. Hurley and S. Rickard, “Comparing measures of sparsity,” IEEE Transactions on Information Theory, 2009 • D. L. Jones and T. Parks, “A high resolution data-adaptive time-frequency representation,” IEEE Transactions on Acoustics, Speech and Signal Processing, 1990 • P. Basu, P. J. Wolfe, D. Rudoy, T. F. Quatieri and B. Dunn, “Adaptive short-time analysis-synthesis for speech enhancement,” IEEE Conference on Acoustics, Speech and Signal Processing, 2008
Thank you Questions?