A Framework for Complex Probabilistic Latent Semantic Analysis and its Application to Single-Channel Source Separation. Brian King, bbking@uw.edu. Advised by Les Atlas, Electrical Engineering, University of Washington. This research was funded by the Air Force Office of Scientific Research.
Problem Statement • Develop a theoretical framework for complex probabilistic latent semantic analysis (CPLSA) and its application in single-channel source separation Intro Background Current Proposed
Outline • Introduction • Background • My current contributions • Proposed work
Nonnegative Matrix Factorization (NMF) • Factors the magnitude spectrogram Xf,t (frequency f × time t) into nonnegative bases Bf,k (frequency × basis index k) and weights Wk,t (basis index × time): Xf,t ≈ Σk Bf,k Wk,t [1] D.D. Lee and H.S. Seung, “Algorithms for Non-Negative Matrix Factorization,” Neural Information Processing Systems, 2001, pp. 556-562.
Using Matrix Factorization for Source Separation • Training: take the STFT* of each individual signal (xindiv → Xindiv) and find its bases • Separation: take the STFT of the mixture (xmixed → Xmixed), find the weights given the bases B, W, separate into Y1 and Y2, and apply the ISTFT** to obtain y1 and y2 *Short-Time Fourier Transform **Inverse Short-Time Fourier Transform
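The STFT → separate → ISTFT pipeline on this slide can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the per-source masks stand in for the "find weights / separation" stage, whose details come from the factorization.

```python
import numpy as np
from scipy.signal import stft, istft

def mask_separate(x_mixed, masks, fs=16000, nperseg=512):
    """Transform the mixture to the STFT domain, apply one
    (frequency x time) mask per source, and invert each result."""
    _, _, X = stft(x_mixed, fs=fs, nperseg=nperseg)
    sources = []
    for M in masks:
        _, y = istft(M * X, fs=fs, nperseg=nperseg)
        sources.append(y)
    return sources
```

With masks that sum to one, the separated signals sum back to the mixture, mirroring the additivity the later slides care about.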
Using Matrix Factorization for Synthesis / Source Separation • Synthesis: bases Bf,k and weights Wk,t combine into a synthesized signal Xf,t • Source separation: partition the bases and weights by source (B = [B1 B2], W = [W1; W2]), so B1W1 → Y1 and B2W2 → Y2, the separated signals
NMF Cost Function: Frobenius Norm with Sparsity • minB,W ||X − BW||F² + λ||W||1, where Xf,t is the observed magnitude spectrogram, Bf,k the bases, and Wk,t the weights • The squared Frobenius norm measures reconstruction error; the L1 term encourages sparse weights
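The cost above is typically minimized with multiplicative updates; a minimal sketch follows, using the standard Lee-Seung rules with the L1 penalty folded into the weight update's denominator. Function name and defaults are illustrative, not from the thesis.

```python
import numpy as np

def sparse_nmf(X, K, lam=0.1, n_iter=200, seed=0):
    """Minimize ||X - BW||_F^2 + lam*||W||_1 by multiplicative updates."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    B = rng.random((F, K)) + 1e-3
    W = rng.random((K, T)) + 1e-3
    eps = 1e-12  # guard against division by zero
    for _ in range(n_iter):
        # Lee-Seung update for the bases (Frobenius term only)
        B *= (X @ W.T) / (B @ W @ W.T + eps)
        # Weight update: the L1 penalty lam enters the denominator
        W *= (B.T @ X) / (B.T @ B @ W + lam + eps)
    return B, W
```

Multiplicative updates keep B and W nonnegative automatically, since every factor in the update is nonnegative.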
Probabilistic Latent Semantic Analysis (PLSA) • Views the magnitude spectrogram as a joint probability distribution [2] M. Shashanka, B. Raj, and P. Smaragdis, “Probabilistic Latent Variable Models as Nonnegative Factorizations,” Computational Intelligence and Neuroscience, vol. 2008, 2008, pp. 1-9.
Probabilistic Latent Semantic Analysis (PLSA) • Uses the following generative model • Pick a time, P(t) • Pick a base from that time, P(k|t) • Pick a frequency of that base, P(f|k) • Increment the chosen (f,t) by one • Repeat • Can be written as P(f,t) = P(t) Σk P(f|k) P(k|t)
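The model on this slide is fit by expectation-maximization; a minimal sketch, assuming the spectrogram V plays the role of observed counts (the naming and defaults here are illustrative):

```python
import numpy as np

def plsa(V, K, n_iter=100, seed=0):
    """EM for the PLSA model P(f,t) = P(t) * sum_k P(f|k) P(k|t)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    Pf_k = rng.random((F, K)); Pf_k /= Pf_k.sum(0)   # P(f|k), columns sum to 1
    Pk_t = rng.random((K, T)); Pk_t /= Pk_t.sum(0)   # P(k|t), columns sum to 1
    Pt = V.sum(0) / V.sum()                          # P(t), fixed by the data
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: posterior over latent bases, P(k|f,t) ~ P(f|k) P(k|t)
        post = Pf_k[:, :, None] * Pk_t[None, :, :]   # shape (F, K, T)
        post /= post.sum(1, keepdims=True) + eps
        # M-step: reweight the posterior by the observed "counts" V
        acc = post * V[:, None, :]
        Pf_k = acc.sum(2); Pf_k /= Pf_k.sum(0) + eps
        Pk_t = acc.sum(0); Pk_t /= Pk_t.sum(0) + eps
    return Pf_k, Pk_t, Pt
```

Because every factor is a normalized distribution, nonnegativity is built in rather than imposed, which is the extensibility advantage the next slides discuss.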
Probabilistic Latent Semantic Analysis (PLSA) • Relationship to NMF • P(t) is the sum of all magnitude at time t • P(k|t) similar to weight matrix Wk,t • P(f|k) similar to base matrix Bf,k • NMF: Xf,t ≈ Σk Bf,k Wk,t • PLSA: P(f,t) = P(t) Σk P(f|k) P(k|t)
Probabilistic Latent Semantic Analysis • Advantage of PLSA over NMF: Extensibility • A tremendous amount of applicable literature on generative models • Entropic priors [2] • HMMs with state-dependent dictionaries [6] [2] M. Shashanka, B. Raj, and P. Smaragdis, “Probabilistic Latent Variable Models as Nonnegative Factorizations,” Computational Intelligence and Neuroscience, vol. 2008, 2008, pp. 1-9. [6] G.J. Mysore, “A Non-Negative Framework for Joint Modeling of Spectral Structures and Temporal Dynamics in Sound Mixtures,” PhD Thesis, Stanford University, 2010.
… but superposition? • (Figure: two original sources, their mixture, the proper separation, and the incorrect NMF separation) • Magnitude spectrograms do not add: in general |X1 + X2| ≠ |X1| + |X2|, so magnitude-only NMF can separate incorrectly
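The superposition problem on this slide can be shown with two complex numbers: when the sources' phases disagree, the mixture's magnitude is smaller than the sum of the sources' magnitudes.

```python
import numpy as np

# Two "sources" in a single STFT bin: equal magnitude, opposite phase
# in the first bin, aligned phase in the second.
x1 = np.array([1 + 0j, 2 + 0j])
x2 = np.array([-1 + 0j, 2 + 0j])
mix = x1 + x2

print(np.abs(mix))               # [0. 4.]  <- bin 1 cancels entirely
print(np.abs(x1) + np.abs(x2))   # [2. 4.]  <- what magnitude NMF assumes
```

Magnitudes add only when phases align, which is exactly the assumption that magnitude-domain factorizations make and complex factorizations avoid.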
CMF Cost Function: Frobenius Norm with Sparsity • minB,W,θ Σf,t |Xf,t − Σk Bf,k Wk,t e^{jθk,f,t}|² + λ||W||1, where Xf,t is the observed complex spectrogram, Bf,k the nonnegative bases, Wk,t the weights, and θk,f,t the basis-dependent phases [3] H. Kameoka, N. Ono, K. Kashino, and S. Sagayama, “Complex NMF: A New Sparse Representation for Acoustic Signals,” International Conference on Acoustics, Speech, and Signal Processing, 2009.
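As a sanity check on the notation, the CMF objective can be evaluated directly; this snippet only computes the cost (the optimization in [3] uses an auxiliary-function method not shown here), and the tensor layout is an illustrative choice.

```python
import numpy as np

def cmf_cost(X, B, W, theta, lam):
    """Complex NMF objective:
    sum_{f,t} |X_{f,t} - sum_k B_{f,k} W_{k,t} e^{j theta_{k,f,t}}|^2
    + lam * ||W||_1, with theta of shape (K, F, T)."""
    recon = np.einsum('fk,kt,kft->ft', B, W, np.exp(1j * theta))
    return np.sum(np.abs(X - recon) ** 2) + lam * np.sum(np.abs(W))
```

Note that each basis carries its own full time-frequency phase field, which is where the parameter count (and the overparameterization discussed next) comes from.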
Comparing NMF and CMF via ASR: Introduction • Data • Boston University news corpus [7] • 150 utterances (72 minutes) • Two talkers synthetically mixed at 0 dB target/masker ratio • 1 minute each of clean speech used for training • Recognizers • Sphinx-3 (CMU) • SRI [7] M. Ostendorf, “The Boston University Radio Corpus,” 1995.
Comparing NMF and CMF via ASR: Results • (Figure: word accuracy % for unprocessed, non-negative, and complex matrix factorization; error bars mark the 95% confidence level)
Comparing NMF and CMF via ASR: Conclusion • Incorporating phase estimates into matrix factorization can improve source separation performance • Complex matrix factorization is worth further research [4] B. King and L. Atlas, “Single-Channel Source Separation Using Complex Matrix Factorization,” IEEE Transactions on Audio, Speech, and Language Processing (submitted). [5] B. King and L. Atlas, “Single-channel Source Separation using Simplified-training Complex Matrix Factorization,” International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX: 2010.
… but overparameterization? • CMF's per-basis phase terms can yield a potentially infinite number of solutions that reconstruct the observation exactly… which isn't a good thing! • Example: estimating an observation with 3 bases admits many distinct (B, W, θ) solutions
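The non-uniqueness can be demonstrated concretely: under the per-basis phase model, two very different parameter sets reconstruct the same observation, because extra bases can cancel each other through opposite phases. The matrices below are toy values chosen for illustration.

```python
import numpy as np

def recon(B, W, theta):
    # X_hat[f,t] = sum_k B[f,k] * W[k,t] * exp(j * theta[k,f,t])
    return np.einsum('fk,kt,kft->ft', B, W, np.exp(1j * theta))

F, T, K = 2, 2, 3
X = np.ones((F, T), dtype=complex)
B = np.ones((F, K))

# Solution 1: basis 1 alone explains the observation
W1 = np.array([[1., 1.], [0., 0.], [0., 0.]])
th1 = np.zeros((K, F, T))

# Solution 2: bases 2 and 3 are both active but cancel via opposite phase
W2 = np.array([[1., 1.], [1., 1.], [1., 1.]])
th2 = np.zeros((K, F, T))
th2[2] = np.pi          # e^{j*pi} = -1 negates basis 3's contribution

print(np.allclose(recon(B, W1, th1), recon(B, W2, th2)))  # True
```

Both parameter sets reach zero reconstruction error, so the cost function alone cannot pick between them; this is the motivation for the uniqueness constraints proposed later.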
Review of Current Methods • NMF: difficult to extend; does not model superposition • PLSA: extendible; does not model superposition • CMF: difficult to extend; additive (models superposition) but overparameterized, so no unique solution
Proposed Solution: Complex Probabilistic Latent Semantic Analysis (CPLSA) • Goal: incorporate phase observation and estimation into the current nonnegative PLSA framework • Implicitly solves • Extensibility • Superposition • Proposal to solve • Overparameterization
Proposed Solution: Outline • Transform complex to nonnegative data • 3 CPLSA variants • Phase constraints for STFT consistency • Unique solution
Transform Complex to Nonnegative Data • Why is this important? • Modeling observed data Xf,t as a probability mass function • PMFs are nonnegative and real • So the observation needs to be nonnegative and real: if Xf,t is complex, it must first be transformed before it can be modeled as a PMF
Transform Complex to Nonnegative Data • Starting point: Shashanka [8] • N real → N+1 nonnegative • Algorithm • N+1-length orthogonal vectors (AN+1,N) • Affine transform (for nonnegativity) • Normalize • My new, proposed method • N complex → 2N real • 2N real data → 2N+1 nonnegative [8] M. Shashanka, “Simplex Decompositions for Real-Valued Datasets,” IEEE International Workshop on Machine Learning for Signal Processing, 2009, pp. 1-6.
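The chain above (N complex → 2N real → 2N+1 nonnegative) can be sketched in the spirit of Shashanka [8]; the particular orthogonal matrix, shift, and normalization below are illustrative choices, not the thesis's exact construction.

```python
import numpy as np

def complex_to_nonneg(x):
    """N complex -> 2N real (stack real/imag) -> 2N+1 nonnegative
    (orthogonal lift, affine shift, normalize). Returns the nonnegative
    vector plus the parameters needed to invert the mapping."""
    y = np.concatenate([x.real, x.imag])            # 2N real values
    M = y.size
    # Orthonormal columns lifting R^M into R^{M+1} (QR of a stacked basis)
    A, _ = np.linalg.qr(np.vstack([np.eye(M), np.ones((1, M))]))
    z = A @ y                                       # M+1 real values
    shift = -min(z.min(), 0.0)                      # affine shift -> nonnegative
    z = z + shift
    scale = z.sum()                                 # normalize to a PMF
    return z / scale, (A, shift, scale)
```

Because A has orthonormal columns, the original data is recoverable as y = Aᵀ(z·scale − shift), so no information is lost in making the data PMF-ready.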
3 Variants of CPLSA • #1 Complex bases • Phase is associated with bases • Not a good model for STFT • #2 Nonnegative bases + base-dependent phases • Good model for audio, but overparameterized
3 Variants of CPLSA • #3 Nonnegative bases + source-dependent phases • Additive source model • Good model for audio • Fewer parameters • Simplifies to NMF for the single-source case • Compare with CPLSA #2
Phase Constraints for STFT Consistency • A spectrogram X is a consistent STFT when STFT(ISTFT(X)) = X • Incorporate STFT consistency [9] into the phase estimation step for separated sources • Unique solution! [9] J. Le Roux, N. Ono, and S. Sagayama, “Explicit Consistency Constraints for STFT Spectrograms and Their Application to Phase Reconstruction,” 2008.
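The consistency condition is easy to test numerically: invert the spectrogram, re-analyze it, and measure the deviation. This sketch uses SciPy's STFT pair; the function name and defaults are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def consistency_error(X, fs=8000, nperseg=512):
    """Max deviation between X and STFT(ISTFT(X)).
    Near zero iff X is a consistent STFT of some time-domain signal."""
    _, x = istft(X, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x, fs=fs, nperseg=nperseg)
    n = min(X.shape[1], X2.shape[1])   # frame counts may differ by padding
    return np.max(np.abs(X[:, :n] - X2[:, :n]))
```

A spectrogram produced by an actual STFT passes this test almost exactly, while an arbitrary complex matrix does not, because overlapping frames constrain each other; that redundancy is what pins down the phase estimate.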
Summary of Proposed Theory • Goal: incorporate phase observation and estimation into current nonnegative PLSA framework (extensible, additive, unique) • Theory • Transform complex to nonnegative data • 3 CPLSA variants • Phase constraints for STFT consistency
Proposed Experiments • Separating speech in structured, nonstationary noise • Methods • CPLSA, PLSA, CMF • Noise • Babble noise • Automotive noise • Measurements • Objective perceptual • ASR
Objective Measurement Tests • Goal: explore parameter space • How they affect performance in CPLSA • Find best-performing parameters • Compare performance of CPLSA with PLSA, CMF • Data • TIMIT corpus [10] • Measurements • Blind Source Separation Evaluation Toolbox [11] • Perceptual Evaluation of Speech Quality (PESQ) [12] [10] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, and N.L. Dahlgren, DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus, NIST, 1993. [11] E. Vincent, R. Gribonval, and C. Fevotte, “Performance Measurement in Blind Audio Source Separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, 2006, pp. 1462-1469. [12] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual Evaluation of Speech Quality (PESQ) - A New Method for Speech Quality Assessment of Telephone Networks and Codecs,” ICASSP, 2001, pp. 749-752 vol. 2.
Automatic Speech Recognition Tests • Goal: test robustness of parameters • Use best-performing parameters from objective measurements • Compare performance of CPLSA with PLSA, CMF • Data • Wall Street Journal corpus [13] • ASR System • Sphinx-3 (CMU) [13] D.B. Paul and J.M. Baker, “The Design for the Wall Street Journal-Based CSR Corpus,” Proceedings of the Workshop on Speech and Natural Language, Stroudsburg, PA, USA: Association for Computational Linguistics, 1992, pp. 357-362.
Subway Noise • (Spectrogram figure: frequency (Hz) vs. time (s)) • NMF: 4.3 dB improvement
Subway Noise • (Spectrogram figure: frequency (Hz) vs. time (s)) • NMF: 4.2 dB improvement
Fountain Noise Example #1 • Target speaker synthetically added at -3 dB SNR • Speaker model trained on 60 seconds of clean speech
Fountain Noise Example #2 • No “clean speech” available for training of target talker • Generic speaker model used
Why not encode phase into bases? • Individual phase term: Xf,t ≈ Σk Bf,k Wk,t e^{jθk,f,t}, where each basis carries its own phase
Why not encode phase into bases? • Complex B, W: X ≈ BW with complex-valued bases and weights