This presentation explores unsupervised learning of structure in continuous sensor data using probabilistic modeling and representations such as sparse coding and independent component analysis.
Variational and Scale Mixture Density Representations for Estimation in the Bayesian Linear Model: Sparse Coding, Independent Component Analysis, and Minimum Entropy Segmentation
Jason Palmer
Department of Electrical and Computer Engineering
University of California, San Diego
Introduction
• Unsupervised learning of structure in continuous sensor data
• Data must be analyzed into component parts – reduced to a set of states of the world which are active or inactive in various combinations
• Probabilistic modeling of the states
• Linear model
  • Basis sets
  • Hierarchical linear processes
  • Also kernel-based non-linear models
• Probability model
  • Distributions of input variables – types of densities
  • Conditionally independent inputs – Markov connection of states
• Thesis topics
  • Types of distributions and representations that lead to efficient, monotonic algorithms using non-Gaussian densities
  • Calculating probabilities in the linear process model
Linear Process Model
• General form of the model: a linear process x(t) = A(z) s(t), i.e. x(t) = Σ_τ A(τ) s(t − τ)
• Sparse coding: x(t) = A s(t), A overcomplete
• ICA: x(t) = A s(t), A invertible, s(t) = A⁻¹ x(t)
• Blind deconvolution: x(t) = A(z) s(t), with a convolutive (filter) mixing system (see the sketch below)
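To make the three special cases concrete, here is a minimal NumPy sketch generating synthetic data under each model. The matrix sizes, sparsity level, and Laplacian sources are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch (illustrative only): synthetic data under the three
# special cases of the linear process model x(t) = A(z) s(t).
import numpy as np

rng = np.random.default_rng(0)
T = 1000                                       # number of time samples

# Sparse coding: A overcomplete (more sources than sensors), s sparse
A_sc = rng.standard_normal((4, 8))             # 4 sensors, 8 sources
s_sc = rng.laplace(size=(8, T)) * (rng.random((8, T)) < 0.1)  # mostly zero
x_sc = A_sc @ s_sc

# ICA: A square and invertible, so s is recoverable as s = A^{-1} x
A_ica = rng.standard_normal((4, 4))
s_ica = rng.laplace(size=(4, T))               # independent super-Gaussian sources
x_ica = A_ica @ s_ica
s_rec = np.linalg.solve(A_ica, x_ica)          # equals s_ica up to round-off

# Blind deconvolution: convolutive mixing x(t) = sum_tau A(tau) s(t - tau)
A_taps = rng.standard_normal((3, 2, 2))        # 3 filter taps, each a 2x2 mixing matrix
s_bd = rng.laplace(size=(2, T))
x_bd = np.zeros((2, T))
for tau, A_tau in enumerate(A_taps):
    x_bd[:, tau:] += A_tau @ s_bd[:, :T - tau]
```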
Voice and Microphone
(Diagram: an IID innovation process passes through linear filters HT(z), HR(z), HM(z) to produce the observed process.)
Sensor Arrays
(Diagram: Source 1 and Source 2 impinging on a sensor array.)
EEG sources
Binocular Color Image Channels
• Represented as a 2D field of 6D vectors (left and right r, g, b channels)
• Binocular video can be represented as a 3D field of 6D vectors
• Use a block basis or a mixture of filters
Biological Systems
(Diagram: the world generates sensor data x(t); the organism forms an internal representation.)
Linear Process Mixture Model
• Plot of a speech signal: a woman speaking the word "sixteen", annotated with its phoneme segments (s…, i…, k, s…, t…, e…, n…)
• Clearly speech is non-stationary, but it appears to be locally stationary
Source Model
(Block diagram: independent innovation processes are passed through per-source linear filters to produce the source vector s(t).)
Observation Segment Model
(Block diagram: the sources are mixed by the filter A(z) to produce the observation x(t).)
Generative Model
(Block diagram: M linear process models with mixing filters A1(z), …, AM(z) generate x1(t), …, xM(t); the observed signal x(t) switches among them.)
Outline
• Types of probability densities
  • Sub- and super-Gaussianity
  • Representation in terms of Gaussians
    • Convex variational representation – strong super-Gaussians
    • Gaussian scale mixtures
    • Multivariate Gaussian scale mixtures (ISA / IVA)
  • Relationship between representations
• Sparse coding and dictionary learning
  • Optimization with a given (overcomplete) basis
    • MAP – Generalized FOCUSS
      • Global convergence of Iteratively Reweighted Least Squares (IRLS)
      • Convergence rates
    • Variational Bayes (VB) – Sparse Bayesian Learning
  • Dictionary learning
    • Lagrangian Newton algorithm
    • Comparison in a Monte Carlo experiment
• Independent Component Analysis
  • Convexity of the ICA optimization problem – stability of ICA
  • Fisher information and Cramér-Rao lower bound on variance
  • Super-Gaussian mixture source model
  • Comparison between Gaussian and super-Gaussian updates
• Linear Process Mixture Model
  • Probability of signal segments
  • Mixture model segmentation
Sub- and Super-Gaussianity
• Super-Gaussian = more peaked than Gaussian, heavier tail
• Sub-Gaussian = flatter, more uniform, shorter tail than Gaussian
• The component density determines the shape of the data density along the direction of the corresponding basis vector
  • Super-Gaussian = concentrated near zero, with occasional large values
  • Sub-Gaussian = uniform around zero, no large values
• Super-Gaussians represent sparse random variables
  • Most often zero, occasionally large in magnitude
  • Sparse random variables model features with on/off, active/inactive states
(Plots: super-Gaussian, Gaussian, and sub-Gaussian density shapes; the generalized Gaussian family exp(-|x|^p) with Laplacian p = 1.0, Gaussian p = 2.0, sub-Gaussian p = 10.0; a density that is sub- and super-Gaussian.)
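As a quick numerical illustration (an assumed example, not from the thesis), one can sample the generalized Gaussian family with SciPy and check the sign of the excess kurtosis, a rough sub/super-Gaussian indicator:

```python
# Excess kurtosis for the generalized Gaussian family exp(-|x|^p):
# positive for super-Gaussian (p < 2), zero for Gaussian, negative for sub-Gaussian.
import numpy as np
from scipy.stats import gennorm, kurtosis

rng = np.random.default_rng(1)
for p in (1.0, 2.0, 10.0):                  # Laplacian, Gaussian, sub-Gaussian
    x = gennorm.rvs(p, size=200_000, random_state=rng)
    k = kurtosis(x)                         # excess kurtosis (Fisher definition)
    print(f"p = {p:4.1f}  excess kurtosis ≈ {k:+.2f}")
```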
Convex Variational Representation
• Convex / concave functions are the pointwise supremum / infimum of families of linear functions
• A convex function f(x) may be concave in x², i.e. f(x) = g(x²) with g concave on (0, ∞)
  • Example: |x|^(3/2) is convex in x but concave in x² (g(u) = u^(3/4) is concave)
  • Example: |x|^4 is convex in x and still convex in x² (g(u) = u² is convex)
• If f(x) is concave in x² and p(x) = exp(-f(x)), we say p(x) is strongly super-Gaussian
• If f(x) is convex in x² and p(x) = exp(-f(x)), we say p(x) is strongly sub-Gaussian
(Plots: convex vs. concave functions; x^(3/2) and x^4 plotted against x and against x².)
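In symbols, concavity of g(u) = f(√u) gives a representation of p as a pointwise supremum of scaled Gaussians via the concave conjugate. This is a sketch in my own sign and scale conventions; the thesis may absorb factors of 1/2 differently.

```latex
% f(x) = g(x^2) with g concave on (0, \infty); g^* is the concave conjugate.
g(u) = \inf_{\xi \ge 0} \bigl[\, \xi u - g^{*}(\xi) \,\bigr],
\qquad
g^{*}(\xi) = \inf_{u \ge 0} \bigl[\, \xi u - g(u) \,\bigr]
\\[4pt]
p(x) \;=\; e^{-g(x^{2})}
     \;=\; \sup_{\xi \ge 0} \exp\!\bigl( -\xi x^{2} + g^{*}(\xi) \bigr)
```

Each fixed ξ thus gives an unnormalized Gaussian lower bound on p(x) that is tight at the optimizing ξ, which is what makes coordinate-wise (IRLS-type) updates monotonic.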
Scale Mixture Representation
• Gaussian scale mixtures (GSMs) are mixtures of zero-mean Gaussian densities with different variances
• A random variable with a GSM density can be represented as the product of a standard normal random variable Z and an arbitrary non-negative random variable W: X = Z W^(-1/2) (see the sketch below)
• Multivariate densities can be modeled as the product of a non-negative scalar and a Gaussian random vector: X = Z W^(-1/2)
• Contribution: a general formula for multivariate GSMs
(Plot: a GSM density as an envelope of zero-mean Gaussians with different variances.)
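A standard concrete case (a classical result, not specific to the thesis) is that the Laplacian density is a GSM with exponentially distributed variance. The Monte Carlo check below uses the variance form X = √V · Z, i.e. W = 1/V in the slide's X = Z W^(-1/2) notation.

```python
# Laplacian as a Gaussian scale mixture: X = sqrt(V) * Z with V ~ Exponential.
import numpy as np

rng = np.random.default_rng(2)
n, b = 500_000, 1.0                           # b = Laplacian scale parameter
V = rng.exponential(scale=2 * b**2, size=n)   # mixing variance with mean 2 b^2
Z = rng.standard_normal(n)
X = np.sqrt(V) * Z                            # GSM samples

# Compare empirical quantiles with the analytic Laplace(0, b) quantiles
q = np.array([0.75, 0.90, 0.99])
analytic = -b * np.log(2 * (1 - q))           # Laplace quantile function for q > 0.5
print(np.quantile(X, q))                      # should be close to `analytic`
print(analytic)
```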
Relationship between Representations
• Criterion for p(x) = exp(-f(x)) = exp(-g(x²)) to have the convex variational representation: g must be concave on (0, ∞)
• Criterion for the GSM representation, given by the Bernstein-Widder theorem on complete monotonicity (CM): p(√t) must be completely monotone in t
• For the Gaussian (scale mixture) representation, we need CM
• CM relationship (Bochner)
• If p(√t) is CM, then g(t) = f(√t) is concave, and thus the GSM representation implies the convex variational representation
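A compact statement of the two criteria, in my notation (constants and the exact form of the mixing measure may differ from the thesis):

```latex
% Strongly super-Gaussian (convex variational representation):
%   g(u) = f(\sqrt{u}) is concave on (0, \infty),
%   equivalently f'(x)/x is nonincreasing for x > 0 (for differentiable f).
%
% Gaussian scale mixture (Bernstein--Widder): p(\sqrt{t}) is completely monotone,
% i.e., up to normalization constants absorbed into the measure \mu,
(-1)^{n}\,\frac{d^{n}}{dt^{n}}\, p\bigl(\sqrt{t}\bigr) \;\ge\; 0
\quad \text{for all } n \ge 0,\; t > 0
\;\Longleftrightarrow\;
p(x) \;=\; \int_{0}^{\infty} \mathcal{N}\!\bigl(x;\, 0,\, \xi^{-1}\bigr)\, d\mu(\xi)
```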
Sparse Regression – Variational MAP
• Bayesian linear model x = As + v: basis A, sources s, noise v
• Can always be put in the form: min f(s) subject to As = x, with A overcomplete
• For strongly super-Gaussian priors, p(s) = exp(-f(s)), use the convex variational bound
• The sources are independent, so the cost function is f(s) = Σᵢ f(sᵢ) and the weight matrix Π(s) is diagonal
• Solve the resulting reweighted least-squares problem at each iteration
• s_old already satisfies As = x, so the right-hand side of the bound is negative, hence the left-hand side is negative: each iteration decreases the cost
Sparse Regression – MAP – GSM
• For a Gaussian scale mixture prior p(s), we have s = z ξ^(-1/2), and s is conditionally Gaussian given ξ, which suggests an EM algorithm
• The complete log likelihood is quadratic since s is conditionally Gaussian
• It is linear in ξ, so for EM we need the expected value of ξ given x; but ξ → s → x is a Markov chain, so the required expectation reduces to E[ξ | s]
• The GSM EM algorithm is thus the same as the strongly super-Gaussian algorithm – both are Iteratively Reweighted Least Squares (IRLS)
Generalized FOCUSS
• The FOCUSS algorithm is a particular MAP algorithm for sparse regression with f(s) = |s|^p or f(s) = log |s|. It was derived by Gorodnitsky and Rao (1997) and Rao and Kreutz-Delgado (1998)
• With an arbitrary strongly super-Gaussian source prior, we obtain Generalized FOCUSS (a sketch follows below)
• Convergence is proved using Zangwill's Global Convergence Theorem, which requires: (1) a descent function, (2) boundedness of the iterates, and (3) closure (continuity) of the algorithm mapping
• We prove a general theorem on the boundedness of IRLS iterations with a diagonal weight matrix: the least squares solution always lies in the bounded part of the orthant-constraint intersection
• We also derive the convergence rate of Generalized FOCUSS for convex f(s); the convergence rate for concave f(s) was proved by Gorodnitsky and Rao, and we give an alternative proof
(Diagrams: least squares solution; bounded and unbounded orthant-constraint intersections.)
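A minimal IRLS sketch of the noiseless update, assuming the standard reweighting Π = diag(sᵢ / f'(sᵢ)) specialized to f(s) = |s|^p. The initialization, the small epsilon, and the toy problem are my illustrative choices, not the thesis implementation.

```python
# IRLS / FOCUSS-style sketch for  min sum_i |s_i|^p  subject to  A s = x.
import numpy as np

def focuss(A, x, p=1.0, n_iter=50, eps=1e-8):
    s = np.linalg.pinv(A) @ x               # minimum-norm initialization
    for _ in range(n_iter):
        Pi = np.abs(s) ** (2.0 - p) + eps   # diagonal weights s_i / f'(s_i), up to a constant
        APA = (A * Pi) @ A.T                # A Pi A^T
        s = Pi * (A.T @ np.linalg.solve(APA, x))   # s <- Pi A^T (A Pi A^T)^{-1} x
    return s

# Toy example: 4 x 8 overcomplete basis, 2-sparse source
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 8))
s_true = np.zeros(8); s_true[[1, 5]] = [2.0, -1.5]
x = A @ s_true
print(np.round(focuss(A, x, p=0.8), 3))     # should concentrate on entries 1 and 5
```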
Variational Bayes
• General form of Sparse Bayesian Learning / Type II ML
• Find the Normal density (mean and covariance) that minimizes an upper bound on the KL divergence from the true posterior density
• OR: MAP estimate of the hyperparameters ξ in the GSM (instead of s)
• OR: a Variational Bayes algorithm that finds the separable posterior q(s|x) q(ξ|x) minimizing the KL divergence from the true posterior p(s, ξ | x)
• The bound is derived using a modified Jensen's inequality
• Then minimize the bound by coordinate descent as before; this is again IRLS with the same functional form, but now the diagonal weights are posterior expectations of ξ (a generic sketch follows below)
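For orientation, here is a generic Sparse Bayesian Learning (Type II ML) EM sketch in the style of Tipping (2001). The noise level, iteration count, and toy problem are assumptions; this is not the thesis algorithm, only the standard hyperparameter/posterior alternation it generalizes.

```python
# Generic SBL EM sketch.  Model: x = A s + v,  v ~ N(0, sigma2 I),  s_i ~ N(0, gamma_i).
import numpy as np

def sbl_em(A, x, sigma2=1e-3, n_iter=100):
    n = A.shape[1]
    gamma = np.ones(n)
    for _ in range(n_iter):
        # Posterior over s given the current hyperparameters gamma
        Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(1.0 / gamma))
        mu = Sigma @ A.T @ x / sigma2
        # EM update of the hyperparameters from the posterior second moments
        gamma = np.maximum(mu**2 + np.diag(Sigma), 1e-10)
    return mu, gamma

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 8))
s_true = np.zeros(8); s_true[[2, 6]] = [1.5, -2.0]
x = A @ s_true + 0.01 * rng.standard_normal(4)
mu, gamma = sbl_em(A, x)
print(np.round(mu, 2))      # most entries shrink toward zero; entries 2 and 6 survive
```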
Sparse Regression Example
• An example of sparse regression with an overcomplete basis
• The line is the one-dimensional solution space (translated null space)
• Below, the posterior density p(s|x) in the null space is plotted for generalized Gaussian priors with p = 1.0, p = 0.5, and p = 0.2
Dictionary Learning
• Problem: given data x1, …, xN, find an (overcomplete) basis A for which As = x and the sources are sparse
• Three existing algorithms: (1) Lewicki-Sejnowski ICA-based, (2) Kreutz-Delgado FOCUSS-based, (3) Girolami VB-based
• We derived a Lagrangian Newton algorithm similar to Kreutz-Delgado's algorithm
• The Lewicki-Sejnowski, Kreutz-Delgado, Girolami VB, and Lagrangian Newton update rules all share the same general form (equations on the slide; a generic alternating sketch follows below)
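To show the generic alternating structure these methods share, here is an assumed illustration that is none of the four algorithms above: an l1 sparse-coding step alternated with a least-squares dictionary update and column renormalization.

```python
# Generic alternating dictionary-learning sketch (illustrative only).
import numpy as np

def sparse_code(A, X, lam=0.1, n_iter=100):
    """ISTA-style l1 sparse coding of all columns of X in basis A."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    S = np.zeros((A.shape[1], X.shape[1]))
    for _ in range(n_iter):
        G = S - (A.T @ (A @ S - X)) / L      # gradient step on 0.5 ||X - A S||^2
        S = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)   # soft threshold
    return S

def learn_dictionary(X, n_atoms, n_outer=30, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[0], n_atoms))
    A /= np.linalg.norm(A, axis=0)
    for _ in range(n_outer):
        S = sparse_code(A, X)                              # sparse-coding step
        A = X @ S.T @ np.linalg.pinv(S @ S.T)              # least-squares dictionary update
        A /= np.linalg.norm(A, axis=0) + 1e-12             # renormalize the columns
    return A
```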
Dictionary Learning Monte Carlo
• Experiment: generate random A matrices, sparse sources s, and data x = As, with N = 100 m samples
• Test algorithms:
  • Girolami, p = 1.0, Jeffreys prior
  • Lagrangian Newton, p = 1.0 and p = 1.1
  • Kreutz-Delgado, normalized and non-normalized
  • Lewicki-Sejnowski, p = 1.1, logistic
(Result plots: A 2 x 3 with sparsity 1, A 4 x 8 with sparsity 2, A 10 x 20 with sparsity 1-5.)
Sparse Coding of EEG
• Goal: find synchronous "events" in multiple interesting components
• Learn a basis for segments of length 100 across 5 channels
• Events are rare, so the prior density is sparse
(EEG scalp maps shown.)
EEG Segment Basis: Subspace 1
• Experimental task: the subject sees a sequence of letters and clicks the left mouse button if the letter is the same as the one two letters back, otherwise clicks the right button
• Each column is a basis vector: a segment of length 100 x 5 channels
• Only the second channel is active in this subspace – related to incorrect responses by the subject, who hears a buzzer when a wrong response is given
• Dictionary learning with time series: phase shifts must be learned
EEG Segment Basis: Subspace 2
• In this subspace, channels 1 and 3 are active
• The channel 3 crest slightly precedes the channel 1 crest
• This subspace is associated with correct responses
EEG Segment Basis: Subspace 3
• In this subspace, channels 1 and 2 have phase-shifted 20 Hz bursts
• Not obviously associated with any recorded event
ICA
• ICA model: x = As, with A invertible, s = Wx
• Maximum likelihood estimate of W = A⁻¹: maximize L(W) = log |det W| + Σᵢ E[log pᵢ(wᵢᵀ x)]
• For independent sources: p(s) = Πᵢ pᵢ(sᵢ)
• The source densities are unknown and must be adapted – quasi-ML (Pham 1992)
• Since ML minimizes KL divergence over a parametric family, ML with the ICA model is equivalent to minimizing mutual information (a standard gradient sketch follows below)
• If the sources are Gaussian, A cannot be identified, only the covariance
• If the sources are non-Gaussian, A can be identified (Cheng, Rosenblatt)
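A standard natural-gradient maximum-likelihood ICA sketch (Amari-style rule with a fixed tanh score, suitable for super-Gaussian sources). The learning rate, score function, and toy mixing matrix are assumptions; this is the textbook update, not the thesis implementation.

```python
# Natural-gradient ML ICA:  dW = (I - E[phi(s) s^T]) W.
import numpy as np

def ica_natural_gradient(X, lr=0.01, n_iter=2000, seed=0):
    n, T = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))
    for _ in range(n_iter):
        S = W @ X
        phi = np.tanh(S)                        # score -d/ds log p(s) for a logistic-like prior
        dW = (np.eye(n) - (phi @ S.T) / T) @ W  # natural gradient of the log likelihood
        W += lr * dW
    return W

# Toy demo: mix two Laplacian sources and unmix them
rng = np.random.default_rng(5)
S_true = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
W = ica_natural_gradient(A @ S_true)
print(np.round(W @ A, 2))   # should be close to a scaled permutation of the identity
```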
ICA Hessian
• Remarkably, the expected value of the Hessian of the ICA ML cost function can be calculated
• Work with the "global system" C = WA, whose optimum is always the identity, C* = I
• Using the independence of the sources at the optimum, the Hessian linear operator H(B) = D can be block-diagonalized in the global space into 2 x 2 blocks
• The expected Hessian is the Fisher information matrix
• Its inverse is the Cramér-Rao lower bound on the variance of unbiased estimators
• The plot shows the bound for an off-diagonal element with a generalized Gaussian prior
• The Hessian also enables a Newton method
• For EM stability, the Hessian term is replaced (expression given on the slide)
• Main condition for positive definiteness and convexity of the ML problem at the optimum (condition given on the slide)
Super-Gaussian Mixture Model
• The variational formulation also allows derivation of a generalization of the Gaussian mixture model to strongly super-Gaussian mixtures
• The update rules are similar to those of the Gaussian mixture model, but include the variational parameters
ICA Mixture Model – Images
• Goal: find an efficient basis for representing image patches
• Data vectors are 12 x 12 blocks
Image Segmentation 1
• Using the learned models, we classify each image block as coming from Model 1 or Model 2
• Lower left shows the raw probability for Model 1
• Lower right shows the binary segmentation
• Blue captures the high-frequency ground
Image Segmentation 2
• Again we classify each image block as coming from Model 1 or Model 2
• Lower left shows the raw probability for Model 1
• Lower right shows the binary segmentation
• Blue captures the high-frequency tree bark
Image Model 1 Basis Densities
• Low-frequency components are not sparse, and may be multimodal
• Edge filters in Model 1 are not as sparse as the higher-frequency components of Model 2
Image Model 2 Basis Densities
• Densities are very sparse
• Higher-frequency components occur less often in the data
• Convergence is less smooth
Generalized Gaussian Shape Parameter Histograms
• Histograms of the fitted shape parameters (roughly 1.2 to 2.0) for the image bases and the EEG bases
• More sparse components: edge filters, etc.; less sparse components: biological signals
Rate Distortion Theory
• A theorem of Gray shows that, for a finite autoregressive process and difference distortion measures, the rate-optimal transform is the inverse of the mixing filter
• The proof seems to extend to general linear systems, and potentially to mixture models
• To the extent that Linear Process Mixture Models can model arbitrary piecewise linear random processes, linear mixture deconvolution is a general coding scheme with optimal rate
(Diagram: Z(t) → H(z) → X(t) → H⁻¹(z) → Z(t), with rate-distortion functions R_Z(D) and R_X(D).)
Time Series Segment Likelihood
• Multichannel convolution is a linear operation
• The corresponding matrix is block Toeplitz
• To calculate the segment likelihood, we need its determinant (see the scalar illustration below)
• We use an extension of the Szegő limit theorem
• The approach can be extended to multi-dimensional fields, e.g. image convolution
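A scalar illustration of the Szegő limit idea behind the log-determinant term, using an assumed MA(1) covariance; the thesis works with the block Toeplitz, multichannel case, so this only shows the flavor of the approximation.

```python
# Szego limit theorem, scalar case: for a Toeplitz covariance T_N with spectral
# density f(w),  (1/N) log det T_N  ->  (1/2 pi) \int log f(w) dw  as N grows.
import numpy as np
from scipy.linalg import toeplitz

a, N = 0.5, 400                                    # MA(1): x(t) = e(t) + a e(t-1)
c = np.zeros(N); c[0], c[1] = 1 + a**2, a          # autocovariances
T = toeplitz(c)                                    # Toeplitz covariance matrix
sign, logdet = np.linalg.slogdet(T)
per_sample = logdet / N

w = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
f = np.abs(1 + a * np.exp(-1j * w)) ** 2           # spectral density
szego = np.mean(np.log(f))                         # (1/2 pi) integral of log f

print(per_sample, szego)    # both approach 0 (= log of the unit innovation variance)
```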
Segmented EEG Source Time Series
• Linear Process Mixture Model run on several sources, with 2 models
• Coherent activity is identified and segmented blindly
• Spectral density resolution is greatly enhanced by eliminating noise
Spectral Density Enhancement
• Spectra before segmentation / rejection (left) and after (right)
• Spectral peaks invisible in the all-series spectrum become visible in the segmented spectrum
(Plots: all-series and segmented spectra for a Source A channel and a Source B channel.)
Future Work
• Fully implement the hierarchical linear process model
• Implement a Hidden Markov Model to learn relationships among the various model states
• Test new multivariate dependent density models
• Implement the multivariate convolutive model, e.g. on images to learn wavelets, and test video coding rates
• Implement the Linear Process Mixture Model in VLSI circuits
Publications
• "A Globally Convergent Algorithm for MAP Estimation with Non-Gaussian Priors," Proceedings of the 36th Asilomar Conference on Signals and Systems, 2002.
• "A General Framework for Component Estimation," Proceedings of the 4th International Symposium on Independent Component Analysis, 2003.
• "Variational EM Algorithms for Non-Gaussian Latent Variable Models," Advances in Neural Information Processing Systems, 2005.
• "Super-Gaussian Mixture Source Model for ICA," Proceedings of the 6th International Symposium on Independent Component Analysis, 2006.
• "Linear Process Mixture Model for Piecewise Stationary Multichannel Blind Deconvolution," submitted to ICASSP 2007.