
Variational and Scale Mixture Density Representations for Estimation in the Bayesian Linear Model

This presentation explores unsupervised learning of structure in continuous sensor data using probabilistic modeling and several representations, including sparse coding and independent component analysis.


Presentation Transcript


  1. Variational and Scale Mixture Density Representations for Estimation in the Bayesian Linear Model: Sparse Coding, Independent Component Analysis, and Minimum Entropy Segmentation. Jason Palmer, Department of Electrical and Computer Engineering, University of California San Diego

  2. Introduction • Unsupervised learning of structure in continuous sensor data • Data must be analyzed into component parts – reduced to set of states of the world which are active or not active in various combinations • Probabilistic modeling – states • Linear model • Basis sets • Hierarchical Linear processes • Also kernel non-linear • Probability model • Distributions of input variables – types of densities • Conditionally independent inputs – Markov connection of states • Thesis topics • Types of distributions and representations that lead to efficient and monotonic algorithms using non-Gaussian densities • Calculating probabilities in linear process model

  3. Linear Process Model • General form of the model: • Sparse coding: x(t) = A s(t), A overcomplete • ICA: x(t) = A s(t), A invertible, s(t) = A^{-1} x(t) • Blind deconvolution:

  4. Voice and Microphone [Figure: IID process passed through linear filters H_T(z), H_R(z), H_M(z) to give the observed process]

  5. Sensor Arrays [Figure: Source 1 and Source 2 observed by a sensor array]

  6. Convolutive Model

  7. Different Impulse Response

  8. EEG sources

  9. Binocular Color Image Channels • Represented as 2D field of 6D vectors • Binocular video can be represented as a 3D field of 6D vectors • Use block basis or mixture of filters [Figure: r, g, b channels for the R and L images]

  10. Biological Systems [Figure: WORLD, x(t), REPRESENTATION]

  11. Linear Process Mixture Model S-s-s-s-s-s-i-i-i-k---ks-s-s-s-s-s----t-t-t-t-e-e-e-e-n-n-n-n • Plot of speech signal: woman speaking the word “sixteen” • Clearly speech is non-stationary, but seems to be locally stationary

  12. Source model [Figure: block diagram of the source model producing s(t)]

  13. Observation Segment Model [Figure: filter A(z) producing x(t)]

  14. Generative Model [Figure: filters A_1(z), …, A_M(z) producing x_1(t), …, x_M(t), which form x(t)]

  15. Outline • Types of Probability densities • Sub- and Super-Gaussianity • Representation in terms of Gaussian • Convex variational representation – Strong Super-Gaussians • Gaussian Scale mixtures • Multivariate Gaussian Scale mixtures (ISA / IVA) • Relationship between representations • Sparse Coding and Dictionary Learning • Optimization with given (overcomplete) basis • MAP – Generalized FOCUSS • Global Convergence of Iteratively Re-weighted Least Squares (IRLS) • Convergence rates • Variational Bayes (VB) – Sparse Bayesian Learning • Dictionary Learning • Lagrangian Newton Algorithm • Comparison in Monte Carlo experiment • Independent Component Analysis • Convexity of the ICA optimization problem – stability of ICA • Fisher Information and Cramer-Rao lower bound on variance • Super-Gaussian Mixture Source Model • Comparison between Gaussian and Super-Gaussian updates • Linear Process Mixture Model • Probability of signal segments • Mixture model segmentation

  16. Sub- and Super-Gaussianity • Super-Gaussian = more peaked than Gaussian, heavier tail • Sub-Gaussian = flatter, more uniform, shorter tail than Gaussian [Figure: super-Gaussian, Gaussian, and sub-Gaussian densities] • Component density determines shape along direction of vector • Super-Gaussian = concentrated near zero, some large values • Sub-Gaussian = uniform around zero, no large values • Generalized Gaussian ∝ exp(-|x|^p): Laplacian (p = 1.0), Gaussian (p = 2.0), sub-Gaussian (p = 10.0) [Figure panel: Sub- and Super-Gaussian] • Super-Gaussians represent sparse random variables • Most often zero, occasionally large magnitudes • Sparse random variables model variables with on / off, active / inactive states
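The generalized Gaussian family on this slide can be checked numerically. The sketch below uses excess kurtosis as the measure of peakedness and tail weight; that choice is an assumption made for illustration (the talk itself goes on to define "strong" super-Gaussianity through convexity), and the function name is hypothetical.

```python
import math


def generalized_gaussian_excess_kurtosis(p: float) -> float:
    """Excess kurtosis of the density proportional to exp(-|x|**p)."""
    # For this density, E|x|^k is proportional to Gamma((k + 1) / p) / Gamma(1 / p),
    # so the kurtosis E[x^4] / E[x^2]^2 reduces to a ratio of Gamma functions.
    kurtosis = math.gamma(5.0 / p) * math.gamma(1.0 / p) / math.gamma(3.0 / p) ** 2
    return kurtosis - 3.0


# Shape parameters named on the slide: Laplacian, Gaussian, sub-Gaussian.
for p in (1.0, 2.0, 10.0):
    print(p, generalized_gaussian_excess_kurtosis(p))
```

Positive excess kurtosis (p = 1) corresponds to the super-Gaussian case, zero to the Gaussian (p = 2), and negative (p = 10) to the sub-Gaussian case.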

  17. Convex Variational Representation • Convex / concave functions are the pointwise supremum / infimum of linear functions • A convex function f(x) may be concave in x^2, i.e. f(x) = g(x^2) with g concave on (0, ∞) • Example: f(x) = |x|^{3/2} is convex in x, and g(y) = y^{3/4} is concave • Example: f(x) = |x|^4 is convex in x, and g(y) = y^2 is still convex [Figure: x^{3/2} is concave in x^2; x^4 is convex in x^2] • If f(x) is concave in x^2 and p(x) = exp(-f(x)), we say p(x) is Strongly Super-Gaussian • If f(x) is convex in x^2 and p(x) = exp(-f(x)), we say p(x) is Strongly Sub-Gaussian
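The statement that a concave function is a pointwise infimum of linear functions can be written out with concave duality. The notation here is an assumption, since the transcript does not preserve the slide's equations: g^* denotes the concave conjugate and ξ the variational parameter, and g is taken to be closed, concave, and increasing on (0, ∞).

```latex
f(x) = g(x^2) = \inf_{\xi \ge 0} \left\{ \xi x^2 - g^*(\xi) \right\},
\qquad
g^*(\xi) = \inf_{y > 0} \left\{ \xi y - g(y) \right\},
\\[4pt]
p(x) = e^{-f(x)} = \sup_{\xi \ge 0} \exp\!\left( -\xi x^2 + g^*(\xi) \right).
\\[4pt]
\text{Example: } f(x) = |x|,\quad g(y) = \sqrt{y},\quad g^*(\xi) = -\frac{1}{4\xi},
\quad
|x| = \inf_{\xi > 0} \left\{ \xi x^2 + \frac{1}{4\xi} \right\}
\ \text{ attained at } \ \xi = \frac{1}{2|x|}.
```

Under these assumptions a Strongly Super-Gaussian density is a supremum of scaled Gaussian-shaped functions, which is what makes quadratic (least squares) updates possible.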

  18. Scale Mixture Representation • Gaussian Scale Mixtures (GSMs) are sums of Gaussian densities with different variances, but all zero mean: [Figure: Gaussian scale mixture and its component Gaussians] • A random variable with a GSM density can be represented as the product of a standard Normal random variable Z and an arbitrary non-negative random variable W: X = Z W^{-1/2} • Multivariate densities can be modeled by the product of a non-negative scalar and a Gaussian random vector • Contribution: general formula for multivariate GSMs:
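A small sampling sketch of the product representation X = Z W^{-1/2}. It relies on a standard fact that is not stated on the slide: if V is exponential with unit mean, then sqrt(2V) Z is Laplacian with unit scale, so choosing W = 1/(2V) gives a Laplacian GSM. The function name and sample size are illustrative assumptions.

```python
import numpy as np


def sample_laplace_as_gsm(n: int, seed: int = 0) -> np.ndarray:
    """Draw Laplacian samples as X = Z * W**-0.5 (a Gaussian scale mixture)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)              # standard Normal Z
    v = rng.exponential(scale=1.0, size=n)  # assumed mixing law: unit-mean exponential
    w = 1.0 / (2.0 * v)                     # non-negative W, so W**-0.5 = sqrt(2 v)
    return z * w ** -0.5


x = sample_laplace_as_gsm(200_000)
variance = x.var()
excess_kurtosis = np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0
print(variance, excess_kurtosis)
```

A unit-scale Laplacian has variance 2 and excess kurtosis 3, so the two printed sample statistics should land near those values.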

  19. Relationship between Representations • Criterion for p(x) = exp(-f(x)) = exp(-g(x^2)) to have a convex variational representation: • Criterion for GSM representation given by the Bernstein-Widder Theorem on complete monotonicity (CM): • For Gaussian representation, need p(√x) to be CM • CM relationship (Bochner): • If p(√x) is CM, then it is log-convex, so g is concave, and thus the GSM representation implies the convex variational representation.
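For reference, the standard definition of complete monotonicity and the Bernstein-Widder characterization are sketched below. The symbols φ, μ, and t are assumptions introduced here; the slide's own equations are not preserved in the transcript.

```latex
\phi \text{ is CM on } (0,\infty)
\iff (-1)^n \phi^{(n)}(y) \ge 0 \ \ \text{for all } n \ge 0,\ y > 0
\iff \phi(y) = \int_0^\infty e^{-t y}\, d\mu(t) \ \text{ for some non-negative measure } \mu .
\\[4pt]
\text{With } \phi(y) = p(\sqrt{y}): \qquad
p(x) = \int_0^\infty e^{-t x^2}\, d\mu(t)
\quad \text{(a mixture of zero-mean Gaussian shapes with variance } 1/(2t)\text{)} .
\\[4pt]
\text{Cauchy-Schwarz on the integral gives } (\phi')^2 \le \phi\, \phi''
\ \Longrightarrow\ (\log \phi)'' \ge 0
\ \Longrightarrow\ g = -\log \phi \ \text{ is concave.}
```

The last line is the step behind the implication on this slide: every CM function is log-convex, so a density with a GSM representation is also concave in x^2 in the sense of the previous slide.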

  20. Sparse Regression – Variational MAP • Bayesian Linear Model x = As + v: basis A, sources s, noise v • Can always put in form: min f(s) subject to As = x, A overcomplete • For Strongly Super-Gaussian priors, p(s) = exp(-f(s)): • Sources are independent, cost function f(s) = Σ_i f(s_i), so the weight matrix is diagonal: • Solve: • s_old satisfies As = x, so right side is negative, so left side is negative

  21. Sparse Regression – MAP – GSM • For Gaussian Scale Mixture p(s), we have s = z ξ^{-1/2}, and s is conditionally Gaussian given ξ ⇒ EM algorithm • The complete log likelihood is quadratic since s is conditionally Gaussian: • Linear in ξ. For EM we need the expected value of ξ given x. But ξ → s → x is a Markov chain: • The GSM EM algorithm is thus the same as the Strongly Super-Gaussian algorithm – both are Iteratively Reweighted Least Squares (IRLS)

  22. Generalized FOCUSS • The FOCUSS algorithm is a particular MAP algorithm for sparse regression with f(s) = |s|^p or f(s) = log |s|. It was derived by Gorodnitsky and Rao (1997), and Rao and Kreutz-Delgado (1998) • With arbitrary Strongly Super-Gaussian source prior, Generalized FOCUSS: • Convergence is proved using Zangwill’s Global Convergence Theorem, which requires: (1) a descent function, (2) boundedness of iterates, and (3) closure (continuity) of the algorithm mapping • We prove a general theorem on boundedness of IRLS iterations with diagonal weight matrix: the least squares solution always lies in the bounded part of the orthant intersection [Figure: least squares solution with bounded and unbounded orthant-constraint intersections] • We also derive the convergence rate of Generalized FOCUSS when f(s) is convex. The convergence rate for concave f(s) was proved by Gorodnitsky and Rao. We give an alternative proof.
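The iteratively reweighted least squares idea behind FOCUSS can be sketched for the noiseless problem min Σ|s_i|^p subject to As = x. This is a minimal illustration under stated assumptions, not the Generalized FOCUSS update from the talk (whose equation is not preserved here): the weights |s_i|^(2-p) are the classical FOCUSS choice, and the small `eps` added to keep the linear system well conditioned is an addition of this sketch.

```python
import numpy as np


def focuss(A, x, p=1.0, n_iter=50, eps=1e-8):
    """IRLS sketch for min sum(|s_i|**p) subject to A @ s = x."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float)
    # Start from the minimum 2-norm solution of A s = x.
    s = A.T @ np.linalg.solve(A @ A.T, x)
    for _ in range(n_iter):
        # Diagonal weights from the current iterate (eps is a numerical safeguard).
        weights = np.abs(s) ** (2.0 - p) + eps
        AW = A * weights  # same as A @ diag(weights)
        # Weighted minimum-norm solution; it satisfies A s = x at every step.
        s = weights * (A.T @ np.linalg.solve(AW @ A.T, x))
    return s


A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
x = np.array([1.0, 1.0])
s = focuss(A, x)
print(s, A @ s)
```

In this small example the iterates move from the dense minimum 2-norm solution toward the sparse solution that uses only the third column of A, while the constraint As = x holds throughout.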

  23. Variational Bayes • General form of Sparse Bayesian Learning / Type II ML: • Find Normal density (mean and covariance) that minimizes an upper bound on KL divergence from true posterior density: • OR: MAP estimate of hyperparameters, ξ, in GSM (instead of s). • OR: Variational Bayes algorithm which finds the separable posterior q(s|x) q(ξ|x) that minimizes KL divergence from true posterior p(s, ξ|x). • The bound is derived using a modified Jensen’s inequality: • Then minimize the bound by coordinate descent as before. Also IRLS, same functional form but now diagonal weights are:

  24. Sparse Regression Example • An example of sparse regression with an overcomplete basis • The line is the one-dimensional solution space (translated null space) • Below, the posterior density p(s|x) in the null space is plotted for Generalized Gaussian priors with p = 1.0, p = 0.5, and p = 0.2

  25. Dictionary Learning • Problem: Given data x_1, …, x_N, find an (overcomplete) basis A for which As = x and the sources are sparse. • Three algorithms: (1) Lewicki-Sejnowski ICA (2) Kreutz-Delgado FOCUSS based (3) Girolami VB based algorithm • We derived a Lagrangian Newton algorithm similar to Kreutz-Delgado’s algorithm • Lewicki-Sejnowski • Kreutz-Delgado • Girolami VB • Lagrangian Newton • These algorithms have the general form:

  26. Dictionary Learning Monte Carlo • Experiment: generate random A matrices, sparse sources s, and data x = As, N = 100 m. • Test algorithms: • Girolami, p = 1.0, Jeffreys • Lagrangian Newton, p = 1.0, p = 1.1 • Kreutz-Delgado, (non-)normalized • Lewicki-Sejnowski, p = 1.1, Logistic [Figure panels: A 2 x 3, sparsity 1; A 4 x 8, sparsity 2; A 10 x 20, sparsity 1-5]

  27. Sparse Coding of EEG • Goal: find synchronous “events” in multiple interesting components • Learn basis for segments, length 100, across 5 channels • Events are rare, so the prior density is sparse EEG scalp maps:

  28. EEG Segment Basis: Subspace 1 • Experimental task: subject sees a sequence of letters and clicks the left mouse button if the letter is the same as the one two letters back, otherwise the right button • Each column is a basis vector: segment of length 100 x 5 channels • Only the second channel is active in this subspace – related to incorrect response by subject – subject hears buzzer when wrong response given • Dictionary learning with time series: must learn phase shifts

  29. EEG Segment Basis: Subspace 2 • In this subspace, channels 1 and 3 are active • Channel 3 crest slightly precedes channel 1 crest • This subspace is associated with correct response

  30. EEG Segment Basis: Subspace 3 • In this subspace, channels 1 and 2 have phase shifted 20 Hz bursts • Not obviously associated with any recorded event

  31. ICA • ICA model: x = As, with A invertible, s = Wx • Maximum Likelihood estimate of W = A^{-1}: • For independent sources: • Source densities unknown, must be adapted – Quasi-ML (Pham 92) • Since ML minimizes KL divergence over parametric family, ML with ICA model is equivalent to minimizing Mutual Information • If sources are Gaussian, A cannot be identified, only covariance • If sources are non-Gaussian, A can be identified (Cheng, Rosenblatt)
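A minimal sketch of maximum likelihood ICA for the model x = As, using the widely known natural-gradient update W ← W + η (I − E[φ(y) yᵀ]) W. This update and the fixed tanh score function φ are assumptions chosen for illustration (they suit super-Gaussian sources); the slide's own estimator adapts the source densities, which this sketch does not do. The mixing matrix, step size, and sample size are arbitrary.

```python
import numpy as np


def natural_gradient_ica(x, n_iter=1000, lr=0.1):
    """Estimate an unmixing matrix W for x = A s with super-Gaussian sources."""
    n_channels, n_samples = x.shape
    W = np.eye(n_channels)
    for _ in range(n_iter):
        y = W @ x
        score = np.tanh(y)  # assumed fixed score function (not adapted)
        # Natural-gradient ascent step on the log likelihood.
        W += lr * (np.eye(n_channels) - (score @ y.T) / n_samples) @ W
    return W


rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 20_000))    # independent super-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])           # hypothetical invertible mixing matrix
x = A @ s
W = natural_gradient_ica(x)
C = W @ A                            # "global system"; ideally a scaled permutation
print(C)
```

Because scale and order of the sources are not identifiable, success is judged by C = WA having one dominant entry per row rather than by W matching A^{-1} exactly.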

  32. ICA Hessian • Remarkably, the expected value of the Hessian of the ICA ML cost function can be calculated. • Work with the “global system” C = WA, whose optimum is always identity, C* = I. • Using independence of sources at the optimum, we can block diagonalize the Hessian linear operator H(B)=D in the global space into 2 x 2 blocks: • Expected Hessian is the Fisher Information matrix • Inverse is Cramer-Rao lower bound on unbiased estimator variance • Plot shows bound for off-diagonal element with Gen. Gauss. prior • Hessian also allows Newton method • For EM stability, replaced by: • Main condition for positive definiteness and convexity of ML problem at the optimum:

  33. Super-Gaussian Mixture Model • Variational formulation also allows derivation of generalization of Gaussian mixture model to strongly super-Gaussian mixtures: • The update rules are similar to the Gaussian mixture model, but include the variational parameters ξ

  34. Source Mixture Model Examples

  35. ICA Mixture Model – Images • Goal: find an efficient basis for representing image patches. Data vectors are 12 x 12 blocks.

  36. Covariance Square Root Sphere Basis

  37. ICA: Single Basis

  38. ICA Mixture Model: Model 1

  39. ICA Mixture Model: Model 2

  40. Image Segmentation 1 • Using the learned models, we classify each image block as from Model 1 or Model 2 • Lower left shows raw probability for Model 1 • Lower right shows binary segmentation • Blue captures high frequency ground

  41. Image Segmentation 2 • Again we classify each image block as from Model 1 or Model 2 • Lower left shows raw probability for Model 1 • Lower right shows binary segmentation • Blue captures high frequency tree bark

  42. Image model 1 basis densities • Low frequency components are not sparse, and may be multimodal • Edge filters in Model 1 are not as sparse as the higher frequency components of Model 2

  43. Image Model 2 densities • Densities are very sparse • Higher frequency components occur less often in the data • Convergence is less smooth

  44. Gen. Gauss. shape parameter histograms [Figure: histograms of the shape parameter for image bases and EEG bases, axis ticks at 1.2 and 2.0] • Image bases: more sparse, edge filters, etc. • EEG bases: less sparse, biological signals

  45. Rate Distortion Theory • Theorem of Gray shows that given a finite autoregressive process, the optimal rate transform is the inverse of the mixing filter • For difference distortion measures: • Proof seems to extend to general linear systems, and potentially mixture models • To the extent that Linear Process Mixture Models can model arbitrary piecewise linear random processes, linear mixture deconvolution is a general coding scheme with optimal rate [Figure: Z(t) → H(z) → X(t) → H^{-1}(z) → Z(t), with rate-distortion functions R_Z(D) and R_X(D)]

  46. Time Series Segment Likelihood • Multichannel convolution is a linear operation • Matrix is block Toeplitz • To calculate likelihood, need determinant • Extension of Szegö limit theorem • Can be extended to multi-dimensional fields, e.g. image convolution
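The role of the Szegő limit theorem in the likelihood can be illustrated in the simplest setting. The sketch below is a scalar, single-channel example only (the slide concerns the multichannel, block Toeplitz extension): for an AR(1) process with coefficient `a` and unit innovation variance, chosen here as an assumption, it compares the per-sample log determinant of the Toeplitz covariance with the average log spectral density.

```python
import numpy as np


def ar1_covariance(n, a):
    """n x n Toeplitz covariance of an AR(1) process with unit innovation variance."""
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return a ** lags / (1.0 - a ** 2)


def mean_log_spectrum(a, n_grid=4096):
    """(1 / 2 pi) * integral of log S(w) dw, approximated on a uniform grid."""
    omega = 2.0 * np.pi * np.arange(n_grid) / n_grid
    spectrum = 1.0 / np.abs(1.0 - a * np.exp(-1j * omega)) ** 2
    return np.mean(np.log(spectrum))


a = 0.5
for n in (10, 100, 1000):
    sign, logdet = np.linalg.slogdet(ar1_covariance(n, a))
    # Szego: logdet / n approaches the mean log spectrum as n grows.
    print(n, logdet / n, mean_log_spectrum(a))
```

For this process the log determinant stays at the constant -log(1 - a^2) for every n, so the per-sample value shrinks toward the spectral average, which is zero here; this is why long-segment likelihoods can be computed from the filter rather than from an explicit determinant.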

  47. Segmented EEG source time series • Linear Process Mixture Model run on several sources – 2 models • Coherent activity is identified and segmented blindly • Spectral density resolution greatly enhanced by eliminating noise

  48. Spectral Density Enhancement • Spectra before segmentation / rejection (left) and after (right) • Spectral peaks invisible in the all-series spectrum become visible in the segmented spectrum [Figure: all-series spectrum and segmented spectrum for the Source A and Source B channels]

  49. Future Work • Fully implement hierarchical linear process model • Implement Hidden Markov Model to learn relationships among various model states • Test new multivariate dependent density models • Implement multivariate convolutive model, e.g. on images to learn wavelets, and test video coding rates • Implement Linear Process Mixture Model in VLSI circuits

  50. Publications • “A Globally Convergent Algorithm for MAP Estimation with Non-Gaussian Priors,” Proceedings of the 36th Asilomar Conference on Signals and Systems, 2002. • “A General Framework for Component Estimation,” Proceedings of the 4th International Symposium on Independent Component Analysis, 2003. • “Variational EM Algorithms for Non-Gaussian Latent Variable Models,” Advances in Neural Information Processing Systems, 2005. • “Super-Gaussian Mixture Source Model for ICA,” Proceedings of the 6th International Symposium on Independent Component Analysis, 2006. • “Linear Process Mixture Model for Piecewise Stationary Multichannel Blind Deconvolution,” submitted ICASSP 2007
