This class covers MAP estimators, MAP estimation in Gaussian models, conditional probabilities, the Gaussian PDF, MMSE estimates, maximum likelihood estimates, regression issues, MAP estimation of weights, Tikhonov regularization, LASSO algorithms, compressive sensing, statistical estimators, and linear Gaussian models. It also includes a brief recap of principal component analysis, Eigenfaces, and Karhunen-Loève versus PCA.
Machine Learning for Signal Processing: Linear Gaussian Models. Class 17, 30 Oct 2014. Instructor: Bhiksha Raj. Course 11755/18797.
Recap: MAP Estimators • MAP (Maximum A Posteriori): find the statistically "best guess" for y given known x: ŷ = argmax_Y P(Y | x)
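As a brief aside (not on the slide), the MAP rule can be written out via Bayes' rule; since the denominator does not depend on y, it can be dropped:

```latex
\hat{y}_{\mathrm{MAP}} = \arg\max_{y} P(y \mid x)
  = \arg\max_{y} \frac{P(x \mid y)\, P(y)}{P(x)}
  = \arg\max_{y} P(x \mid y)\, P(y)
```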
Recap: MAP estimation • x and y are jointly Gaussian • The stacked vector z = [x; y] is therefore Gaussian
MAP estimation: Gaussian PDF (figure: joint Gaussian density over X and Y)
MAP estimation: the Gaussian at a particular value of X (figure: the slice of the joint density at X = x0)
Conditional Probability of y|x • The conditional probability of y given x is also Gaussian • The slice in the figure is Gaussian • The mean of this Gaussian is a function of x • The variance of y reduces if x is known • Uncertainty is reduced
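As a minimal numerical sketch of these bullets (not course code; the means and covariances are illustrative values), the conditional mean and variance of a jointly Gaussian pair can be computed directly:

```python
import numpy as np

# Illustrative joint Gaussian over (x, y): scalar means and (co)variances.
mu_x, mu_y = 1.0, 2.0
c_xx, c_xy, c_yy = 2.0, 1.2, 1.5

def conditional_y_given_x(x0):
    """Mean and variance of the Gaussian P(y | x = x0)."""
    mean = mu_y + (c_xy / c_xx) * (x0 - mu_x)  # mean is a linear function of x0
    var = c_yy - (c_xy ** 2) / c_xx            # variance shrinks: uncertainty is reduced
    return mean, var

print(conditional_y_given_x(0.5))
```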
MAP estimation: the Gaussian at a particular value of X (figure: the slice at X = x0; its peak is the most likely value of y)
MAP Estimation of a Gaussian RV (figure: the MAP estimate of y given X = x0)
It's also a minimum mean-squared error estimate • Minimize the error E[(y − ŷ)²] • Differentiating and equating to 0: the MMSE estimate is the mean of the distribution
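Spelling out the differentiation step (a standard derivation, restated here rather than copied from the slide):

```latex
\frac{d}{d\hat{y}}\, E\big[(y-\hat{y})^2\big]
  = -2\, E[y-\hat{y}]
  = -2\,\big(E[y]-\hat{y}\big) = 0
  \;\;\Rightarrow\;\; \hat{y}_{\mathrm{MMSE}} = E[y]
```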
For the Gaussian: MAP = MMSE • The most likely value is also the mean value • This would be true of any symmetric distribution
A Likelihood Perspective • y is a noisy reading of aᵀX • The error e is Gaussian • Estimate a from y = aᵀX + e
The Likelihood of the data • Probability of observing a specific y, given x, for a particular a • Probability of the collection: P(y | X, a) = ∏ᵢ P(yᵢ | xᵢ, a) • Assuming IID for convenience (not necessary)
A Maximum Likelihood Estimate • Maximizing the log probability is identical to minimizing the squared error
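A minimal sketch of this equivalence on synthetic data (the shapes follow the slides, with X as a D x N matrix of instances; a_true and the noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 200
X = rng.standard_normal((D, N))            # one column per instance
a_true = np.array([1.0, -2.0, 0.5])        # illustrative "true" weights
y = a_true @ X + 0.1 * rng.standard_normal(N)

# With Gaussian errors, maximizing the log-likelihood of y = a^T X + e
# is the same as minimizing the squared error, i.e. ordinary least squares.
a_ml, *_ = np.linalg.lstsq(X.T, y, rcond=None)   # argmin_a ||X^T a - y||^2
print(a_ml)                                      # close to a_true
```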
A problem with regressions • The ML fit is sensitive • The error is squared, so small variations in the data → large variations in the weights • Outliers affect it adversely • Unstable • If the dimension of X >= the number of instances, (XXᵀ) is not invertible
MAP estimation of weights: y = aᵀX + e • Assume the weights are drawn from a Gaussian: P(a) = N(0, σ²I) • Maximum likelihood estimate vs. maximum a posteriori estimate
MAP estimation of weights • Similar to the ML estimate with an additional term • P(a) = N(0, σ²I) • log P(a) = C − log σ − 0.5 σ⁻² ||a||²
MAP estimate of weights • Equivalent to diagonal loading of the correlation matrix • Improves the condition number of the correlation matrix, so it can be inverted with greater stability • Will not affect the estimation from well-conditioned data • Also called Tikhonov regularization • Dual form: ridge regression • This is a MAP estimate of the weights, not to be confused with the MAP estimate of Y
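A minimal sketch of the diagonally loaded (ridge / Tikhonov) solution; the regularization weight lam stands in for the ratio of noise variance to prior variance, which is an assumption made for illustration:

```python
import numpy as np

def ridge_weights(X, y, lam):
    """MAP weights under a Gaussian prior N(0, s^2 I): diagonal loading of X X^T.
    X is D x N, y has length N; lam > 0 keeps X X^T + lam*I invertible even when D >= N."""
    D = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ y)
```

With lam = 0 this reduces to the ML / least-squares estimate of the earlier slides.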
MAP estimate priors • Left: Gaussian Prior on W • Right: Laplacian Prior
MAP estimation of weights with a Laplacian prior • Assume the weights are drawn from a Laplacian: P(a) = λ⁻¹ exp(−λ⁻¹ |a|₁) • The maximum a posteriori estimate is identical to L1-regularized least-squares estimation • No closed-form solution; a quadratic-programming solution is required (non-trivial)
L1-regularized LSE • No closed-form solution • Quadratic programming solutions required • Dual formulation: "LASSO" (Least Absolute Shrinkage and Selection Operator): minimize ||Xᵀa − y||² subject to |a|₁ ≤ t
LASSO Algorithms • Various convex optimization algorithms • LARS: least angle regression • Pathwise coordinate descent • Matlab code is available on the web
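This is not the LARS or pathwise coordinate-descent code referred to above; it is just a minimal iterative soft-thresholding (ISTA) sketch for the same L1-regularized objective, with an illustrative iteration count:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||X^T a - y||^2 + lam * ||a||_1 by iterative soft-thresholding."""
    a = np.zeros(X.shape[0])
    L = np.linalg.norm(X @ X.T, 2)          # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X @ (X.T @ a - y)            # gradient of the smooth quadratic term
        a = soft_threshold(a - grad / L, lam / L)
    return a
```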
Regularized least squares • Regularization results in selection of a suboptimal (in the least-squares sense) solution: one of the loci outside the center • Tikhonov regularization selects the shortest solution • L1 regularization selects the sparsest solution. Image credit: Tibshirani
LASSO and Compressive Sensing • Given y and X, estimate a sparse a (figure: y = X a) • LASSO: X = explanatory variables, y = dependent variable, a = regression weights • Compressive sensing (CS): X = measurement matrix, y = measurements, a = data
MAP / ML / MMSE • General statistical estimators • All used to predict a variable, based on other parameters related to it • Most common assumption: the data are Gaussian; all RVs are Gaussian • Other probability densities may also be used • For Gaussians the relationships are linear, as we saw
Gaussians and more Gaussians… • Linear Gaussian Models • But first, a recap
A Brief Recap • Principal component analysis: find the K bases that best explain the given data • Find B and C such that the difference between D and BC is minimized, while constraining the columns of B to be orthonormal
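A minimal sketch of this objective via the SVD: the best rank-K product BC in the least-squares sense, with orthonormal columns in B. Centering of D is deliberately left out here; the Karhunen-Loève vs. PCA slides below discuss that distinction.

```python
import numpy as np

def pca_bases(D_mat, K):
    """Find B (orthonormal columns) and C minimizing ||D - B C||_F for a given K."""
    U, s, Vt = np.linalg.svd(D_mat, full_matrices=False)
    B = U[:, :K]                    # K orthonormal bases: B^T B = I
    C = np.diag(s[:K]) @ Vt[:K]     # weights of every data column on those bases
    return B, C
```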
Remember Eigenfaces • Approximate every face f as f = w_f,1 V₁ + w_f,2 V₂ + w_f,3 V₃ + … + w_f,k V_k • Estimate V to minimize the squared error • The error is unexplained by V₁ … V_k • The error is orthogonal to the Eigenfaces
Karhunen-Loève vs. PCA • Karhunen-Loève Transform: eigenvectors of the correlation matrix, i.e. the principal directions of the tightest ellipse centered on the origin; the directions that retain maximum energy • Principal Component Analysis: eigenvectors of the covariance matrix, i.e. the principal directions of the tightest ellipse centered on the data; the directions that retain maximum variance
Karhunen-Loève vs. PCA • If the data are naturally centered at the origin, KLT == PCA • The following slides refer to PCA! • Assume the data are centered at the origin for simplicity • Not essential, as we will see
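A small sketch of the distinction, assuming the data matrix has one column per instance (a convention, not something stated on the slide):

```python
import numpy as np

def klt_and_pca_bases(D_mat):
    """KLT: eigenvectors of the correlation matrix (data left uncentered).
    PCA: eigenvectors of the covariance matrix (data centered first)."""
    N = D_mat.shape[1]
    corr = D_mat @ D_mat.T / N
    centered = D_mat - D_mat.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / N
    klt_vecs = np.linalg.eigh(corr)[1][:, ::-1]   # columns sorted by decreasing eigenvalue
    pca_vecs = np.linalg.eigh(cov)[1][:, ::-1]
    return klt_vecs, pca_vecs
```

If the data mean is already zero, the correlation and covariance matrices coincide and so do the two sets of bases, as the slide notes.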
Eigen Representation: x₁ = w₁₁ V₁ + e₁ • K-dimensional representation • The error is orthogonal to the representation • The weight and error are specific to the data instance (illustration assuming 3D space)
Representation: x₂ = w₁₂ V₁ + e₂ • The error is at 90° to the eigenface • K-dimensional representation • The error is orthogonal to the representation • The weight and error are specific to the data instance (illustration assuming 3D space)
Representation • K-dimensional representation • The error is orthogonal to the representation • All data with the same representation wV₁ lie on a plane orthogonal to wV₁
With 2 bases: x₁ = w₁₁ V₁ + w₂₁ V₂ + e₁ • The error is at 90° to the eigenfaces • K-dimensional representation • The error is orthogonal to the representation • The weight and error are specific to the data instance (illustration assuming 3D space)
With 2 bases: x₂ = w₁₂ V₁ + w₂₂ V₂ + e₂ • The error is at 90° to the eigenfaces • K-dimensional representation • The error is orthogonal to the representation • The weight and error are specific to the data instance (illustration assuming 3D space)
In Vector Form: xᵢ = w₁ᵢ V₁ + w₂ᵢ V₂ + eᵢ • The error is at 90° to the eigenfaces • K-dimensional representation • The error is orthogonal to the representation • The weight and error are specific to the data instance
In Vector Form: xᵢ = w₁ᵢ V₁ + w₂ᵢ V₂ + eᵢ, i.e. xᵢ = V wᵢ + eᵢ • K-dimensional representation • x is a D-dimensional vector • V is a D × K matrix • w is a K-dimensional vector • e is a D-dimensional vector
Learning PCA • For the given data: find the K-dimensional subspace such that it captures most of the variance in the data • Variance in remaining subspace is minimal
Constraints • VᵀV = I: the eigenvectors are orthogonal to each other • For every vector, the error is orthogonal to the eigenvectors: eᵀV = 0 • Over the collection of data: the average wwᵀ is diagonal (the eigen-representations are uncorrelated), the determinant of the average eeᵀ is minimum (the error variance is minimal), and the mean of the error is 0
A Statistical Formulation of PCA • x is a random variable generated according to a linear relation: x = V w + e • w is drawn from a K-dimensional Gaussian with diagonal covariance • e is drawn from a 0-mean, (D−K)-rank, D-dimensional Gaussian • Estimate V (and B) given examples of x
Linear Gaussian Models!! • x is a random variable generated according to a linear relation • w is drawn from a Gaussian • e is drawn from a 0-mean Gaussian • Estimate V given examples of x; in the process also estimate B and E • PCA is a specific instance of a linear Gaussian model with particular constraints: B is diagonal, VᵀV = I, and E is low rank
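A generative sketch of the model just described. The parameter shapes and the explicit mean term mu are assumptions made for illustration (the later slides list the model variables as μ, V, B and E):

```python
import numpy as np

def sample_lgm(mu, V, B, E, n, seed=0):
    """Draw n samples of x = mu + V w + e, with w ~ N(0, B) and e ~ N(0, E)."""
    rng = np.random.default_rng(seed)
    D, K = V.shape
    W = rng.multivariate_normal(np.zeros(K), B, size=n).T    # K x n weights
    Err = rng.multivariate_normal(np.zeros(D), E, size=n).T  # D x n errors
    return mu[:, None] + V @ W + Err                         # D x n observations
```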
Linear Gaussian Models • Observations are linear functions of two uncorrelated Gaussian random variables: a "weight" variable w and an "error" variable e • The error is not correlated with the weight: E[eᵀw] = 0 • Learning LGMs: estimate the parameters of the model given instances of x • This is the problem of learning the distribution of a Gaussian RV
LGMs: Probability Density • The mean of x is the model's mean term: E[x] = μ • The covariance of x: Cov(x) = V B Vᵀ + E
The probability of x • x is a linear function of Gaussians, so x is also Gaussian • Its mean and covariance are as given above
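Putting the two moments together, a small sketch of evaluating log P(x); the SciPy call is just one convenient way to do it, not the course's code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def lgm_logpdf(x, mu, V, B, E):
    """log P(x) under the LGM: x is Gaussian with mean mu and covariance V B V^T + E."""
    cov = V @ B @ V.T + E
    return multivariate_normal(mean=mu, cov=cov).logpdf(x)
```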
Estimating the variables of the model • Estimating the variables of the LGM is equivalent to estimating P(x) • The variables are μ, V, B and E