Lars Kasper, December 15th 2010 Pattern Recognition and Machine Learning Chapter 12: Continuous Latent Variables
Relation To Other Topics • Last weeks: Approximate Inference • Today: Back to data pre-processing • Data representation/Feature extraction • "Model-free" analysis • Dimensionality reduction • The matrix • Link: We also have a (particularly easy) model of the underlying state of the world whose parameters we want to infer from the data
Take-home TLAs (Three-letter acronyms) • Although termed "continuous latent variables", we mainly deal with • PCA (Principal Component Analysis) • ICA (Independent Component Analysis) • Factor Analysis • General motivation/theme: "What is interesting about my data – but hidden (latent)? … And what is just noise?"
Importance Sampling ;-) • Publications concerning fMRI and (PCA or ICA or Factor Analysis) • Source: ISI Web of Knowledge, Dec 13th, 2010
Importance Sampling: fMRI Used for fMRI analysis, e.g. software package FSL: “MELODIC” MELODIC Tutorial: 2nd principal component (eigenimage) and corresponding time series of a visual block stimulation
Motivation: Low intrinsic dimensionality • Generating hand-written digit samples by translating and rotating one example 100 times • High-dimensional data (100 × 100 pixels, i.e. 10,000 dimensions) • Low degrees of freedom (1 rotation angle, 2 translations), i.e. an intrinsic dimensionality of only 3
Heuristic PCA: Projection View • 2D data projected onto a 1D line • How do we simplify or compress our data (make it low-dimensional) without losing actual information? • Dimensionality reduction by projecting onto a linear subspace
Heuristic PCA: Dimensionality Reduction • Advantages: • Reduced amount of data • Might be easier to reveal structure within the data (pattern recognition, data visualization)
Heuristic PCA: Maximum Variance View • We want to reduce the dimensionality of our data space via a linear projection, but we still want to keep the projected samples as different as possible. • A good measure for this difference is the data covariance, expressed by the matrix S = (1/N) Σ_n (x_n − x̄)(x_n − x̄)^T • Note: This expresses the covariance between different data dimensions, not between data points. • We now aim to maximize the variance of the projected data in the projection space spanned by the basis vectors u_1, …, u_M.
Maximum Variance View: The Maths • Maximum variance formulation of a 1D projection with projection vector u_1: maximize the projected variance u_1^T S u_1 • Constrained optimization with u_1^T u_1 = 1 (Lagrange multiplier λ_1) • Leads to the best projector being an eigenvector of S, the data covariance matrix: S u_1 = λ_1 u_1 • with maximum projected variance equal to the maximum eigenvalue: u_1^T S u_1 = λ_1
Heuristic PCA: Conclusion • By induction we obtain the general PCA result to maximize the variance of the data in the projected dimensions: the M projection vectors u_1, …, u_M shall be the eigenvectors corresponding to the M largest eigenvalues of the data covariance matrix S. • These vectors are called the principal components.
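Not part of the original slides, but as a concrete illustration, here is a minimal NumPy sketch of this result, using hypothetical toy data: build the sample covariance matrix, eigendecompose it, and project onto the leading eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points in D = 5 dimensions with correlated features.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

def pca(X, M):
    """Heuristic PCA: eigenvectors of the data covariance with the largest eigenvalues."""
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)       # S = 1/N sum_n (x_n - x_bar)(x_n - x_bar)^T
    eigval, eigvec = np.linalg.eigh(S)           # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]             # sort descending
    U_M = eigvec[:, order[:M]]                   # principal components, a D x M matrix
    lam = eigval[order]
    Z = (X - x_bar) @ U_M                        # projected (low-dimensional) data
    return U_M, lam, Z, x_bar

U_M, lam, Z, x_bar = pca(X, M=2)
print("eigenvalue spectrum:", np.round(lam, 4))
# Variance of the first projected coordinate should match the largest eigenvalue:
print(Z[:, 0].var(), "≈", lam[0])
```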
Heuristic PCA: Minimum error formulation • By projecting, we want to lose as little information as possible, i.e. keep the projected data points as similar to the raw data as possible. • Therefore we minimize the mean squared error J = (1/N) Σ_n ‖x_n − x̃_n‖² with respect to the projection vectors. • This leads to the same result as in the maximum variance formulation: the u_i shall be the eigenvectors corresponding to the largest eigenvalues of the data covariance matrix, and the minimal error equals the sum of the discarded eigenvalues.
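A quick numerical check of this equivalence (again a sketch with made-up toy data, not from the slides): the mean squared reconstruction error after projecting onto the top M eigenvectors should equal the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 6))    # hypothetical 6-D toy data
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

M = 2
U_M = eigvec[:, :M]
X_rec = x_bar + (X - x_bar) @ U_M @ U_M.T                  # project, then map back to data space
J = np.mean(np.sum((X - X_rec) ** 2, axis=1))              # mean squared reconstruction error
print(J, "≈", eigval[M:].sum())                            # equals the sum of discarded eigenvalues
```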
Eigenimages II • Christopher DeCoro, http://www.cs.princeton.edu/cdecoro/eigenfaces/
Probabilistic PCA: A synthesizer's view • p(z) = N(z | 0, I) – standardised normal distribution: independent latent variables with zero mean & unit variance • p(x | z) = N(x | Wz + μ, σ²I) – a spherical Gaussian, i.e. identical independent noise in each of the data dimensions • Prior predictive or marginal distribution of data points: p(x) = N(x | μ, C) with C = WW^T + σ²I
Probabilistic PCA: ML-solution • W_ML = U_M (L_M − σ²I)^(1/2) R – same as in heuristic PCA: U_M is the matrix of the first M eigenvectors of S, L_M the diagonal matrix of the corresponding eigenvalues • σ²_ML = 1/(D − M) Σ_{i=M+1..D} λ_i, the average variance of the discarded dimensions • Only specified up to a rotation R in latent space
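A hedged sketch of this closed-form ML solution in NumPy, with the rotation R chosen as the identity and hypothetical toy data generated from a 2-D latent model:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum-likelihood probabilistic PCA (sketch of the solution stated above)."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    sigma2 = eigval[M:].mean()                      # average variance of the discarded dimensions
    U_M, L_M = eigvec[:, :M], np.diag(eigval[:M])
    W = U_M @ np.sqrt(L_M - sigma2 * np.eye(M))     # rotation R = I chosen; solution only fixed up to R
    return W, mu, sigma2

# Hypothetical toy data from a 2-D latent model embedded in 5 dimensions:
rng = np.random.default_rng(2)
Z = rng.normal(size=(300, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(300, 5))
W, mu, sigma2 = ppca_ml(X, M=2)
print("estimated noise variance:", sigma2)          # should be close to 0.1**2 = 0.01
```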
Recap: The EM-algorithm • The Expectation-Maximization algorithm determines the Maximum Likelihood solution for our model parameters iteratively • Advantageous compared to a direct eigenvector decomposition if M ≪ D, i.e. if we have considerably fewer latent variables than data dimensions • Example: projection onto a very low-dimensional space, e.g. for data visualization (to two or three dimensions)
EM-Algorithm: Expectation Step • We consider the complete-data likelihood p(X, Z | μ, W, σ²); maximizing the marginal likelihood p(X | μ, W, σ²) instead would need an integration over latent space • E-Step: The posterior distribution of the latent variables p(Z | X, W, σ²) is updated and used to calculate the expected value of the complete-data log likelihood with respect to it, E_Z[ln p(X, Z | μ, W, σ²)] • keeping the current estimates of W and σ² fixed
EM-Algorithm: Maximization Step • M-Step: The calculated expectation E_Z[ln p(X, Z | μ, W, σ²)] is now maximized with respect to W and σ², • keeping the estimated posterior distribution of the latent variables fixed from the E-Step
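A minimal NumPy sketch of these two steps for probabilistic PCA, assuming a random initialisation of W and toy data invented for illustration; the updates follow the E- and M-step descriptions above and avoid the D × D eigendecomposition.

```python
import numpy as np

def ppca_em(X, M, n_iter=100):
    """EM for probabilistic PCA (sketch): alternate posterior moments and parameter updates."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)                        # centered data
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, M))                    # random initialisation (hypothetical choice)
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables, with W and sigma2 fixed
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))     # an M x M matrix
        Ez = Xc @ W @ Minv                                     # rows are E[z_n]^T  (N x M)
        sumEzz = N * sigma2 * Minv + Ez.T @ Ez                 # sum_n E[z_n z_n^T]
        # M-step: re-estimate W and sigma2 with the posterior moments fixed
        W_new = (Xc.T @ Ez) @ np.linalg.inv(sumEzz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(sumEzz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2

# Usage with hypothetical toy data (2-D latent structure in 5 dimensions):
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(300, 5))
W, sigma2 = ppca_em(X, M=2)
print("noise variance from EM:", sigma2)
```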
EM-algorithm for ML-PCA (spring analogy) • Green dots: data points, always fixed • Expectation: red rod is fixed, the cyan connections of the blue springs move, obeying spring forces • Maximization: cyan connections are fixed, the red rod moves, obeying spring forces
Bayesian PCA – Finding the real dimension • Maximum Likelihood vs. Bayesian PCA: estimated projection matrix W for an M-dimensional latent variable model, on synthetic data generated from a latent model with lower effective dimensionality • Estimating the effective dimensionality: introducing hyperparameters α_i (one per column of W) and marginalizing over W; columns with little support are switched off
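The following is only a rough sketch of this automatic-relevance-determination idea, not a faithful reproduction of the full Bayesian treatment on the slide: the M-step for W is regularised by one precision hyperparameter per column, and columns that explain little variance are driven towards zero, revealing the effective dimensionality. All data and parameter choices are hypothetical.

```python
import numpy as np

def bayesian_pca_ard(X, M, n_iter=200):
    """EM-style sketch of ARD Bayesian PCA: unused columns of W shrink towards zero."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, M))
    sigma2, alpha = 1.0, np.ones(M)
    for _ in range(n_iter):
        # E-step: same posterior moments as in ML probabilistic PCA
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv
        sumEzz = N * sigma2 * Minv + Ez.T @ Ez
        # M-step, now with a diagonal ARD penalty sigma2 * diag(alpha) on W
        W = (Xc.T @ Ez) @ np.linalg.inv(sumEzz + sigma2 * np.diag(alpha))
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W))
                  + np.trace(sumEzz @ W.T @ W)) / (N * D)
        # Re-estimate the hyperparameters: a large alpha_i switches column i off
        alpha = D / np.maximum(np.sum(W ** 2, axis=0), 1e-12)
    return W, alpha, sigma2

# Hypothetical data with effective latent dimensionality 3, fitted with M = 9:
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))
W, alpha, sigma2 = bayesian_pca_ard(X, M=9)
print("column norms of W:", np.round(np.linalg.norm(W, axis=0), 3))  # roughly 3 columns stay non-zero
```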
Factor Analysis: A non-spherical PCA • p(x | z) = N(x | Wz + μ, Ψ) with Ψ = diag(ψ_1, …, ψ_D) • Noise is still independent and Gaussian, but its variance may differ between data dimensions • Controversy: Do the factors (dimensions of z) have an interpretable meaning? • Problem: posterior invariant w.r.t. rotations of W
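To make the non-spherical noise concrete, here is a small sketch that samples from the factor-analysis model with a hypothetical diagonal noise covariance and then fits it with scikit-learn's EM-based FactorAnalysis (an external implementation, used here only for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Generative view of the factor-analysis model (hypothetical parameters):
rng = np.random.default_rng(5)
N, D, M = 500, 6, 2
W_true = rng.normal(size=(D, M))
psi_true = np.array([0.1, 0.5, 1.0, 0.2, 0.8, 0.3])    # diagonal (non-spherical) noise variances
Z = rng.normal(size=(N, M))                             # independent latent factors
X = Z @ W_true.T + rng.normal(size=(N, D)) * np.sqrt(psi_true)

# Fitting recovers a per-dimension noise variance, unlike spherical probabilistic PCA:
fa = FactorAnalysis(n_components=M).fit(X)
print("estimated noise variances:", np.round(fa.noise_variance_, 2))   # roughly comparable to psi_true
```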
Independent Component Analysis (ICA) • x = Wz with p(z) = Π_j p(z_j) – still a linear model of independent components • No data noise component, and dim(latent space) = dim(data space) • Explicitly non-Gaussian p(z_j): otherwise no separation of the mixing coefficients in W from the latent variables z would be possible (rotational symmetry of the Gaussian) • Maximization of non-Gaussianity/independence: different criteria, e.g. kurtosis, skewness, minimization of mutual information
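A short sketch of the classic unmixing demonstration, using scikit-learn's FastICA as a stand-in for the criteria listed above; the source signals and mixing matrix are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

# Two independent, clearly non-Gaussian sources that are linearly mixed;
# ICA tries to unmix them, while PCA only decorrelates.
rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),              # square wave (bimodal, sub-Gaussian)
          rng.laplace(size=t.size)]            # Laplacian noise (super-Gaussian)
A = np.array([[1.0, 0.5],
              [0.7, 1.0]])                     # mixing matrix
X = S @ A.T                                    # observed mixtures

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)   # estimated independent components
S_pca = PCA(n_components=2).fit_transform(X)                        # principal components, for comparison

# Absolute correlations between estimated and true sources (order and sign are arbitrary in ICA):
corr = np.abs(np.corrcoef(S_ica.T, S.T)[:2, 2:])
print("best match per true source:", np.round(corr.max(axis=0), 2))  # close to 1 if unmixing worked
```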
ICA vs PCA • Unsupervised methods: no class labels! • ICA rewards bi-modality (non-Gaussianity) of the projected distribution • PCA rewards maximum variance of the projected data • Figure: 1st independent component (ICA) vs. 1st principal component (PCA)
Relation To Other Topics • Today: data pre-processing • Whitening: via PCA the data covariance becomes the identity matrix • Data representation/Feature extraction • "Model-free" analysis – well: NO! We have seen the model assumptions in probabilistic PCA • Dimensionality reduction via projection onto the basis vectors carrying the most variance/leaving the smallest error – at least for linear models, not for kernel PCA • The matrix
Kernel PCA • Instead of the sample covariance matrix S, we now consider a covariance matrix in a feature space, C = (1/N) Σ_n φ(x_n) φ(x_n)^T • As always, the kernel trick of not computing φ(x) in the high-dimensional feature space works, because the covariance matrix only needs scalar products of the φ(x_n), i.e. the kernel k(x_n, x_m) = φ(x_n)^T φ(x_m)
Kernel PCA – Example: Gaussian kernel k(x, x') = exp(−‖x − x'‖² / (2σ²)) • Kernel PCA does not enable dimensionality reduction via a projection back into data space • The image {φ(x)} is a manifold in feature space, not a linear subspace • The PCA projects onto linear subspaces in feature space • These projected elements typically do not lie on the manifold, so their pre-images will not exist in data space
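For completeness, a NumPy sketch of kernel PCA with a Gaussian kernel, operating entirely on the centred kernel matrix as described above; the two-ring data set is a hypothetical example of structure that linear PCA cannot capture along a single axis.

```python
import numpy as np

def gaussian_kernel_pca(X, M, sigma=1.0):
    """Kernel PCA sketch: eigendecompose the centred Gaussian kernel matrix
    and return the projections onto the first M nonlinear components."""
    N = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    K = np.exp(-sq / (2.0 * sigma ** 2))                          # kernel = feature-space scalar products
    one = np.full((N, N), 1.0 / N)
    K_c = K - one @ K - K @ one + one @ K @ one                   # centre phi(x) in feature space
    eigval, eigvec = np.linalg.eigh(K_c)
    order = np.argsort(eigval)[::-1][:M]
    eigval, eigvec = eigval[order], eigvec[:, order]
    A = eigvec / np.sqrt(np.maximum(eigval, 1e-12))               # scale so feature-space eigenvectors have unit norm
    return K_c @ A                                                # N x M matrix of projections

# Hypothetical example: two concentric rings in 2-D.
rng = np.random.default_rng(7)
angles = rng.uniform(0, 2 * np.pi, size=300)
radii = np.r_[np.ones(150), 3 * np.ones(150)] + 0.05 * rng.normal(size=300)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
Z = gaussian_kernel_pca(X, M=2, sigma=1.0)
# The two rings typically end up at clearly different values of the first component:
print(np.round(Z[:150, 0].mean(), 2), "vs.", np.round(Z[150:, 0].mean(), 2))
```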