Lars Kasper, December 15th 2010 Pattern Recognition and Machine Learning Chapter 12: Continuous Latent Variables
Relation To Other Topics • Last weeks: Approximate Inference • Today: Back to data pre-processing • Data representation/Feature extraction • "Model-free" analysis • Dimensionality reduction • The matrix • Link: We also have a (particularly easy) model of the underlying state of the world whose parameters we want to infer from the data
Take-home TLAs (Three-letter acronyms) • Although termed "continuous latent variables", we mainly deal with • PCA (Principal Component Analysis) • ICA (Independent Component Analysis) • Factor Analysis • General motivation/theme: "What is interesting about my data – but hidden (latent)? … And what is just noise?"
Importance Sampling ;-) • Publications concerning fMRI and (PCA or ICA or Factor Analysis) • Source: ISI Web of Knowledge, Dec 13th, 2010
Importance Sampling: fMRI Used for fMRI analysis, e.g. software package FSL: “MELODIC” MELODIC Tutorial: 2nd principal component (eigenimage) and corresponding time series of a visual block stimulation
Motivation: Low intrinsic dimensionality • Generating hand-written digit samples by translating and rotating one example 100 times • High-dimensional data (100 × 100 pixels, i.e. 10,000 dimensions) • Low degrees of freedom (1 rotation angle, 2 translations), i.e. an intrinsic dimensionality of only 3
Heuristic PCA: Projection View • 2D data projected onto a 1D line • How do we simplify or compress our data (make it low-dimensional) without losing actual information? • Dimensionality reduction by projecting onto a linear subspace
Heuristic PCA: Dimensionality Reduction • Advantages: • Reduced amount of data • Might be easier to reveal structure within the data (pattern recognition, data visualization)
Heuristic PCA: Maximum Variance View • We want to reduce the dimensionality of our data space via a linear projection, but we still want to keep the projected samples as different as possible. • A good measure for this difference is the data covariance, expressed by the matrix S = (1/N) Σ_n (x_n − x̄)(x_n − x̄)^T • Note: This expresses the covariance between different data dimensions, not between data points. • We now aim to maximize the variance of the projected data in the projection space spanned by the basis vectors u_1, …, u_M.
Maximum Variance View: The Maths • Maximum variance formulation of a 1D projection with projection vector u_1: maximize the projected variance u_1^T S u_1 • Constrained optimization with u_1^T u_1 = 1 (Lagrange multiplier λ_1) • Leads to the best projector being an eigenvector of S, the data covariance matrix: S u_1 = λ_1 u_1 • with maximum projected variance equal to the maximum eigenvalue: u_1^T S u_1 = λ_1
Heuristic PCA: Conclusion • By induction we obtain the general PCA result to maximize the variance of the data in the projected dimensions: the M projection vectors u_1, …, u_M shall be the eigenvectors corresponding to the M largest eigenvalues of the data covariance matrix S. • These vectors are called the principal components.
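Not part of the original slides, but as a concrete illustration, here is a minimal NumPy sketch of this result, using hypothetical toy data: build the sample covariance matrix, eigendecompose it, and project onto the leading eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points in D = 5 dimensions with correlated features.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

def pca(X, M):
    """Heuristic PCA: eigenvectors of the data covariance with the largest eigenvalues."""
    x_bar = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)       # S = 1/N sum_n (x_n - x_bar)(x_n - x_bar)^T
    eigval, eigvec = np.linalg.eigh(S)           # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]             # sort descending
    U_M = eigvec[:, order[:M]]                   # principal components, a D x M matrix
    lam = eigval[order]
    Z = (X - x_bar) @ U_M                        # projected (low-dimensional) data
    return U_M, lam, Z, x_bar

U_M, lam, Z, x_bar = pca(X, M=2)
print("eigenvalue spectrum:", np.round(lam, 4))
# Variance of the first projected coordinate should match the largest eigenvalue:
print(Z[:, 0].var(), "≈", lam[0])
```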
Heuristic PCA: Minimum error formulation • By projecting, we want to lose as little information as possible, i.e. keep the projected data points as similar to the raw data as possible. • Therefore we minimize the mean squared error J = (1/N) Σ_n ‖x_n − x̃_n‖² with respect to the projection vectors. • This leads to the same result as in the maximum variance formulation: the u_i shall be the eigenvectors corresponding to the largest eigenvalues of the data covariance matrix, and the minimal error equals the sum of the discarded eigenvalues.
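A quick numerical check of this equivalence (again a sketch with made-up toy data, not from the slides): the mean squared reconstruction error after projecting onto the top M eigenvectors should equal the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 6))    # hypothetical 6-D toy data
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)
eigval, eigvec = np.linalg.eigh(S)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

M = 2
U_M = eigvec[:, :M]
X_rec = x_bar + (X - x_bar) @ U_M @ U_M.T                  # project, then map back to data space
J = np.mean(np.sum((X - X_rec) ** 2, axis=1))              # mean squared reconstruction error
print(J, "≈", eigval[M:].sum())                            # equals the sum of discarded eigenvalues
```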
Eigenimages II • Christopher DeCoro, http://www.cs.princeton.edu/cdecoro/eigenfaces/
Probabilistic PCA: A synthesizer's view • p(z) = N(z | 0, I) – standardised normal distribution: independent latent variables with zero mean & unit variance • p(x | z) = N(x | Wz + μ, σ²I) – a spherical Gaussian, i.e. identical independent noise in each of the data dimensions • Prior predictive or marginal distribution of data points: p(x) = N(x | μ, C) with C = WW^T + σ²I
Probabilistic PCA: ML-solution • W_ML = U_M (L_M − σ²I)^(1/2) R – same as in heuristic PCA: U_M is the matrix of the first M eigenvectors of S, L_M the diagonal matrix of the corresponding eigenvalues • σ²_ML = 1/(D − M) Σ_{i=M+1..D} λ_i, the average variance of the discarded dimensions • Only specified up to a rotation R in latent space
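A hedged sketch of this closed-form ML solution in NumPy, with the rotation R chosen as the identity and hypothetical toy data generated from a 2-D latent model:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form maximum-likelihood probabilistic PCA (sketch of the solution stated above)."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    sigma2 = eigval[M:].mean()                      # average variance of the discarded dimensions
    U_M, L_M = eigvec[:, :M], np.diag(eigval[:M])
    W = U_M @ np.sqrt(L_M - sigma2 * np.eye(M))     # rotation R = I chosen; solution only fixed up to R
    return W, mu, sigma2

# Hypothetical toy data from a 2-D latent model embedded in 5 dimensions:
rng = np.random.default_rng(2)
Z = rng.normal(size=(300, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(300, 5))
W, mu, sigma2 = ppca_ml(X, M=2)
print("estimated noise variance:", sigma2)          # should be close to 0.1**2 = 0.01
```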
Recap: The EM-algorithm • The Expectation-Maximization algorithm determines the Maximum Likelihood solution for our model parameters iteratively • Advantageous compared to a direct eigenvector decomposition if M ≪ D, i.e. if we have considerably fewer latent variables than data dimensions • Example: projection onto a very low-dimensional space, e.g. for data visualization (to two or three dimensions)
EM-Algorithm: Expectation Step • We consider the complete-data likelihood p(X, Z | μ, W, σ²); maximizing the marginal likelihood p(X | μ, W, σ²) instead would need an integration over latent space • E-Step: The posterior distribution of the latent variables p(Z | X, W, σ²) is updated and used to calculate the expected value of the complete-data log likelihood with respect to it, E_Z[ln p(X, Z | μ, W, σ²)] • keeping the current estimates of W and σ² fixed
EM-Algorithm: Maximization Step • M-Step: The calculated expectation E_Z[ln p(X, Z | μ, W, σ²)] is now maximized with respect to W and σ², • keeping the estimated posterior distribution of the latent variables fixed from the E-Step
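A minimal NumPy sketch of these two steps for probabilistic PCA, assuming a random initialisation of W and toy data invented for illustration; the updates follow the E- and M-step descriptions above and avoid the D × D eigendecomposition.

```python
import numpy as np

def ppca_em(X, M, n_iter=100):
    """EM for probabilistic PCA (sketch): alternate posterior moments and parameter updates."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)                        # centered data
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, M))                    # random initialisation (hypothetical choice)
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables, with W and sigma2 fixed
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))     # an M x M matrix
        Ez = Xc @ W @ Minv                                     # rows are E[z_n]^T  (N x M)
        sumEzz = N * sigma2 * Minv + Ez.T @ Ez                 # sum_n E[z_n z_n^T]
        # M-step: re-estimate W and sigma2 with the posterior moments fixed
        W_new = (Xc.T @ Ez) @ np.linalg.inv(sumEzz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(sumEzz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2

# Usage with hypothetical toy data (2-D latent structure in 5 dimensions):
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(300, 5))
W, sigma2 = ppca_em(X, M=2)
print("noise variance from EM:", sigma2)
```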
EM-algorithm for ML-PCA (spring analogy) • Green dots: data points, always fixed • Expectation: red rod is fixed, the cyan connections of the blue springs move, obeying spring forces • Maximization: cyan connections are fixed, the red rod moves, obeying spring forces
Bayesian PCA – Finding the real dimension • Maximum Likelihood vs. Bayesian PCA: estimated projection matrix W for an M-dimensional latent variable model, on synthetic data generated from a latent model with lower effective dimensionality • Estimating the effective dimensionality: introducing hyperparameters α_i (one per column of W) and marginalizing over W; columns with little support are switched off
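The following is only a rough sketch of this automatic-relevance-determination idea, not a faithful reproduction of the full Bayesian treatment on the slide: the M-step for W is regularised by one precision hyperparameter per column, and columns that explain little variance are driven towards zero, revealing the effective dimensionality. All data and parameter choices are hypothetical.

```python
import numpy as np

def bayesian_pca_ard(X, M, n_iter=200):
    """EM-style sketch of ARD Bayesian PCA: unused columns of W shrink towards zero."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    rng = np.random.default_rng(0)
    W = rng.normal(size=(D, M))
    sigma2, alpha = 1.0, np.ones(M)
    for _ in range(n_iter):
        # E-step: same posterior moments as in ML probabilistic PCA
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv
        sumEzz = N * sigma2 * Minv + Ez.T @ Ez
        # M-step, now with a diagonal ARD penalty sigma2 * diag(alpha) on W
        W = (Xc.T @ Ez) @ np.linalg.inv(sumEzz + sigma2 * np.diag(alpha))
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W))
                  + np.trace(sumEzz @ W.T @ W)) / (N * D)
        # Re-estimate the hyperparameters: a large alpha_i switches column i off
        alpha = D / np.maximum(np.sum(W ** 2, axis=0), 1e-12)
    return W, alpha, sigma2

# Hypothetical data with effective latent dimensionality 3, fitted with M = 9:
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))
W, alpha, sigma2 = bayesian_pca_ard(X, M=9)
print("column norms of W:", np.round(np.linalg.norm(W, axis=0), 3))  # roughly 3 columns stay non-zero
```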
Factor Analysis: A non-spherical PCA • p(x | z) = N(x | Wz + μ, Ψ) with Ψ = diag(ψ_1, …, ψ_D) • Noise is still independent and Gaussian, but its variance may differ between data dimensions • Controversy: Do the factors (dimensions of z) have an interpretable meaning? • Problem: posterior invariant w.r.t. rotations of W
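To make the non-spherical noise concrete, here is a small sketch that samples from the factor-analysis model with a hypothetical diagonal noise covariance and then fits it with scikit-learn's EM-based FactorAnalysis (an external implementation, used here only for illustration):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Generative view of the factor-analysis model (hypothetical parameters):
rng = np.random.default_rng(5)
N, D, M = 500, 6, 2
W_true = rng.normal(size=(D, M))
psi_true = np.array([0.1, 0.5, 1.0, 0.2, 0.8, 0.3])    # diagonal (non-spherical) noise variances
Z = rng.normal(size=(N, M))                             # independent latent factors
X = Z @ W_true.T + rng.normal(size=(N, D)) * np.sqrt(psi_true)

# Fitting recovers a per-dimension noise variance, unlike spherical probabilistic PCA:
fa = FactorAnalysis(n_components=M).fit(X)
print("estimated noise variances:", np.round(fa.noise_variance_, 2))   # roughly comparable to psi_true
```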
Independent Component Analysis (ICA) • x = Wz with p(z) = Π_j p(z_j) – still a linear model of independent components • No data noise component, and dim(latent space) = dim(data space) • Explicitly non-Gaussian p(z_j): otherwise no separation of the mixing coefficients in W from the latent variables z would be possible (rotational symmetry of the Gaussian) • Maximization of non-Gaussianity/independence: different criteria, e.g. kurtosis, skewness, minimization of mutual information
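A short sketch of the classic unmixing demonstration, using scikit-learn's FastICA as a stand-in for the criteria listed above; the source signals and mixing matrix are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

# Two independent, clearly non-Gaussian sources that are linearly mixed;
# ICA tries to unmix them, while PCA only decorrelates.
rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sign(np.sin(3 * t)),              # square wave (bimodal, sub-Gaussian)
          rng.laplace(size=t.size)]            # Laplacian noise (super-Gaussian)
A = np.array([[1.0, 0.5],
              [0.7, 1.0]])                     # mixing matrix
X = S @ A.T                                    # observed mixtures

S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)   # estimated independent components
S_pca = PCA(n_components=2).fit_transform(X)                        # principal components, for comparison

# Absolute correlations between estimated and true sources (order and sign are arbitrary in ICA):
corr = np.abs(np.corrcoef(S_ica.T, S.T)[:2, 2:])
print("best match per true source:", np.round(corr.max(axis=0), 2))  # close to 1 if unmixing worked
```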
ICA vs PCA • Unsupervised methods: no class labels! • ICA rewards bi-modality (non-Gaussianity) of the projected distribution • PCA rewards maximum variance of the projected data • Figure: 1st independent component (ICA) vs. 1st principal component (PCA)
Relation To Other Topics • Today: data pre-processing • Whitening: via PCA the data covariance becomes the identity matrix • Data representation/Feature extraction • "Model-free" analysis – well: NO! We have seen the model assumptions in probabilistic PCA • Dimensionality reduction via projection onto the basis vectors carrying the most variance/leaving the smallest error – at least for linear models, not for kernel PCA • The matrix
Kernel PCA • Instead of the sample covariance matrix S, we now consider a covariance matrix in a feature space, C = (1/N) Σ_n φ(x_n) φ(x_n)^T • As always, the kernel trick of not computing φ(x) in the high-dimensional feature space works, because the covariance matrix only needs scalar products of the φ(x_n), i.e. the kernel k(x_n, x_m) = φ(x_n)^T φ(x_m)
Kernel PCA – Example: Gaussian kernel k(x, x') = exp(−‖x − x'‖² / (2σ²)) • Kernel PCA does not enable dimensionality reduction via a projection back into data space • The image {φ(x)} is a manifold in feature space, not a linear subspace • The PCA projects onto linear subspaces in feature space • These projected elements typically do not lie on the manifold, so their pre-images will not exist in data space
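For completeness, a NumPy sketch of kernel PCA with a Gaussian kernel, operating entirely on the centred kernel matrix as described above; the two-ring data set is a hypothetical example of structure that linear PCA cannot capture along a single axis.

```python
import numpy as np

def gaussian_kernel_pca(X, M, sigma=1.0):
    """Kernel PCA sketch: eigendecompose the centred Gaussian kernel matrix
    and return the projections onto the first M nonlinear components."""
    N = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    K = np.exp(-sq / (2.0 * sigma ** 2))                          # kernel = feature-space scalar products
    one = np.full((N, N), 1.0 / N)
    K_c = K - one @ K - K @ one + one @ K @ one                   # centre phi(x) in feature space
    eigval, eigvec = np.linalg.eigh(K_c)
    order = np.argsort(eigval)[::-1][:M]
    eigval, eigvec = eigval[order], eigvec[:, order]
    A = eigvec / np.sqrt(np.maximum(eigval, 1e-12))               # scale so feature-space eigenvectors have unit norm
    return K_c @ A                                                # N x M matrix of projections

# Hypothetical example: two concentric rings in 2-D.
rng = np.random.default_rng(7)
angles = rng.uniform(0, 2 * np.pi, size=300)
radii = np.r_[np.ones(150), 3 * np.ones(150)] + 0.05 * rng.normal(size=300)
X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
Z = gaussian_kernel_pca(X, M=2, sigma=1.0)
# The two rings typically end up at clearly different values of the first component:
print(np.round(Z[:150, 0].mean(), 2), "vs.", np.round(Z[150:, 0].mean(), 2))
```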