
LING 696B: PCA and other linear projection methods


Presentation Transcript


  1. LING 696B: PCA and other linear projection methods

  2. Curse of dimensionality • The higher the dimension, the more data is needed to draw any conclusion • Probability density estimation: • Continuous: histograms • Discrete: k-factorial designs • Decision rules: • Nearest-neighbor and K-nearest neighbor

  3. How to reduce dimension? • Assume we know something about the distribution • Parametric approach: assume the data follow distributions within a family H • Example: a counting histogram for 10-D data needs lots of bins, but knowing it’s a pancake allows us to fit a Gaussian • (Number of bins)^10 vs. (10 + 10*11/2) parameters
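
To make the comparison concrete, here is a quick back-of-the-envelope calculation (assuming, for illustration, 10 bins per dimension):

```python
# Rough parameter counts for estimating a density in 10 dimensions
bins_per_dim = 10
histogram_cells = bins_per_dim ** 10      # 10,000,000,000 cells, each needing data
gaussian_params = 10 + 10 * 11 // 2       # 10 means + 55 covariance entries = 65
print(histogram_cells, gaussian_params)
```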

  4. Linear dimension reduction • Pancake/Gaussian assumption is crucial for linear methods • Examples: • Principal Components Analysis • Multidimensional Scaling • Factor Analysis

  5. Covariance structure of multivariate Gaussian • 2-dimensional example • No correlations --> diagonal covariance matrix, e.g. Σ = diag(σ1^2, σ2^2) (diagonal entries: variance in each dimension; off-diagonal entries: correlation between dimensions) • Special case: Σ = I • Negative log likelihood ∝ squared Euclidean distance to the center

  6. Covariance structure of multivariate Gaussian • Non-zero correlations --> full covariance matrix, Cov(X1, X2) ≠ 0, e.g. a 2×2 Σ with non-zero off-diagonal entries • Nice property of Gaussians: closed under linear transformation • This means we can remove correlation by rotation

  7. Covariance structure of multivariate Gaussian • Rotation matrix: R = (w1, w2), where w1, w2 are two unit vectors perpendicular to each other • Rotation by 90 degrees • Rotation by 45 degrees • (Figures: the original and rotated axes w1, w2 under each rotation)

  8. Covariance structure of multivariate Gaussian • Matrix diagonalization: any 2×2 covariance matrix A can be written as A = R Λ R^T, where R is a rotation matrix and Λ is diagonal • Interpretation: we can always find a rotation to make the covariance look “nice” -- no correlation between dimensions • This IS PCA when applied to N dimensions
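
As a minimal illustration of the diagonalization (the covariance values below are made up), NumPy's eigendecomposition of a symmetric matrix gives exactly this rotation-plus-diagonal factorization:

```python
import numpy as np

# A made-up 2x2 covariance matrix with correlated dimensions
A = np.array([[2.0, 1.2],
              [1.2, 1.0]])

eigvals, R = np.linalg.eigh(A)   # columns of R are orthonormal eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(A, R @ Lam @ R.T))     # True: A = R Lam R^T
print(np.allclose(R.T @ R, np.eye(2)))   # True: R is orthogonal (a rotation, possibly with reflection)
```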

  9. Computation of PCA • (Figure: the 3-D case has 3 coordinates w1, w2, w3) • The new coordinates uniquely identify the rotation • In computation, it’s easier to identify one coordinate at a time • Step 1: centering the data: X <-- X - mean(X) • Want to rotate around the center

  10. Computation of PCA • Step 2: finding a direction of projection that has the maximal “stretch” • Linear projection of X onto a vector w: Proj_w(X) = X_{N×d} * w_{d×1} (X centered) • Now measure the stretch • This is the sample variance Var(X*w)
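
A small sketch of Step 2 (the data and the candidate direction are made up for illustration): project the centered data onto a unit vector w and measure the variance of the projection.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D "pancake" data with correlated dimensions
X = rng.multivariate_normal([3.0, -1.0], [[2.0, 1.2], [1.2, 1.0]], size=200)

X = X - X.mean(axis=0)        # Step 1: center the data

w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)     # a candidate direction (unit vector)

proj = X @ w                  # Proj_w(X): N projection lengths
print(proj.var())             # the "stretch" of the data along w
```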

  11. Computation of PCA • Step 3: formulate this as a constrained optimization problem • Objective of optimization: Var(X*w) • Need a constraint on w: ||w|| = 1 (otherwise the variance can explode); only consider the direction • So formally: find the w that achieves max_{||w||=1} Var(X*w)

  12. Computation of PCA • Some algebra (homework): Var(x) = E[(x - E[x])^2] = E[x^2] - (E[x])^2 • Apply to matrices (homework): Var(X*w) = w^T X^T X w = w^T Cov(X) w (why?) • Cov(X) is a d×d matrix (homework) • Symmetric (easy) • For any y, y^T Cov(X) y >= 0 (tricky)

  13. Computation of PCA • Going back to the optimization problem: max_{||w||=1} Var(X*w) = max_{||w||=1} w^T Cov(X) w • The answer w1 is the eigenvector of Cov(X) with the largest eigenvalue (the maximum variance attained is that eigenvalue) -- the first Principal Component! (see demo)
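
Continuing the sketch above: the maximizer comes from the eigendecomposition of the sample covariance, and the variance attained along that direction equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=200)
X = X - X.mean(axis=0)

C = X.T @ X / len(X)                 # sample covariance, d x d
eigvals, eigvecs = np.linalg.eigh(C) # eigenvalues in ascending order

w1 = eigvecs[:, -1]                  # eigenvector with the largest eigenvalue
print(w1)                            # the first principal component
print((X @ w1).var(), eigvals[-1])   # variance along w1 = largest eigenvalue
```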

  14. More principal components • We keep looking among all the projections perpendicular to w1 • Formally: max_{||w2||=1, w2 ⊥ w1} w2^T Cov(X) w2 • This turns out to be another eigenvector, the one corresponding to the 2nd largest eigenvalue (see demo) • w1 and w2 give the new coordinates!

  15. Rotation • Can keep going until we find all projections/coordinates w1, w2, …, wd • Putting them together, we have a big matrix W = (w1, w2, …, wd) • W is called an orthogonal matrix • This corresponds to a rotation (sometimes plus reflection) of the pancake • The rotated pancake has no correlation between dimensions (see demo)
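
A quick check of this (same kind of made-up data as above): stacking the eigenvectors gives W, W^T W = I, and the covariance of the rotated data is (numerically) diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=200)
X = X - X.mean(axis=0)

C = X.T @ X / len(X)
_, W = np.linalg.eigh(C)                      # W = (w1, ..., wd), orthonormal columns

print(np.allclose(W.T @ W, np.eye(2)))        # True: W is orthogonal
Y = X @ W                                     # data in the rotated coordinates
print(np.round(np.cov(Y, rowvar=False), 3))   # off-diagonal entries ~ 0: no correlation
```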

  16. When does dimension reduction occur? • Decomposition of the covariance matrix: Cov(X) = λ1 w1 w1^T + λ2 w2 w2^T + … + λd wd wd^T • If only the first few eigenvalues λi are significant, we can ignore the rest, e.g. keep only the 2-D coordinates of X

  17. Measuring the “degree” of reduction • (Figure: pancake data in 3-D, with its two projected coordinates a1, a2)
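
One way to quantify this, sketched on made-up 3-D pancake data: look at what fraction of the total variance each eigenvalue accounts for.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up 3-D pancake: two large axes of variation, one tiny one
X = rng.normal(size=(500, 3)) * np.array([5.0, 3.0, 0.2])
X = X - X.mean(axis=0)

eigvals = np.linalg.eigvalsh(X.T @ X / len(X))[::-1]   # descending eigenvalues
explained = eigvals / eigvals.sum()

print(np.round(explained, 3))              # the first two directions carry nearly all the variance
print(np.round(np.cumsum(explained), 3))   # cumulative fraction: how many PCs are "significant"
```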

  18. Reconstruction from principal components • Perfect reconstruction (x centered): x = (x^T w1) w1 + (x^T w2) w2 + … + (x^T wd) wd -- many pieces, each a length of projection times a direction • Reconstruction error: keep only the bigger pieces (the first few terms) and measure how far the result is from x • Minimizing this error is another formulation of PCA
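
A minimal sketch of reconstruction from only the first k principal components (same kind of made-up pancake data), showing that the error is small when the dropped eigenvalues are small:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * np.array([5.0, 3.0, 0.2])
X = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(X.T @ X / len(X))
W = W[:, ::-1]                   # columns ordered by decreasing eigenvalue

k = 2
Wk = W[:, :k]                    # keep only the "bigger pieces"
X_hat = (X @ Wk) @ Wk.T          # lengths of projection times directions, summed

err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(err)                       # average squared reconstruction error (small here)
```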

  19. A creative interpretation/implementation of PCA • Any x can be reconstructed from the principal components (the PCs form a basis for the whole space) • (Figure: a network with input X, a hidden layer of “neural firing” that computes the “encoding” through connection weights W, and an output that reconstructs X) • When (# of hidden units) < (# of input units), the network does dimension reduction • This can be used to implement PCA
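
A rough sketch of the network view, assuming a tiny linear "autoencoder" with tied weights trained by gradient descent on made-up data: with fewer hidden units than inputs, the learned weights end up spanning the same subspace as the leading principal components (this is a sketch of the idea, not the lecture's demo).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) * np.array([5.0, 3.0, 0.2])   # made-up 3-D data
X = X - X.mean(axis=0)

d, p = 3, 2                                  # 3 inputs, 2 hidden units -> reduction
W = rng.normal(scale=0.1, size=(d, p))       # connection weights
lr = 1e-3

for _ in range(5000):
    H = X @ W                                # hidden "encoding" of each input
    X_hat = H @ W.T                          # output: reconstruction of the input
    E = X - X_hat                            # reconstruction error
    grad = -2 * (X.T @ E @ W + E.T @ X @ W) / len(X)
    W -= lr * grad

# Compare the learned subspace with the span of the top-2 principal components:
Q, _ = np.linalg.qr(W)                       # orthonormal basis for the learned weights
_, V = np.linalg.eigh(X.T @ X / len(X))
print(np.linalg.svd(Q.T @ V[:, -2:], compute_uv=False))   # values near 1 = same subspace
```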

  20. An intuitive application of PCA: (Story and Titze) and others • Vocal tract measurements are high-dimensional (different articulators) • Measurements from different positions are correlated • Underlying geometry: a few articulatory parameters, possibly pancake-like after collapsing a number of different sounds • Big question: relate low-dimensional articulatory parameters (tongue shape) to low-dimensional acoustic parameters (F1/F2)

  21. Story and Titze’s application of PCA • Source data: area function data obtained from MRI (d=44) • Step 1: Calculate the mean • Interestingly, the mean produces a schwa-like frequency response

  22. Story and Titze’s application of PCA • Step 2: subtract the mean from the area function (center the data) • Step 3: form the covariance matrix R = X^T X (a d×d matrix), where X is the centered data matrix

  23. Story and Titze’s application of PCA • Step 4: eigen-decomposition of the covariance matrix to get the PCs • Story calls them “empirical modes” • Length of projection: the coefficient of each mode, x^T wi • Reconstruction: the mean area function plus a weighted sum of the first few modes

  24. Story and Titze’s application of PCA • Story’s principal components (figure: positive values = expansion, negative values = constriction) • The first 2 PCs can do most of the reconstruction • Can be seen as a perturbation of the overall tongue shape (from the mean)

  25. Story and Titze’s application of PCA • The principal components are interpretable as control parameters • The acoustic-to-articulatory mapping is almost one-to-one after dimension reduction

  26. Applying PCA to ultrasound data? • Another imaging technique • Generate a tongue profile similar to X-ray and MRI • High-dimensional • Correlated • Need dimension reduction to interpret articulatory parameters • See demo

  27. An unintuitive application of PCA • Latent Semantic Indexing in document retrieval • Documents as vectors of word counts • Try to extract some “features” by linear combination of word counts • The underlying geometry is unclear (mean? distance?) • The meaning of the principal components is unclear (rotation?) • (Figure: documents plotted in word-count space with axes “market”, “stock”, “bonds”)
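
A hedged sketch of the LSI idea, using a tiny made-up document-by-word count matrix and scikit-learn's TruncatedSVD (the usual computational route for LSI); the words and counts are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Made-up counts; columns: "market", "stock", "bonds", "tongue", "vowel"
counts = np.array([
    [5, 7, 3, 0, 0],
    [4, 6, 2, 0, 1],
    [0, 1, 0, 6, 8],
    [0, 0, 1, 7, 5],
])

lsi = TruncatedSVD(n_components=2, random_state=0)
doc_features = lsi.fit_transform(counts)      # each document as 2 "latent" features

print(np.round(doc_features, 2))              # finance-like vs. phonetics-like documents separate
print(np.round(lsi.components_, 2))           # each feature = a linear combination of word counts
```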

  28. Summary of PCA: • PCA looks for: • A sequence of linear, orthogonal projections that reveal interesting structure in data (rotation) • Defining “interesting”: • Maximal variance under each projection • Uncorrelated structure after projection

  29. Departure from PCA • 3 directions of divergence • Other definitions of “interesting”? • Linear Discriminant Analysis • Independent Component Analysis • Other methods of projection? • Linear but not orthogonal: sparse coding • Implicit, non-linear mapping • Turning PCA into a generative model • Factor Analysis

  30. Re-thinking “interestingness” • It all depends on what you want • Linear Discriminant Analysis (LDA): supervised learning • Example: separating 2 classes (figure: the direction of maximal separation vs. the direction of maximal variance)
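
For contrast with PCA, a minimal scikit-learn sketch on made-up 2-class data where the high-variance direction is not the separating one; LDA picks the direction of maximal separation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
# Two made-up classes: separated along axis 1, but most variance is along axis 2
cov = [[0.5, 0.0], [0.0, 9.0]]
X = np.vstack([rng.multivariate_normal([-2.0, 0.0], cov, 100),
               rng.multivariate_normal([+2.0, 0.0], cov, 100)])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis(n_components=1)   # 2 classes -> at most 1 discriminant
z = lda.fit_transform(X, y)                        # projection of maximal separation

print(lda.scalings_.ravel())   # the LDA direction: close to axis 1, not the high-variance axis 2
```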

  31. Re-thinking “interestingness” • Most high-dimensional data look Gaussian under linear projections • Maybe non-Gaussian is more interesting • Independent Component Analysis • Projection pursuit • Example: ICA projection of 2-class data -- find the direction most unlike a Gaussian (e.g. by maximizing kurtosis)
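
A sketch of the non-Gaussian idea using scikit-learn's FastICA on made-up mixed signals (FastICA optimizes a measure of non-Gaussianity; maximizing kurtosis is one classic choice):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(4)
# Made-up sources: one strongly non-Gaussian (bimodal), one Gaussian; mix them linearly
s1 = np.concatenate([rng.normal(-3, 0.5, 500), rng.normal(3, 0.5, 500)])
s2 = rng.normal(0.0, 1.0, 1000)
S = np.column_stack([s1, s2])
X = S @ np.array([[0.8, 0.6], [-0.4, 0.9]])     # observed (mixed) data

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)                # directions that look least Gaussian

# The bimodal source reappears (up to sign and scale) in one recovered component
print(np.round(np.corrcoef(recovered.T, S.T), 2))
```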

  32. The “efficient coding” perspective • Sparse coding: projections do not have to be orthogonal, and there can be more basis vectors than the dimension of the space • (Figure: a point x expressed in terms of basis vectors w1, w2, w3, w4) • Basis expansion with p basis vectors in d dimensions: p << d gives compact coding (PCA); p > d gives sparse coding • Neural interpretation (Dana Ballard’s talk last week)

  33. “Interesting” can be expensive • Often faces difficult optimization problems • Need many constraints • Lots of parameter sharing • Expensive to compute, no longer an eigenvalue problem

  34. PCA’s relatives: Factor Analysis • PCA is not a generative model: reconstruction error is not a likelihood • Sensitive to outliers • Hard to build into bigger models • Factor Analysis: add measurement noise to account for variability: observation x = Λz + ε, where Λ is the loading matrix (scaled PCs), the factors z are a spherical Gaussian N(0, I), and the measurement noise ε is N(0, R) with R diagonal
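
A minimal sketch of the model with scikit-learn's FactorAnalysis, on data generated from the factor-analysis model itself (loading matrix, factors, and noise all made up); `components_` plays the role of the loading matrix and `noise_variance_` is the diagonal of R.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)
Lam = rng.normal(size=(5, 2))              # made-up 5x2 loading matrix
Z = rng.normal(size=(1000, 2))             # factors ~ N(0, I)
noise = rng.normal(size=(1000, 5)) * 0.1   # diagonal measurement noise
X = Z @ Lam.T + noise                      # observations x = Lam z + noise

fa = FactorAnalysis(n_components=2)
fa.fit(X)

print(fa.components_.shape)                # (2, 5): estimated loadings (identifiable up to rotation)
print(np.round(fa.noise_variance_, 3))     # estimated diagonal of R (around 0.01 here)
```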

  35. PCA’s relatives: Factor Analysis • Generative view: sphere --> stretch and rotate --> add noise • Learning: a version of the EM algorithm (see demo and synthesis)

  36. Mixture of Factor Analyzers • Same intuition as other mixture models: there may be several pancakes out there, each with its own center/rotation

  37. PCA’s relatives: Metric multidimensional scaling • Approaches the problem in a different way • No measurements of the stimuli themselves, only pairwise “distances” between stimuli • Intends to recover some psychological space for the stimuli • See Jeff’s talk
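
A hedged sketch of metric MDS with scikit-learn, assuming all we have is a matrix of pairwise dissimilarities between stimuli (here the "perceptual distances" are faked from made-up points just to have something to feed in):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(6)
hidden_points = rng.normal(size=(10, 2))       # a made-up 2-D "psychological space"
D = squareform(pdist(hidden_points))           # 10x10 matrix of pairwise distances

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)                  # coordinates whose distances approximate D

print(np.round(coords, 2))   # recovered configuration, up to rotation/reflection/translation
```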
