CH.6: Dimensionality Reduction

Analysis of large amounts of multivariate data raises the problem of dimensionality reduction (DR).

Objectives of DR:
• Reduces space complexity
• Reduces time complexity
• Data visualization

6.1 Introduction
Two main methods of dimensionality reduction:
• Feature selection: choosing k < d important features, e.g.,
  subset selection (forward search, backward search, floating search)
• Feature extraction: mapping data points from the d-D space to a k-D space (k < d) while preserving as many properties of the data as possible, e.g.,
  Unsupervised linear extraction: principal component analysis, factor analysis, multidimensional scaling, canonical correlation analysis
  Unsupervised nonlinear extraction: isometric feature mapping, locally linear embedding, Laplacian eigenmaps
6.2 Feature Selection - Subset Selection
• Forward search: add the best feature at each step
  Initially, $F = \emptyset$ (F: the set of selected features)
  At each iteration, find the best new feature $j = \arg\min_i E(F \cup \{x_i\})$ on a validation sample, where $E(\cdot)$ is the error function
  Add $x_j$ to F if $E(F \cup \{x_j\}) < E(F)$
• Backward search: start with all features and remove one at a time; remove $x_j$ from F if $E(F \setminus \{x_j\}) < E(F)$
• Floating search: the numbers of added and removed features can change at each step.
Example: Iris data (3 classes, 4 features: F1, F2, F3, F4)

Single feature   Accuracy
F1               0.76
F2               0.57
F3               0.92
F4               0.94 (chosen)
Add one more feature to F4:

Feature subset   Accuracy
(F1, F4)         0.87
(F2, F4)         0.92
(F3, F4)         0.96 (chosen)

Adding a third feature gives accuracies of 0.94 for both (F1, F3, F4) and (F2, F3, F4), smaller than 0.96, so stop the feature selection process at (F3, F4).
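The forward-search procedure above can be sketched in a few lines. The following is a minimal illustration, assuming scikit-learn is available; the k-nearest-neighbor classifier and 5-fold cross-validated accuracy are arbitrary stand-ins for the error function E(), not necessarily the setup behind the table above.

```python
# Forward subset selection: greedily add the feature that most improves
# cross-validated accuracy, stopping when no addition helps.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
d = X.shape[1]

def accuracy(feature_idx):
    """Cross-validated accuracy using only the selected features."""
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, feature_idx], y, cv=5).mean()

selected, best_acc = [], 0.0
while len(selected) < d:
    candidates = [j for j in range(d) if j not in selected]
    scores = {j: accuracy(selected + [j]) for j in candidates}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_acc:        # no improvement: stop
        break
    selected.append(j_best)
    best_acc = scores[j_best]
    print(f"added feature {j_best}, accuracy = {best_acc:.3f}")

print("chosen subset:", selected)
```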
6.3 Feature Extraction
Principal Components Analysis (PCA)
-- Linearly transforms a number of correlated features into the same number of uncorrelated features.
• Recall that
  Uncorrelatedness: $\mathrm{Cov}(x_i, x_j) = 0$ for $i \neq j$
  Independence: $p(x_i, x_j) = p(x_i)\,p(x_j)$
Data vectors: $\mathbf{x}^t = (x_1^t, \dots, x_d^t)^T,\ t = 1, \dots, N$
Mean vector: $\mathbf{m} = \frac{1}{N}\sum_{t=1}^{N}\mathbf{x}^t$
Covariance matrix: $\mathbf{S} = \frac{1}{N}\sum_{t=1}^{N}(\mathbf{x}^t - \mathbf{m})(\mathbf{x}^t - \mathbf{m})^T$
Let $\lambda_1, \dots, \lambda_d$ and $\mathbf{e}_1, \dots, \mathbf{e}_d$ be the eigenvalues and eigenvectors of $\mathbf{S}$, i.e., $\mathbf{S}\mathbf{e}_i = \lambda_i\mathbf{e}_i$. Suppose $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$. Let $\mathbf{W} = [\mathbf{e}_1, \dots, \mathbf{e}_d]$ and $\mathbf{y}^t = \mathbf{W}^T(\mathbf{x}^t - \mathbf{m})$.
The variances over the y-axes = eigenvalues: $\mathrm{Var}(y_i) = \lambda_i$.
The mean of the y's: $E[\mathbf{y}] = \mathbf{0}$
The covariance of the y's: $\mathrm{Cov}(\mathbf{y}) = \mathbf{D} = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$
What does PCA do?
(1) It centers the data at $\mathbf{m}$ and transforms the data from the original x-coordinates to the uncorrelated y-coordinates (the eigenvector directions). If the variances over the y-axes are normalized by dividing by the square roots of the eigenvalues, Euclidean distance can be used in this new space.
(2) Dimensionality reduction
Let $\mathbf{W}_k = [\mathbf{e}_1, \dots, \mathbf{e}_k]$, where $k < d$, and $\mathbf{z}^t = \mathbf{W}_k^T(\mathbf{x}^t - \mathbf{m})$.
Let $\hat{\mathbf{x}}^t = \mathbf{W}_k\mathbf{z}^t + \mathbf{m}$ be the reconstruction of $\mathbf{x}^t$ from $\mathbf{z}^t$. The representation error depends on the discarded eigenvalues $\lambda_{k+1}, \dots, \lambda_d$, which are relatively smaller than $\lambda_1, \dots, \lambda_k$.
How to choose k? Proportion of Variance (PoV):
$\mathrm{PoV}(k) = \dfrac{\lambda_1 + \lambda_2 + \dots + \lambda_k}{\lambda_1 + \lambda_2 + \dots + \lambda_d}$
Choose k at PoV > 0.9.
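As a concrete illustration of steps (1) and (2) and the PoV rule, here is a minimal NumPy sketch; the randomly generated matrix X is only a placeholder for a real data set.

```python
# PCA via eigendecomposition of the covariance matrix, choosing k by PoV > 0.9.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # placeholder data (N x d)

m = X.mean(axis=0)                      # mean vector
S = np.cov(X - m, rowvar=False)         # covariance matrix (d x d)
lam, E = np.linalg.eigh(S)              # eigenvalues ascending, eigenvectors in columns
order = np.argsort(lam)[::-1]           # sort in decreasing order
lam, E = lam[order], E[:, order]

pov = np.cumsum(lam) / lam.sum()        # proportion of variance
k = int(np.searchsorted(pov, 0.9)) + 1  # smallest k with PoV > 0.9
W = E[:, :k]                            # d x k projection matrix

Z = (X - m) @ W                         # new k-D coordinates
X_hat = Z @ W.T + m                     # reconstruction
print(k, np.mean((X - X_hat) ** 2))     # chosen k and mean reconstruction error
```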
(3) Eigenfaces (PCA applied to face images)
Spectral Decomposition
Let E be the matrix whose ith column is the normalized eigenvector $\mathbf{e}_i$ of matrix S. Then
$\mathbf{S} = \mathbf{E}\mathbf{D}\mathbf{E}^T = \sum_{i=1}^{d}\lambda_i\mathbf{e}_i\mathbf{e}_i^T$, where $\mathbf{D} = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$,
the spectral decomposition of matrix S.
Singular Value Decomposition
Let A be an $N \times d$ real data matrix. Then $\mathbf{A} = \mathbf{V}\mathbf{W}\mathbf{U}^T$, where V ($N \times N$) contains the eigenvectors (in columns) of $\mathbf{A}\mathbf{A}^T$, U ($d \times d$) contains the eigenvectors (in columns) of $\mathbf{A}^T\mathbf{A}$, and W is an $N \times d$ matrix whose diagonal entries $w_i$ are the singular values of A.
• Let $\mathbf{u}_i$ be the eigenvectors of matrix $\mathbf{A}^T\mathbf{A}$ and $\lambda_i$ the corresponding eigenvalues. The spectral decomposition gives $\mathbf{A}^T\mathbf{A} = \sum_i\lambda_i\mathbf{u}_i\mathbf{u}_i^T$.
Comparing with the SVD, the square roots of the eigenvalues $\lambda_i$ of $\mathbf{A}^T\mathbf{A}$ correspond to the singular values of A, i.e., $w_i = \sqrt{\lambda_i}$.

6.4 Feature Embedding (FE)
-- Place d-D data points in a k-D space (k < d) such that pairwise similarities in the new space respect the original pairwise similarities.
Let $\mathbf{X}$ be the $N \times d$ data matrix, and let $\lambda_i$ and $\mathbf{v}_i$ be the eigenvalues and eigenvectors of $\mathbf{X}^T\mathbf{X}$, i.e., $\mathbf{X}^T\mathbf{X}\mathbf{v}_i = \lambda_i\mathbf{v}_i$. Multiplying both sides by X,
$(\mathbf{X}\mathbf{X}^T)(\mathbf{X}\mathbf{v}_i) = \lambda_i(\mathbf{X}\mathbf{v}_i)$,
so $\lambda_i$ and $\mathbf{X}\mathbf{v}_i$ are the eigenvalues and eigenvectors of $\mathbf{X}\mathbf{X}^T$. The vectors $\mathbf{X}\mathbf{v}_i$, called the feature embedding, form the coordinates of the instances in the new space.
• $\mathbf{X}^T\mathbf{X}$ ($d \times d$) is the correlation matrix of features; $\mathbf{X}\mathbf{X}^T$ ($N \times N$) is the similarity matrix of instances.
• When d < N, it is simpler to work with $\mathbf{X}^T\mathbf{X}$; when d > N, it is easier to work with $\mathbf{X}\mathbf{X}^T$. (A small numerical check of this duality is sketched below.)
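The check below uses a random placeholder matrix with d > N: the nonzero eigenvalues of $\mathbf{X}^T\mathbf{X}$ and $\mathbf{X}\mathbf{X}^T$ coincide, and the instance coordinates $\mathbf{X}\mathbf{v}_i$ can be read off (up to sign) from the eigenvectors of the small $N \times N$ matrix.

```python
# Feature embedding: eigenvectors of X^T X, pushed through X, are eigenvectors of X X^T.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 50))            # N=6 instances, d=50 features (d > N)

# Work with the small N x N similarity matrix of instances.
lam, U = np.linalg.eigh(X @ X.T)        # eigenpairs of X X^T
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# The same nonzero eigenvalues appear for the d x d matrix X^T X.
lam_big = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1][:6]
print(np.allclose(lam, lam_big))        # True (up to numerical error)

# Coordinates of the instances in the new space: X v_i = sqrt(lambda_i) u_i.
Z = U[:, :2] * np.sqrt(lam[:2])         # N x 2 feature embedding
print(Z.shape)
```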
6.5 Factor Analysis (FA)
• In PCA, from the original features $x_1, \dots, x_d$ we form a new set of features $z_1, \dots, z_k$, which are linear combinations of the $x_i$.
• In FA, a set of unobservable (latent) factors $z_1, \dots, z_k$ combine to generate the observed features $x_1, \dots, x_d$.
Example: Let $x_1, \dots, x_5$ be the score variables of Chinese, English, Mathematics, Physics, and Chemistry, respectively, which are observable. Let $z_1, \dots, z_k$ be the talent variables (e.g., memory, inference, association, etc.), which are latent. Specifically, given the scores of a student, what are the loadings of the factors for that student?
Given a sample $X = \{\mathbf{x}^t\}_{t=1}^{N}$ with $E[\mathbf{x}] = \boldsymbol{\mu}$ and $\mathrm{Cov}(\mathbf{x}) = \boldsymbol{\Sigma}$, find a small number of factors $z_1, \dots, z_k$ s.t. each $x_i$ can be written as a weighted sum of the factors:
$x_i - \mu_i = \sum_{j=1}^{k} v_{ij} z_j + \varepsilon_i, \quad i = 1, \dots, d$
where $z_j$: latent factors with $E[z_j] = 0$, $\mathrm{Var}(z_j) = 1$, $\mathrm{Cov}(z_i, z_j) = 0$ for $i \neq j$; $v_{ij}$: factor loadings; $\varepsilon_i$: errors with $E[\varepsilon_i] = 0$, $\mathrm{Var}(\varepsilon_i) = \psi_i$, $\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$, and $\mathrm{Cov}(\varepsilon_i, z_j) = 0$.
In vector-matrix form, $\mathbf{x} - \boldsymbol{\mu} = \mathbf{V}\mathbf{z} + \boldsymbol{\varepsilon}$.
Two Uses of Factor Analysis: 1) Knowledge extraction, 2) Dimensionality reduction
Knowledge Extraction – Given x, find the loadings V, assuming
$\boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{x}) = \mathbf{V}\mathbf{V}^T + \boldsymbol{\Psi}$.
From this, ignoring $\boldsymbol{\Psi}$, $\boldsymbol{\Sigma} \approx \mathbf{V}\mathbf{V}^T$. Let S be the estimator of $\boldsymbol{\Sigma}$ from a sample. Spectral decomposition of S: $\mathbf{S} = \mathbf{C}\mathbf{D}\mathbf{C}^T$, where C and D hold the eigenvectors and eigenvalues of S, respectively, which gives $\mathbf{V} = \mathbf{C}\mathbf{D}^{1/2}$ (keeping the k leading eigenvectors).
Dimensionality Reduction – Given x, find z s.t.
$z_j = \sum_{i=1}^{d} w_{ji}(x_i - \mu_i) + \varepsilon_j$.
In vector-matrix form, $\mathbf{z} = \mathbf{W}^T(\mathbf{x} - \boldsymbol{\mu}) + \boldsymbol{\varepsilon}$.
Given a sample $X = \{\mathbf{x}^t\}_{t=1}^{N}$, in vector-matrix form $\mathbf{z}^t = \mathbf{W}^T(\mathbf{x}^t - \mathbf{m}) + \boldsymbol{\varepsilon}^t$. Ignore $\boldsymbol{\varepsilon}$ and solve for W as a linear regression problem, in which $\mathbf{W} = \mathbf{S}^{-1}\mathbf{V}$, so that $\mathbf{z}^t = \mathbf{V}^T\mathbf{S}^{-1}(\mathbf{x}^t - \mathbf{m})$,
where S is the estimated covariance matrix obtained from the given sample X and $\mathbf{V} = \mathbf{C}\mathbf{D}^{1/2}$, with C and D holding the eigenvectors and eigenvalues of S.

6.6 Matrix Factorization (MF)
-- Factor the $N \times d$ data matrix X into an $N \times k$ matrix F and a $k \times d$ matrix G, $\mathbf{X} \approx \mathbf{F}\mathbf{G}$, where k is the dimensionality of the factor space.
G defines the factors in terms of the original attributes; F defines the data instances in terms of the factors. (A small example using one standard factorization algorithm is sketched below.)
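The slides do not commit to a particular factorization algorithm, so the sketch below uses non-negative matrix factorization from scikit-learn purely as one example of obtaining X ≈ FG; the random non-negative matrix is a placeholder.

```python
# Matrix factorization X ~ F G, here with non-negative matrix factorization.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
X = rng.random((100, 8))                # N=100 instances, d=8 attributes (non-negative)

k = 3                                   # dimensionality of the factor space
nmf = NMF(n_components=k, init="random", random_state=0, max_iter=500)
F = nmf.fit_transform(X)                # N x k: instances in terms of factors
G = nmf.components_                     # k x d: factors in terms of original attributes

print(F.shape, G.shape)
print(np.linalg.norm(X - F @ G) / np.linalg.norm(X))  # relative reconstruction error
```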
6.7 Multidimensional Scaling (MDS)
-- Given the pairwise Euclidean distances of points in the d-D space, MDS places the points in a lower-dimensional space so that these distances are preserved as well as possible.
Let $X = \{\mathbf{x}^t\}_{t=1}^{N}$ be a sample, where $\mathbf{x}^t \in \mathbb{R}^d$. For two points r and s, their squared Euclidean distance is
$d_{rs}^2 = \|\mathbf{x}^r - \mathbf{x}^s\|^2 = \sum_{j=1}^{d}(x_j^r - x_j^s)^2 = b_{rr} + b_{ss} - 2b_{rs}$   (A)
where $b_{rs} = \sum_{j=1}^{d} x_j^r x_j^s = (\mathbf{x}^r)^T\mathbf{x}^s$. Let $T = \sum_r b_{rr}$.
Suppose the data have been centered at the origin so that $\sum_r x_j^r = 0$ for all j; then $\sum_r b_{rs} = 0$ and likewise $\sum_s b_{rs} = 0$. Summing (A) over r, over s, and over both gives
$\sum_r d_{rs}^2 = T + N b_{ss}$   (C)
$\sum_s d_{rs}^2 = N b_{rr} + T$   (D)
$\sum_r \sum_s d_{rs}^2 = 2NT$   (E)
From (C): $b_{ss} = \frac{1}{N}\left(\sum_r d_{rs}^2 - T\right)$. From (D): $b_{rr} = \frac{1}{N}\left(\sum_s d_{rs}^2 - T\right)$. From (E): $T = \frac{1}{2N}\sum_r\sum_s d_{rs}^2$.
Let $d_{\cdot s}^2 = \frac{1}{N}\sum_r d_{rs}^2$, $d_{r\cdot}^2 = \frac{1}{N}\sum_s d_{rs}^2$, $d_{\cdot\cdot}^2 = \frac{1}{N^2}\sum_r\sum_s d_{rs}^2$ (all known by calculation from the given distances).
From (A), $b_{rs} = \frac{1}{2}(b_{rr} + b_{ss} - d_{rs}^2) = \frac{1}{2}(d_{r\cdot}^2 + d_{\cdot s}^2 - d_{\cdot\cdot}^2 - d_{rs}^2)$.
• In matrix form, $\mathbf{B} = [b_{rs}] = \mathbf{X}\mathbf{X}^T$. Spectral decomposition of B: $\mathbf{B} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^T$, where E holds the eigenvectors and $\boldsymbol{\Lambda}$ the eigenvalues. Decide a dimensionality k < d based on the eigenvalues (e.g., by PoV). The new coordinates of data point t are given by $z_j^t = \sqrt{\lambda_j}\,e_j^t,\ j = 1, \dots, k$.
MDS Algorithm
Given the matrix of squared pairwise distances $[d_{rs}^2]$, where $d_{rs}$ is the distance between data points r and s in the p-D space:
1. Form $\mathbf{B} = [b_{rs}]$ from the distances as above and find its spectral decomposition $\mathbf{B} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^T$.
2. Discard from $\boldsymbol{\Lambda}$ the small eigenvalues and from E the corresponding eigenvectors to form $\boldsymbol{\Lambda}_k$ and $\mathbf{E}_k$, respectively.
3. Find $\mathbf{Z} = \mathbf{E}_k\boldsymbol{\Lambda}_k^{1/2}$. The coordinates of the points are the rows of Z.

Multidimensional Scaling, T.F. Cox and M.A.A. Cox, 2006.
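A compact NumPy version of this algorithm, assuming the input is a matrix of squared Euclidean distances; the double-centering step recovers B from the distances exactly as in the derivation above, and the example data are placeholders.

```python
# Classical MDS: from squared pairwise distances to k-D coordinates.
import numpy as np

def classical_mds(D2, k):
    """D2: N x N matrix of squared Euclidean distances; returns N x k coordinates."""
    N = D2.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N       # centering matrix
    B = -0.5 * J @ D2 @ J                     # b_rs for data centered at the origin
    lam, E = np.linalg.eigh(B)
    order = np.argsort(lam)[::-1][:k]         # k largest eigenvalues
    return E[:, order] * np.sqrt(np.maximum(lam[order], 0))

# Example: distances computed from points in 3-D, embedded into 2-D.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Z = classical_mds(D2, k=2)
print(Z.shape)                                # (30, 2)
```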
6.8 Linear Discriminant Analysis (LDA)
-- Find a low-dimensional space such that when data are projected onto it, the examples of different classes are as well separated as possible.
• 2-Class (d-D to 1-D) case: find a direction w such that when the data are projected onto w, the examples of the two classes are well separated.
Given a sample $X = \{(\mathbf{x}^t, r^t)\}_{t=1}^{N}$, where $r^t = 1$ if $\mathbf{x}^t \in C_1$ and $r^t = 0$ if $\mathbf{x}^t \in C_2$, the projection is $z = \mathbf{w}^T\mathbf{x}$.
Means: $m_1 = \dfrac{\sum_t \mathbf{w}^T\mathbf{x}^t r^t}{\sum_t r^t} = \mathbf{w}^T\mathbf{m}_1$, $\quad m_2 = \dfrac{\sum_t \mathbf{w}^T\mathbf{x}^t (1 - r^t)}{\sum_t (1 - r^t)} = \mathbf{w}^T\mathbf{m}_2$
Scatters: $s_1^2 = \sum_t (\mathbf{w}^T\mathbf{x}^t - m_1)^2 r^t$, $\quad s_2^2 = \sum_t (\mathbf{w}^T\mathbf{x}^t - m_2)^2 (1 - r^t)$
• Find w that maximizes the Fisher criterion
$J(\mathbf{w}) = \dfrac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$,
where $(m_1 - m_2)^2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} = \mathbf{w}^T\mathbf{S}_B\mathbf{w}$, with $\mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$ (between-class scatter matrix).
Similarly, $s_1^2 + s_2^2 = \mathbf{w}^T\mathbf{S}_W\mathbf{w}$, where $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$ and $\mathbf{S}_i = \sum_t r_i^t(\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T$, with $r_i^t = 1$ if $\mathbf{x}^t \in C_i$ and 0 otherwise (within-class scatter matrix).
Let the maximizer of $J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$ be $\mathbf{w} = c\,\mathbf{S}_W^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$, where c is a constant.
• K > 2 Class (d-D to k-D) case
Within-class scatter matrix: $\mathbf{S}_W = \sum_{i=1}^{K}\mathbf{S}_i$, $\quad \mathbf{S}_i = \sum_t r_i^t(\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T$
Between-class scatter matrix: $\mathbf{S}_B = \sum_{i=1}^{K} N_i(\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T$, where $\mathbf{m}$ is the overall mean and $N_i$ the number of examples in class $C_i$.
Let $\mathbf{W}$ ($d \times k$) be the projection matrix from the d-D space to the k-D space (k < d); then $\mathbf{z} = \mathbf{W}^T\mathbf{x}$, and the scatter matrices after projection are $\mathbf{W}^T\mathbf{S}_W\mathbf{W}$ and $\mathbf{W}^T\mathbf{S}_B\mathbf{W}$.
For a scatter matrix, a measure of spread is its determinant. We want to find W such that
$J(\mathbf{W}) = \dfrac{|\mathbf{W}^T\mathbf{S}_B\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_W\mathbf{W}|}$ (generalized Rayleigh quotient)
is maximized.
The determinant of a matrix is the product of its eigenvalues, i.e., $|\mathbf{A}| = \prod_i \lambda_i$. The eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ with the largest eigenvalues form the columns of W.

Fisher Discriminant Analysis with Kernels, S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller, IEEE, 1999.
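A NumPy sketch of the multi-class case: build $\mathbf{S}_W$ and $\mathbf{S}_B$ from a labeled sample and take the leading eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ as the columns of W; the Iris data set is used only as a convenient labeled example.

```python
# LDA: columns of W are the leading eigenvectors of S_W^{-1} S_B.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
d, classes = X.shape[1], np.unique(y)
m = X.mean(axis=0)                                    # overall mean

S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in classes:
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)                    # within-class scatter
    S_B += len(Xc) * np.outer(mc - m, mc - m)         # between-class scatter

eigval, eigvec = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigval.real)[::-1]
k = len(classes) - 1                                  # at most K-1 useful directions
W = eigvec[:, order[:k]].real                         # d x k projection matrix

Z = (X - m) @ W                                       # projected data
print(W.shape, Z.shape)
```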
[Figure: comparison of the projection directions found by LDA and PCA]
6.9 Canonical Correlation Analysis (CCA)
• Given a sample $X = \{(\mathbf{x}^t, \mathbf{y}^t)\}_{t=1}^{N}$, both x and y are inputs, e.g., (1) acoustic information and visual information in speech recognition, (2) image data and text annotations in a retrieval application.
• Take the correlation between x and y into account while reducing dimensionality to a joint space, i.e., find two vectors w and v s.t. when x is projected along w and y is projected along v, their correlation is maximized:
$\rho = \mathrm{Corr}(\mathbf{w}^T\mathbf{x}, \mathbf{v}^T\mathbf{y}) = \dfrac{\mathbf{w}^T\boldsymbol{\Sigma}_{xy}\mathbf{v}}{\sqrt{\mathbf{w}^T\boldsymbol{\Sigma}_{xx}\mathbf{w}}\sqrt{\mathbf{v}^T\boldsymbol{\Sigma}_{yy}\mathbf{v}}}$
where the covariance matrices are $\boldsymbol{\Sigma}_{xx} = \mathrm{Cov}(\mathbf{x})$ and $\boldsymbol{\Sigma}_{yy} = \mathrm{Cov}(\mathbf{y})$, and the cross-covariance matrices are $\boldsymbol{\Sigma}_{xy} = \mathrm{Cov}(\mathbf{x}, \mathbf{y}) = \boldsymbol{\Sigma}_{yx}^T$.
Solutions: (w, v) satisfy the generalized eigenvalue problems
$\boldsymbol{\Sigma}_{xy}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{yx}\mathbf{w} = \lambda\boldsymbol{\Sigma}_{xx}\mathbf{w}$, $\quad \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}\mathbf{v} = \lambda\boldsymbol{\Sigma}_{yy}\mathbf{v}$.
Choose the pair (w, v) with the largest eigenvalue as the first solution; similarly, the other pairs are given by the eigenvectors with the next largest eigenvalues.
In matrix form, $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_k]$ and $\mathbf{V} = [\mathbf{v}_1, \dots, \mathbf{v}_k]$, and $\mathbf{W}^T\mathbf{x}$ and $\mathbf{V}^T\mathbf{y}$ are the lower-dimensional representations of x and y (from the d-D space to the k-D space, k < d).
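A short sketch using scikit-learn's CCA estimator, which finds the projection directions maximizing the correlation between the two views; the two correlated random "views" are placeholders standing in for, e.g., acoustic and visual features.

```python
# CCA: project paired views x and y to a joint k-D space with maximal correlation.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(4)
latent = rng.normal(size=(200, 2))                  # shared structure across views
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))   # view 1 (d=6)
Y = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))   # view 2 (d=4)

k = 2
cca = CCA(n_components=k)
Zx, Zy = cca.fit_transform(X, Y)                    # k-D representations of x and y

# Correlation of each pair of canonical variates.
for i in range(k):
    print(np.corrcoef(Zx[:, i], Zy[:, i])[0, 1])
```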
6.10 Isometric Feature Mapping (Isomap)
• Manifold example: poses seen from 5 different views. Each pose is described by four joints: right foot, left foot, lower right leg, lower left leg.
[Figures: the manifold of a side view; the manifolds of the 5 views]
• Geodesic distance: the distance along the manifold that the data lies in.
• Local distance: Euclidean distance. Nodes r and s are connected if $\|\mathbf{x}^r - \mathbf{x}^s\| < \epsilon$ or if one is among the n nearest neighbors of the other.
• For two nodes r and s not connected, their distance is equal to the length of the shortest path between them in the graph.
• Once the distance matrix is formed, use MDS to find a lower-dimensional mapping (see the sketch below).
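Putting the pieces together, a sketch of Isomap using SciPy's shortest-path routine for the geodesic distances and an inline classical-MDS step; the Swiss-roll-like data, neighborhood size, and target dimension are arbitrary illustrative choices, and the neighborhood graph is assumed to be connected.

```python
# Isomap: k-NN graph -> shortest-path (geodesic) distances -> classical MDS.
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(5)
t = 3 * np.pi * (1 + 2 * rng.random(300)) / 2
X = np.column_stack([t * np.cos(t), 10 * rng.random(300), t * np.sin(t)])  # swiss-roll-like

# 1. Neighborhood graph with Euclidean edge weights.
G = kneighbors_graph(X, n_neighbors=10, mode="distance")

# 2. Geodesic distances = shortest paths in the graph (assumes a connected graph).
D = shortest_path(G, method="D", directed=False)

# 3. Classical MDS on the squared geodesic distances.
N = D.shape[0]
J = np.eye(N) - np.ones((N, N)) / N
B = -0.5 * J @ (D ** 2) @ J
lam, E = np.linalg.eigh(B)
order = np.argsort(lam)[::-1][:2]
Z = E[:, order] * np.sqrt(np.maximum(lam[order], 0))
print(Z.shape)                                       # (300, 2)
```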
A Global Geometric Framework for Nonlinear Dimensionality Reduction, J.B. Tenenbaum, V. de Silva, and J.C. Langford, Science, Vol. 290, 2000.

6.11 Locally Linear Embedding (LLE)
-- Recovers global nonlinear structure from locally linear fits.
i) Each local patch of the manifold is approximated linearly.
ii) Each point is written as a linear, weighted sum of its neighbors.
Steps of LLE:
1. Given $\mathbf{x}^r$ and its neighbors $\mathbf{x}^{s(r)}$, find the weights $W_{rs}$ by minimizing the error function
$E^{(1)}(\mathbf{W}\,|\,X) = \sum_r \left\|\mathbf{x}^r - \sum_s W_{rs}\mathbf{x}^{s(r)}\right\|^2$
subject to $W_{rr} = 0$ and $\sum_s W_{rs} = 1$.
2. Find the new coordinates $\mathbf{z}^r$ that respect the constraints given by $W_{rs}$, i.e., minimize
$E^{(2)}(\mathbf{Z}\,|\,\mathbf{W}) = \sum_r \left\|\mathbf{z}^r - \sum_s W_{rs}\mathbf{z}^s\right\|^2$
subject to $\sum_r \mathbf{z}^r = \mathbf{0}$ and $\frac{1}{N}\sum_r \mathbf{z}^r(\mathbf{z}^r)^T = \mathbf{I}$ (zero mean, unit covariance).
The second objective can be written as $E^{(2)} = \sum_{r,s} M_{rs}\,(\mathbf{z}^r)^T\mathbf{z}^s$, where $\mathbf{M} = (\mathbf{I} - \mathbf{W})^T(\mathbf{I} - \mathbf{W})$.
Solution: from the k+1 eigenvectors with the smallest eigenvalues of M, ignore the lowest one because all its components are equal; the remaining k eigenvectors give the new coordinates.

Nonlinear Dimensionality Reduction by Locally Linear Embedding, S.T. Roweis and L.K. Saul, Science, Vol. 290, 2000.
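A brief sketch using scikit-learn's LocallyLinearEmbedding, which carries out the two steps above (local reconstruction weights, then coordinates from the bottom eigenvectors of M); the Swiss-roll data and parameter values are placeholders.

```python
# LLE via scikit-learn: local reconstruction weights, then eigenvector-based coordinates.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Z = lle.fit_transform(X)                 # 2-D coordinates respecting the local weights

print(Z.shape)                           # (1000, 2)
print(lle.reconstruction_error_)         # value of the LLE cost at the solution
```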
6.12 Laplacian Eigenmaps (LE)
Let $\mathbf{x}^r$ and $\mathbf{x}^s$ be two data instances and $b_{rs}$ their similarity. Find $\mathbf{y}^r$ and $\mathbf{y}^s$ that minimize
$\sum_{r,s}\|\mathbf{y}^r - \mathbf{y}^s\|^2\, b_{rs}$,
i.e., two similar instances (large $b_{rs}$) should be close in the new space (small $\|\mathbf{y}^r - \mathbf{y}^s\|$); the less similar they are (small $b_{rs}$), the farther apart they may be (large $\|\mathbf{y}^r - \mathbf{y}^s\|$).
• Define $b_{rs} = \exp\!\left(-\|\mathbf{x}^r - \mathbf{x}^s\|^2 / 2\sigma^2\right)$ if $\mathbf{x}^r$ and $\mathbf{x}^s$ are in a predefined neighborhood, and 0 otherwise, i.e., only local similarities are taken into account.
Consider the 1-D new space: minimize $\sum_{r,s}(y^r - y^s)^2 b_{rs} = 2\,\mathbf{y}^T\mathbf{L}\mathbf{y}$, where $\mathbf{L} = \mathbf{D} - \mathbf{B}$ is the Laplacian matrix and D is diagonal with $d_{rr} = \sum_s b_{rs}$. The solution is the eigenvector of L with the smallest nonzero eigenvalue.
Algorithm: Given k points $\mathbf{x}^1, \dots, \mathbf{x}^k$
1. Put an edge between nodes r and s if $\mathbf{x}^r$ and $\mathbf{x}^s$ are close, e.g., $\|\mathbf{x}^r - \mathbf{x}^s\|^2 < \epsilon$ or one is among the n nearest neighbors of the other.
2. Weight the edge by $b_{rs} = \exp\!\left(-\|\mathbf{x}^r - \mathbf{x}^s\|^2 / 2\sigma^2\right)$.
3. For each connected component of G, compute the generalized eigenvalues and eigenvectors of L, i.e.,
$\mathbf{L}\mathbf{f} = \lambda\mathbf{D}\mathbf{f}$   (A)
where D is the diagonal matrix with $d_{rr} = \sum_s b_{rs}$ and $\mathbf{L} = \mathbf{D} - \mathbf{B}$.
Let $\mathbf{f}_0, \mathbf{f}_1, \dots, \mathbf{f}_{k-1}$ be the solutions of (A), ordered from small to large eigenvalues. The images of the $\mathbf{x}^r$ embedded into $\mathbb{R}^m$ are $\mathbf{x}^r \mapsto (f_1(r), \dots, f_m(r))$; the eigenvector $\mathbf{f}_0$ (eigenvalue 0, constant components) is left out.

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, M. Belkin and P. Niyogi, Neural Computation, 15, pp. 1373-1396, 2003.
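A NumPy/SciPy sketch of the algorithm above: heat-kernel weights on a k-nearest-neighbor graph, then the generalized eigenproblem $\mathbf{L}\mathbf{f} = \lambda\mathbf{D}\mathbf{f}$, keeping the eigenvectors after the constant one. The data, neighborhood size, and kernel width $\sigma$ are placeholders, and the graph is assumed to be connected.

```python
# Laplacian eigenmaps: heat-kernel weights, graph Laplacian, generalized eigenproblem.
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))                       # placeholder data

# 1.-2. Neighborhood graph with heat-kernel (Gaussian) edge weights, symmetrized.
A = kneighbors_graph(X, n_neighbors=10, mode="distance").toarray()
A = np.maximum(A, A.T)                              # undirected graph
sigma = np.median(A[A > 0])                         # heuristic kernel width
B = np.where(A > 0, np.exp(-A ** 2 / (2 * sigma ** 2)), 0.0)

# 3. Generalized eigenproblem L f = lambda D f, with L = D - B.
D = np.diag(B.sum(axis=1))
L = D - B
lam, F = eigh(L, D)                                 # generalized eigenvalues, ascending

m = 2
Z = F[:, 1:m + 1]                                   # skip the constant eigenvector (lambda = 0)
print(Z.shape)                                      # (200, 2)
```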