Dimensionality Reduction
• Given N vectors in n dimensions, find the k most important axes to project them
• k is user defined (k < n)
• Applications: information retrieval & indexing
  • identify the k most important features, or
  • reduce indexing dimensions for faster retrieval (low-dimensional indices are faster)
Techniques
• Eigenvalue analysis techniques [NR’92]
  • Karhunen-Loeve (K-L) transform
  • Singular Value Decomposition (SVD)
  • both need O(N²) time
• FastMap [Faloutsos & Lin 95]
  • dimensionality reduction and
  • mapping of objects to vectors
  • O(N) time
Mathematical Preliminaries
• For an n×n square matrix S, a unit vector x, and a scalar value λ: Sx = λx
  • x: eigenvector of S
  • λ: eigenvalue of S
• The eigenvectors of a symmetric matrix (S = Sᵀ) are mutually orthogonal and its eigenvalues are real
• r: rank of a matrix, the maximum number of independent columns (or rows)
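These definitions can be checked numerically. A minimal sketch with a small symmetric matrix (the matrix itself is a hypothetical example, not from the slides):

```python
import numpy as np

# A small symmetric matrix (hypothetical example)
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

w, X = np.linalg.eigh(S)      # eigenvalues (ascending) and unit eigenvectors
# S x = lambda x holds for every eigenpair
for lam, x in zip(w, X.T):
    assert np.allclose(S @ x, lam * x)
# eigenvectors of a symmetric matrix are mutually orthogonal
assert np.allclose(X.T @ X, np.eye(2))
```

`numpy.linalg.eigh` is used rather than `eig` because it is specialized for symmetric matrices and returns real eigenvalues, matching the property stated above.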
Example 1
• Intuition: S defines a linear transform y = Sx that involves scaling and rotation
• eigenvectors: unit vectors along the new directions
• eigenvalues denote the scaling along those directions
(figure: eigenvector of the major axis)
Example 2
• If S is real and symmetric (S = Sᵀ) then it can be written as S = UΛUᵀ
  • the columns of U are the eigenvectors of S
  • U: column orthogonal (UUᵀ = I)
  • Λ: diagonal, with the eigenvalues of S on the diagonal
Karhunen-Loeve (K-L)
• Project onto a k-dimensional space (k < n), minimizing the error of the projections (sum of squared differences)
• K-L gives a linear combination of axes
  • sorted by importance
  • keep the first k dimensions
(figure: 2-dim points and the 2 K-L directions; for k = 1, keep x’)
Computation of K-L
• Put the N vectors as rows in A = [a_ij]
• Compute B = [a_ij − ā_j], where ā_j is the average of column j
• Covariance matrix: C = BᵀB
• Compute the eigenvectors of C
• Sort them in decreasing eigenvalue order
• Approximate each object by its projections on the directions of the first k eigenvectors
Intuition
• B shifts the origin to the center of gravity of the vectors (by subtracting the column averages) and has zero column mean
• C represents attribute-to-attribute similarity
• C is square, real, and symmetric
• Eigenvectors and eigenvalues are computed on C, not on A
• C defines the linear transform that minimizes the projection error
• Approximate each vector by its projections along the first k eigenvectors
Example
• Input vectors: [1 2], [1 1], [0 0]
• Then the column averages are 2/3 and 1
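The K-L steps above can be sketched on exactly these three vectors (only the choice of k = 1 is assumed here):

```python
import numpy as np

# The three input vectors from the example, as rows of A
A = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [0.0, 0.0]])

avg = A.mean(axis=0)          # column averages: [2/3, 1]
B = A - avg                   # shift origin: zero column mean
C = B.T @ B                   # "covariance" matrix C = B^T B

w, V = np.linalg.eigh(C)      # eigenpairs of C (ascending order)
order = np.argsort(w)[::-1]   # sort in decreasing eigenvalue order
w, V = w[order], V[:, order]

k = 1
projections = B @ V[:, :k]    # each object approximated by k coordinates
```

Projecting back with `projections @ V[:, :k].T + avg` gives the rank-k approximation whose sum of squared differences from A is minimal.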
SVD
• For general rectangular matrices
  • N×n matrix (N vectors, n dimensions)
• Groups similar entities (documents) together
• Groups similar terms together; each group of terms corresponds to a concept
• Given an N×n matrix A, write it as A = UΛVᵀ
  • U: N×r column-orthogonal matrix (r: rank of A)
  • Λ: r×r diagonal matrix (non-negative values, in descending order)
  • V: n×r column-orthogonal matrix (so Vᵀ is r×n)
SVD (cont’d)
• A = λ₁u₁v₁ᵀ + λ₂u₂v₂ᵀ + … + λ_r u_r v_rᵀ
  • u_i, v_i are the column vectors of U, V
• SVD identifies rectangular blobs of related values in A
• The rank r of A: the number of blobs
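The rank-1 sum can be demonstrated numerically. A sketch on a hypothetical matrix with two "blobs" of related values:

```python
import numpy as np

# Hypothetical matrix: two rectangular blobs of related values
A = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])

U, s, Vt = np.linalg.svd(A)   # s holds the singular values, descending
r = int(np.sum(s > 1e-10))    # rank of A = number of blobs

# A = lambda_1 u1 v1^T + lambda_2 u2 v2^T + ... + lambda_r ur vr^T
approx = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
assert np.allclose(approx, A)
```

Truncating the sum at k < r terms gives the best rank-k approximation of A, which is what LSI exploits below.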
Example
• Two types of documents: CS and Medical
• Two concepts (groups of terms)
  • CS: data, information, retrieval
  • Medical: brain, lung
Example (cont’d)
(figure: the decomposition A = UΛVᵀ with r = 2)
• U: document-to-concept similarity matrix
• V: term-to-concept similarity matrix
• v₁₂ = 0: the term “data” has zero similarity with the 2nd concept
SVD and LSI
• SVD leads to “Latent Semantic Indexing” (http://lsi.research.telcordia.com/lsi/LSIpapers.html)
• Terms that occur together are grouped into concepts
• When a user searches for a term, the system determines the relevant concepts to search
• LSI maps documents and queries to vectors in the concept space instead of the n-dimensional term space
• The concept space has lower dimensionality
Examples of Queries
• Find documents containing the term “data”
• Translate the query vector q to concept space
• The query is related to the CS concept and unrelated to the medical concept
• LSI also returns documents that contain the terms “retrieval” and “information”, which are not specified by the query
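The query translation can be sketched end to end. The document-term matrix below is hypothetical (chosen in the spirit of the CS/Medical example, not copied from the slides): a query on "data" loads only on the CS concept.

```python
import numpy as np

# Hypothetical document-term matrix; columns are the terms
# data, information, retrieval, brain, lung
A = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],   # CS document
              [2.0, 2.0, 2.0, 0.0, 0.0],   # CS document
              [0.0, 0.0, 0.0, 1.0, 1.0],   # medical document
              [0.0, 0.0, 0.0, 2.0, 2.0]])  # medical document

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))                 # two concepts
Vt = Vt[:r]                                # term-to-concept directions

q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # query: the single term "data"
concept = Vt @ q                           # query mapped to concept space
# one concept component is nonzero (CS), the other is 0 (medical)
```

Documents are compared to the query in this r-dimensional concept space, so CS documents mentioning only "retrieval" or "information" still match.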
FastMap
• Works with distances; has two roles:
  • Maps objects to vectors so that their distances are preserved (then apply SAMs for indexing)
  • Dimensionality reduction: given N vectors with n attributes each, find N vectors with k attributes such that distances are preserved as much as possible
Main idea
• Pretend that the objects are points in some unknown n-dimensional space
  • project these points on k mutually orthogonal axes
  • compute the projections using distances only
• The heart of FastMap is the method that projects two objects on a line
  • take 2 objects that are far apart (pivots)
  • project on the line that connects the pivots
Project Objects on a Line
• O_a, O_b: pivots; O_i: any object
• d_ij: shorthand for D(O_i, O_j)
• Apply the cosine law: d_bi² = d_ai² + d_ab² − 2·x_i·d_ab, so
  x_i = (d_ai² + d_ab² − d_bi²) / (2·d_ab)
• x_i: the first coordinate in the k-dimensional space
• If O_i is close to O_a, x_i is small
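The cosine-law projection is a one-liner; the sanity-check points below are a hypothetical Euclidean example, not from the slides:

```python
def project_on_line(d_ai, d_bi, d_ab):
    """First coordinate of O_i on the pivot line, from the cosine law:
    d_bi^2 = d_ai^2 + d_ab^2 - 2 * x_i * d_ab."""
    return (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2.0 * d_ab)

# Sanity check with Euclidean points O_a=(0,0), O_b=(4,0), O_i=(1,2):
# the projection of O_i on the pivot line is x_i = 1
x_i = project_on_line(5 ** 0.5, 13 ** 0.5, 4.0)
```

Note that only the three pairwise distances are used; the coordinates of the points never appear, which is the whole point of FastMap.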
Choose Pivots
• Heuristic, using distances only:
  1. start from an arbitrary object O_b
  2. let O_a be the object farthest from O_b
  3. let O_b be the object farthest from O_a
• Complexity: O(N); the optimal algorithm (farthest pair) would require O(N²) time
• Steps 2, 3 can be repeated 4-5 times to improve the accuracy of the selection
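The heuristic above can be sketched as follows (the function name and `iters` parameter are illustrative, not from the slides):

```python
def choose_pivots(n, dist, iters=5):
    """Pick two far-apart objects using distances only: O(iters * N).
    dist(i, j) returns the distance between objects i and j."""
    b = 0                                           # arbitrary starting object
    a = max(range(n), key=lambda i: dist(b, i))     # farthest from b
    for _ in range(iters - 1):                      # repeat steps 2-3 to refine
        b = max(range(n), key=lambda i: dist(a, i))
        a = max(range(n), key=lambda i: dist(b, i))
    return a, b
```

Each pass scans all N objects once, so the total cost stays linear in N, unlike the O(N²) exhaustive farthest-pair search.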
Extension for Many Dimensions
• Consider the (n−1)-dimensional hyperplane H that is perpendicular to the line O_aO_b
• Project the objects on H and apply the previous step
  • choose two new pivots
  • the new x_i is the next object coordinate
  • repeat until k-dimensional vectors are obtained
• The distance on H is not D
  • D′: the distance between the projected objects
Distance on the Hyperplane H
• D′ on H can be computed from the Pythagorean theorem:
  D′(O_i, O_j)² = D(O_i, O_j)² − (x_i − x_j)²
• The ability to compute D′ allows for computing a second line on H, etc.
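A sketch of this distance update; the clamp to 0 anticipates the negative-distance issue the later "Observations" slide mentions, and the check points are a hypothetical Euclidean example:

```python
import math

def dist_on_hyperplane(d_ij, x_i, x_j):
    """D'(O_i,O_j)^2 = D(O_i,O_j)^2 - (x_i - x_j)^2 (Pythagorean theorem).
    Negative squared values, possible with non-Euclidean input distances,
    are clamped to 0."""
    return math.sqrt(max(d_ij ** 2 - (x_i - x_j) ** 2, 0.0))

# Euclidean check: points (1,2) and (3,5) with projected x-coordinates 1 and 3;
# the remaining distance on the hyperplane is |2 - 5| = 3
d_prime = dist_on_hyperplane(13 ** 0.5, 1.0, 3.0)
```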
Algorithm
(figure: pseudocode of the FastMap algorithm)
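The algorithm figure does not survive in this text, so here is a minimal sketch that combines the pieces above (line projection, pivot heuristic, hyperplane distances); it is an illustrative reconstruction of FastMap [Faloutsos & Lin 95], not the authors' exact pseudocode:

```python
import numpy as np

def fastmap(k, dist, n):
    """Map n objects to k-dimensional vectors using only dist(i, j)."""
    X = np.zeros((n, k))

    def d2(i, j, col):
        # squared distance on the hyperplane after `col` projections
        return dist(i, j) ** 2 - np.sum((X[i, :col] - X[j, :col]) ** 2)

    for col in range(k):
        # choose pivots with the farthest-object heuristic
        b = 0
        a = max(range(n), key=lambda i: d2(b, i, col))
        b = max(range(n), key=lambda i: d2(a, i, col))
        d_ab2 = d2(a, b, col)
        if d_ab2 <= 0:
            break                     # all objects coincide on this hyperplane
        d_ab = np.sqrt(d_ab2)
        for i in range(n):            # cosine-law projection on the pivot line
            X[i, col] = (d2(a, i, col) + d_ab2 - d2(b, i, col)) / (2.0 * d_ab)
    return X
```

With truly Euclidean input distances and k equal to the original dimensionality, the pairwise distances of the output vectors match the input distances (up to floating-point error).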
Observations
• Complexity: O(kN) distance calculations
  • k: desired dimensionality
  • k recursive calls, each taking O(N)
• The algorithm records the pivots of each call (dimension) to facilitate queries
  • a query is mapped to a k-dimensional vector by projecting it on the pivot lines of each dimension
  • O(1) computation per step: no need to recompute pivots
Observations (cont’d)
• The projected vectors can be indexed
  • mapping to 2-3 dimensions allows visualization of the data space
• Assumes a Euclidean space (triangle inequality)
  • not always true (at least after the second step)
  • pivots are approximate
  • some squared distances may become negative: turn negative distances to 0
Application: Document Vectors
(figure: document vectors example)
FastMap on 10 documents for 2 & 3 dimensions
(figure: (a) k = 2 and (b) k = 3)
References
• C. Faloutsos, Searching Multimedia Databases by Content, Kluwer, 1996
• W. Press et al., Numerical Recipes in C, Cambridge Univ. Press, 1988
• LSI website: http://lsi.research.telcordia.com/lsi/LSIpapers.html
• C. Faloutsos, K.-I. Lin, “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets”, Proc. of SIGMOD, 1995