Jeff Hansen, Senior Data Engineer • April 2013 • Demystifying Dimensionality Reduction
Demystifying Dimensionality Reduction A Tribute to Johnson and Lindenstrauss
Who is this? • What is this? • How about this? • Hint: It’s for kids…
Some Perspectives are Better than Others • Getting a better look
Great, but… • What does this have to do with Machine Learning? • How can this help me visualize my data? • How do I use this to recommend new products to new customers? • Can this help me detect fraud?
Dimensions in Data • Going beyond 3-D
Samples and Variables • Samples are things. Things have numerous: • Features • Characteristics • Attributes • Variables • aka Dimensions
Distance and Similarity If we • Treat each feature like a dimension • Treat each item like a point Then • Similar items are closer together • Dissimilar items are further apart
Measures of Distance Various measures of distance with scary math names: • Euclidean Distance • Maximum Distance • Manhattan Distance • L(n) Norm
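As a minimal sketch (using NumPy and two made-up feature vectors), the distances named above look like this in code:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean distance (L2 norm of the difference)
euclidean = np.linalg.norm(a - b)

# Maximum (Chebyshev) distance: the largest single-feature difference
maximum = np.max(np.abs(a - b))

# Manhattan distance (L1 norm): sum of the per-feature differences
manhattan = np.sum(np.abs(a - b))

# General L(n) norm of the difference, for any n >= 1
def ln_distance(x, y, n):
    return np.sum(np.abs(x - y) ** n) ** (1.0 / n)

print(euclidean, maximum, manhattan, ln_distance(a, b, 3))
```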
Curse of Dimensionality • You think more than 3 dimensions are hard? Try a couple million… • Calculating similarity becomes increasingly difficult as a feature set grows.
Reduce the Number of Dimensions Johnson-Lindenstrauss Theorem • The number of dimensions doesn’t matter, the sample size does: approximate item similarity can be maintained with a number of dimensions on the order of log(n), where n is the number of points. English? • Every time you double the number of points you only need to add a constant number of additional dimensions.
This is worth Repeating The number of dimensions doesn’t matter. If all you care about is item similarity, you can project an INFINITE number of dimensions onto a much smaller number of dimensions that depends only on the number of points you want to compare.
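A minimal sketch of this idea, assuming a plain Gaussian random projection in NumPy (the sizes n, d, and k below are made up for illustration): pairwise distances in the projected space stay close to the originals.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 1000, 10_000, 200          # points, original dimensions, reduced dimensions
X = rng.normal(size=(n, d))          # n high-dimensional points

# Random projection: entries ~ N(0, 1/k), so lengths are preserved in expectation
R = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))
Y = X @ R                            # the same n points in only k dimensions

# A few pairwise distances before and after the projection
for i, j in [(0, 1), (2, 3), (4, 5)]:
    before = np.linalg.norm(X[i] - X[j])
    after = np.linalg.norm(Y[i] - Y[j])
    print(f"pair ({i},{j}): original {before:.1f}, projected {after:.1f}")
```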
Feature Extraction • What if there were unrecorded variables that explain the variables we can see? • Dimensionality Reduction techniques extract these hidden variables or features. • For example, topics explain the appearance of words in documents, and genres explain the movies that people watch.
Sounds Great! But how do I do it?
Singular Value What? Unfortunately, the techniques come with tongue-twisting, unintuitive names: • SVD – Singular Value Decomposition • PCA – Principal Component Analysis • LSA – Latent Semantic Analysis • LDA – Linear Discriminant Analysis • Random Projections • MinHash
A Brief Refresher of Linear Algebra Don’t Panic!
Vectors and Projections *Image courtesy of Wikipedia: http://en.wikipedia.org/wiki/File:3D_Vector.svg
Vector “dot” Products • A · B = (a1 * b1) + (a2 * b2) + (a3 * b3) • A · B = ||A|| * ||B|| * cos θ If B is a unit vector (it has a length of 1) then the result is simply the length of A projected onto the line (or dimension) formed by B. Remember that a “good” projection is one where the angle is close to zero, so that cos θ is close to 1 and the dot product of A and B is approximately the length of A. This is like projecting the face of a coin onto a surface that’s parallel to the face of the coin – that would be a good projection.
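A small sketch of both formulas, with made-up vectors A and B (NumPy assumed):

```python
import numpy as np

A = np.array([3.0, 4.0, 0.0])
B = np.array([1.0, 0.0, 0.0])    # a unit vector defining a direction (dimension)

dot = A @ B                       # (a1 * b1) + (a2 * b2) + (a3 * b3)

# The same value via lengths and the angle between the vectors
cos_theta = dot / (np.linalg.norm(A) * np.linalg.norm(B))
same_dot = np.linalg.norm(A) * np.linalg.norm(B) * cos_theta

# Because B is a unit vector, the dot product is the length of A
# projected onto the line (dimension) formed by B
print(dot, same_dot)              # both 3.0
```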
Matrix Multiplication Cell 1,1 = Row 1 times Column 1 = (a1,1 x b1,1) + (a1,2 x b2,1) + (a1,3 x b3,1) Cell 1,2 = Row 1 times Column 2 = … …
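As a quick sketch with made-up matrices, the cell-by-cell formula above matches what NumPy’s matrix product computes:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
B = np.array([[ 7.0,  8.0],
              [ 9.0, 10.0],
              [11.0, 12.0]])

C = A @ B

# Cell 1,1 is row 1 of A times column 1 of B, summed term by term
cell_1_1 = A[0, 0] * B[0, 0] + A[0, 1] * B[1, 0] + A[0, 2] * B[2, 0]
print(cell_1_1, C[0, 0])          # both 58.0
```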
Matrix Division? What if you could factor a matrix? You Can! Matrix Decompositions: • LU Decomposition • QR Decomposition • Eigen Decomposition • Singular Value Decomposition
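A brief sketch of two of these factorizations using NumPy’s built-in routines (the matrix here is random, just for illustration): factoring and then multiplying back recovers the original.

```python
import numpy as np

A = np.random.default_rng(1).normal(size=(4, 4))

# QR decomposition: A = Q R, with Q orthonormal and R upper triangular
Q, R = np.linalg.qr(A)
print(np.allclose(Q @ R, A))                 # True: the factors multiply back to A

# Singular Value Decomposition: A = U diag(s) V*
U, s, Vt = np.linalg.svd(A)
print(np.allclose(U @ np.diag(s) @ Vt, A))   # True
```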
Why would you Want to? • A 1,000,000 x 1,000,000 matrix has 1,000,000 x 1,000,000 = 1,000,000,000,000 entries. • Two factors of size 1,000,000 x 100 and 100 x 1,000,000 have only (100 x 1,000,000) + (100 x 1,000,000) = 200,000,000 entries. • That’s a MUCH smaller representation!
Factors as Basis for a new Space Suppose C is a matrix of people who have watched movies. Every row represents a person and every column represents a movie. If we can find matrices A and B where A x B approximates C: • Each row of A models a person • The distance between two rows of A models relative similarity • Each column of B models a movie • The distance between two columns of B models relative similarity
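Here is a hedged sketch of that idea with a tiny made-up people x movies matrix; it uses a truncated SVD as one possible way to obtain the factors A and B (the deck does not prescribe a specific method):

```python
import numpy as np

# Toy people-x-movies matrix: rows are people, columns are movies (1 = watched)
C = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

# One way to get the factors: a rank-k truncated SVD, with A = U_k * S_k and B = V_k*
k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
A = U[:, :k] * s[:k]        # each row models a person
B = Vt[:k, :]               # each column models a movie
approx = A @ B              # A x B approximates C

# Nearby rows of A are similar people; nearby columns of B are similar movies
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(A[0], A[1]))   # people 0 and 1 watch overlapping movies -> high
print(cosine(A[0], A[3]))   # people 0 and 3 share nothing -> near zero
print(np.round(approx, 2))
```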
Big Data, Smaller Models • (Diagram of the People x Movies matrix and its much smaller factors; rows are people, columns are movies)
A = U Σ V* • U and V are square orthogonal matrices: their rows and columns are all orthonormal unit vectors. • Σ is a rectangular diagonal matrix with non-negative values decreasing from left to right. • U and V can be viewed as projection matrices, Σ as a scaling matrix. • The earlier columns of U and earlier rows of V* capture most of the “action” of A. • If Σ “decays” quickly enough, most of U and V* is insignificant and can be thrown away without significantly affecting the model.
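A small sketch of those properties, using a made-up matrix that is approximately rank 5 plus noise: the singular values decay sharply, and keeping only the first k columns of U, values of Σ, and rows of V* reproduces A almost exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# A matrix that is approximately low rank: rank-5 structure plus a little noise
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100)) \
    + 0.01 * rng.normal(size=(200, 100))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)       # (200, 100) (100,) (100, 100)
print(s[:8].round(2))                   # the singular values decay sharply after the 5th

# Throw away everything past the first k columns of U, values of Sigma, rows of V*
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # tiny relative error
```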
Using “Cubic” Visualization • Dark grey indicates zero or very small values. (Block diagram: A = U Σ V*)
As columns of U get multiplied by decreasing singular values, the result is smaller column vectors. (Block diagram: A = U Σ V*)