Dimensionality Reduction and Embeddings
SVD: The mathematical formulation
• Normalize the dataset by moving the origin to the center of the dataset
• Find the eigenvectors of the data (or covariance) matrix
• These define the new space
• Sort the eigenvalues in "goodness" order
[Figure: data plotted on the original axes f1, f2 with the new axes e1, e2 overlaid]
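As a concrete illustration, here is a minimal numpy sketch of these steps (centering, eigenvectors via SVD, sorting by "goodness"); the toy dataset and variable names are made up for illustration only.

```python
import numpy as np

# Toy dataset: n points in 2 dimensions (rows are points).
X = np.array([[2.0, 1.9], [1.0, 1.1], [3.0, 3.2], [0.0, 0.1], [4.0, 3.8]])

# 1. Normalize: move the origin to the center of the dataset.
Xc = X - X.mean(axis=0)

# 2. Eigenvectors of the covariance matrix = right singular vectors of Xc.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. The rows of Vt define the new axes (e1, e2, ...),
#    already sorted by "goodness" (singular values in decreasing order).
print("new axes:\n", Vt)
print("singular values (goodness):", S)

# Project onto the top axis e1 for a 1-dimensional representation.
X1 = Xc @ Vt[0]
```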
Compute Approximate SVD efficiently
• Exact SVD is expensive: O(min{n²m, nm²}), so we try to compute it approximately.
• We exploit the fact that, if A = UΛVᵀ, then AAᵀ = UΛ²Uᵀ and AᵀA = VΛ²Vᵀ.
1. Random projection + SVD. Cost: O(mn log n)
2. Random sampling (p rows) and then SVD on the sample. Cost: O(max{mp², p³}) or O(p⁴) (caution: the constants can be high!)
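A rough numpy sketch of option 1 (random projection, then exact SVD of the much smaller matrix); the target rank k, the oversampling amount, and the random input matrix are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def approx_svd(A, k, oversample=10):
    """Approximate rank-k SVD via random projection (sketch, not optimized)."""
    m, n = A.shape
    # Project the columns of A onto a random low-dimensional subspace.
    Omega = np.random.randn(n, k + oversample)
    Y = A @ Omega                        # m x (k+p), captures the range of A
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for that range
    # Exact SVD of the much smaller matrix B = Q^T A.
    B = Q.T @ A
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], S[:k], Vt[:k]

# Example: approximate the top 10 singular triplets of a 1000 x 200 matrix.
A = np.random.randn(1000, 200)
U, S, Vt = approx_svd(A, k=10)
```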
Approximate SVD
• We can guarantee an approximation of the following form:
  ‖A − P‖²F ≤ ‖A − Ak‖²F + ε ‖A‖²F
  (P is the computed approximation, Ak the best rank-k approximation)
SVD Cont’d • Advantages: • Optimal dimensionality reduction (for linear projections) • Disadvantages: • Computationally expensive… but can be improved with random sampling • Sensitive to outliers and non-linearities
Embeddings • Given a metric distance matrix D, embed the objects in a k-dimensional vector space using a mapping F such that • D(i,j) is close to D’(F(i),F(j)) • Isometric mapping: • exact preservation of distance • Contractive mapping: • D’(F(i),F(j)) <= D(i,j) • D’ is some Lp measure
Multi-Dimensional Scaling (MDS)
• Map the items into a k-dimensional space, trying to minimize the stress
• Steepest descent algorithm:
  • Start with an assignment
  • Minimize the stress by moving points
• But the running time is O(N²), and O(N) to add a new item
• Another method: iterative stress majorization
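A toy steepest-descent sketch for the stress minimization described above; the learning rate, iteration count, and random initialization are arbitrary illustrative choices.

```python
import numpy as np

def mds_steepest_descent(D, k=2, steps=500, lr=0.01, seed=0):
    """Crude steepest descent on raw stress: sum_{i<j} (||x_i - x_j|| - D_ij)^2."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, k))                    # start with a random assignment
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]       # (n, n, k) pairwise differences
        dist = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(dist, 1.0)                # avoid division by zero
        coef = (dist - D) / dist                   # (d_ij - D_ij) / d_ij
        np.fill_diagonal(coef, 0.0)
        grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
        X -= lr * grad                             # move points to reduce stress
    return X

# Example: recover a 2-D layout from a small distance matrix.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
X = mds_steepest_descent(D)
```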
FastMap
What if we have a finite metric space (X, d)? Faloutsos and Lin (1995) proposed FastMap as a metric analogue of PCA. Imagine that the points lie in a Euclidean space.
• Select two pivot points xa and xb that are far apart.
• Compute a pseudo-projection of the remaining points along the "line" xa–xb.
• "Project" the points onto an orthogonal subspace and recurse.
FastMap
• We want to find e1 first
[Figure: data on the original axes f1, f2 with the principal directions e1, e2]
Selecting the Pivot Points
The pivot points should lie along the principal axes, and hence should be far apart.
• Select any point x0.
• Let x1 be the point furthest from x0.
• Let x2 be the point furthest from x1.
• Return (x1, x2).
[Figure: x0 with the chosen far-apart pivots x1 and x2]
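The pivot heuristic as a small sketch, assuming a distance function dist(i, j) over item indices (a stand-in for whatever metric d is available):

```python
import numpy as np

def choose_pivots(n, dist, seed=0):
    """FastMap-style pivot heuristic: far-apart points approximate the principal axis."""
    rng = np.random.default_rng(seed)
    x0 = int(rng.integers(n))                         # 1. any starting point
    x1 = max(range(n), key=lambda j: dist(x0, j))     # 2. furthest from x0
    x2 = max(range(n), key=lambda j: dist(x1, j))     # 3. furthest from x1
    return x1, x2
```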
Pseudo-Projections
Given pivots (xa, xb), for any third point y, we use the law of cosines to determine the position cy of y along the line xa–xb:
  cy = (da,y² + da,b² − db,y²) / (2 da,b)
This is the first coordinate of y.
[Figure: triangle with sides da,y, db,y, da,b and the projection cy onto the line xa–xb]
"Project to orthogonal plane"
Given the coordinates cy along xa–xb, we can compute distances within the "orthogonal hyperplane" using the Pythagorean theorem:
  d′(y′, z′)² = d(y, z)² − (cy − cz)²
Using d′(·,·), recurse until k features have been chosen.
[Figure: points y, z and their projections y′, z′ onto the hyperplane orthogonal to xa–xb]
Compute the next coordinate
• Now we have projected all objects into a subspace orthogonal to the first dimension (the line xa–xb)
• We can apply FastMap recursively to the new projected dataset: FastMap(k−1, d′, D)
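Putting the pieces together, a compact sketch of the recursion (pivot selection, law-of-cosines coordinate, Pythagorean distance update) over a plain distance matrix; this is one reading of the algorithm above, not the authors' reference implementation.

```python
import numpy as np

def fastmap(D, k):
    """Embed n objects with distance matrix D into R^k (FastMap sketch)."""
    n = D.shape[0]
    D2 = D.astype(float) ** 2              # work with squared distances
    coords = np.zeros((n, k))
    for dim in range(k):
        # Choose far-apart pivots a, b (one refinement pass of the heuristic).
        a = 0
        b = int(np.argmax(D2[a]))
        a = int(np.argmax(D2[b]))
        if D2[a, b] == 0:                  # all remaining distances are zero
            break
        # Law of cosines: pseudo-projection of every y onto the line a-b.
        c = (D2[a] + D2[a, b] - D2[b]) / (2.0 * np.sqrt(D2[a, b]))
        coords[:, dim] = c
        # Pythagoras: squared distances in the hyperplane orthogonal to a-b.
        D2 = D2 - (c[:, None] - c[None, :]) ** 2
        D2 = np.maximum(D2, 0.0)           # guard against numerical negatives
    return coords

# Example: 2-D points recovered (up to rotation/reflection) from distances alone.
pts = np.random.default_rng(0).normal(size=(6, 2))
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
emb = fastmap(D, k=2)
```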
Embedding using ML • We can try to use some learning techniques to “learn” the best mapping. • Works for general metric spaces, not only “Euclidean spaces” • Vassilis Athitsos, Jonathan Alon, Stan Sclaroff, George Kollios: BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 30(1): 89-104 (2008)
BoostMap Embedding
[Figure: the database objects x1, …, xn in the original space X are mapped by an embedding F into vectors in Rd; a query q is mapped the same way.]
• Measure distances between the embedded vectors (typically much faster than in the original space).
• Caveat: the embedding must preserve the similarity structure.
Reference Object Embeddings
[Figure: the embedding F maps the original space X onto the real line.]
• r: reference object
• Embedding: F(x) = D(x, r), where D is the distance measure in X
• F(r) = D(r, r) = 0
• If a and b are similar, their distances to r are also similar (usually).
F(x) = D(x, Lincoln)
F(Sacramento)    = 1543
F(Las Vegas)     = 1232
F(Oklahoma City) = 437
F(Washington DC) = 1207
F(Jacksonville)  = 1344
F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))
F(Sacramento)    = ( 386, 1543, 2920)
F(Las Vegas)     = ( 262, 1232, 2405)
F(Oklahoma City) = (1345,  437, 1291)
F(Washington DC) = (2657, 1207,  853)
F(Jacksonville)  = (2422, 1344,  141)
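The same construction in code, using the distance table above; the choice of L1 in the embedded space is just one option (picking the right distance is exactly the question raised next).

```python
# Distances from the table above: each city's distance to the three
# reference objects (LA, Lincoln, Orlando). F maps every city to R^3.
D_to_refs = {
    "Sacramento":    (386, 1543, 2920),
    "Las Vegas":     (262, 1232, 2405),
    "Oklahoma City": (1345, 437, 1291),
    "Washington DC": (2657, 1207, 853),
    "Jacksonville":  (2422, 1344, 141),
}

def F(city):
    """Reference-object embedding: F(x) = (D(x, LA), D(x, Lincoln), D(x, Orlando))."""
    return D_to_refs[city]

def l1(u, v):
    """One possible distance in the embedded space (L1)."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Cities that are close in the original space tend to get close embeddings.
print(l1(F("Sacramento"), F("Las Vegas")))       # 950
print(l1(F("Sacramento"), F("Jacksonville")))    # 5014
```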
Basic Questions
F(x) = (D(x, LA), D(x, Denver), D(x, Boston))
• What is a good way to optimize an embedding?
• What are the best reference objects?
• What distance should we use in R3?
Key Idea • Embeddings can be seen as classifiers. • Embedding construction can be seen as a machine learning problem. • Formulation is natural. • We optimize exactly what we want to optimize.
Ideal Embedding Behavior
[Figure: F maps the original space X into Rd; q and its nearest neighbor a map to nearby vectors.]
Notation: NN(q) is the nearest neighbor of q.
For any q: if a = NN(q), we want F(a) = NN(F(q)).
A Quantitative Measure
If b is not the nearest neighbor of q, F(q) should be closer to F(NN(q)) than to F(b).
For how many triples (q, NN(q), b) does F fail?
[Figure: an example embedding in which F fails on five triples.]
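A direct way to compute this count, as a sketch; `dist`, `F`, and `dist_emb` are placeholders for the original-space distance, the embedding, and the vector distance in the embedded space.

```python
def count_failed_triples(objects, queries, dist, F, dist_emb):
    """Count triples (q, NN(q), b) on which the embedding F gets the order wrong."""
    failures = 0
    for q in queries:
        nn = min(objects, key=lambda x: dist(q, x))   # NN(q) in the original space
        for b in objects:
            if b is nn:
                continue
            # F fails if F(q) is closer to F(b) than to F(NN(q)).
            if dist_emb(F(q), F(b)) < dist_emb(F(q), F(nn)):
                failures += 1
    return failures
```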
Embeddings Seen As Classifiers
Classification task: is q closer to a or to b?
• Any embedding F defines a classifier F′(q, a, b).
• F′ checks whether F(q) is closer to F(a) or to F(b).
Classifier Definition
Classification task: is q closer to a or to b?
• Given an embedding F: X → Rd:
  • F′(q, a, b) = ‖F(q) − F(b)‖ − ‖F(q) − F(a)‖
  • F′(q, a, b) > 0 means "q is closer to a."
  • F′(q, a, b) < 0 means "q is closer to b."
Goal: build an F such that F′ has a low error rate on triples of the form (q, NN(q), b).
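In code, the induced classifier is essentially a one-liner; `F` and the vector norm below are placeholders for whatever embedding and embedded-space distance are in use.

```python
import numpy as np

def F_prime(F, q, a, b, norm=lambda u, v: np.linalg.norm(u - v)):
    """F'(q, a, b) = ||F(q) - F(b)|| - ||F(q) - F(a)||.
    Positive => F says q is closer to a; negative => closer to b."""
    return norm(F(q), F(b)) - norm(F(q), F(a))
```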
1D Embeddings as Weak Classifiers
• 1D embeddings define weak classifiers.
  • Better than a random classifier (50% error rate).
• We can define lots of different classifiers.
  • Every object in the database can be a reference object.
Question: how do we combine many such classifiers into a single strong classifier?
Answer: use AdaBoost, a boosting-based machine learning method designed for exactly this problem.
Using AdaBoost
[Figure: many 1D embeddings F1, F2, …, Fn map the original space X to the real line.]
• Output: H = w1F′1 + w2F′2 + … + wdF′d.
• AdaBoost chooses the 1D embeddings and weighs them.
• Goal: achieve low classification error.
• AdaBoost trains on triples chosen from the database.
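A heavily simplified sketch of the training loop, where each candidate 1D embedding F_r(x) = dist(x, r) acts as a weak classifier over training triples and boosting picks one per round with a weight; real BoostMap training involves considerably more machinery than this toy version.

```python
import numpy as np

def boostmap_train(triples, refs, dist, rounds=10):
    """Toy AdaBoost loop over 1D reference-object embeddings F_r(x) = dist(x, r).

    triples: list of (q, a, b) with a = NN(q), so the correct label is always +1.
    refs:    candidate reference objects (hashable ids).
    Returns the chosen reference objects and their weights w_1..w_d.
    """
    n = len(triples)
    w = np.full(n, 1.0 / n)                        # weights over training triples

    def weak_pred(r, q, a, b):
        fq, fa, fb = dist(q, r), dist(a, r), dist(b, r)
        return 1.0 if abs(fq - fb) - abs(fq - fa) > 0 else -1.0

    # Precompute the sign of F'_r on every triple for every candidate r.
    preds = {r: np.array([weak_pred(r, *t) for t in triples]) for r in refs}
    chosen, alphas = [], []
    for _ in range(rounds):
        # Pick the reference object with the lowest weighted error.
        errors = {r: float(np.sum(w[preds[r] < 0])) for r in refs}
        r_best = min(errors, key=errors.get)
        err = float(np.clip(errors[r_best], 1e-6, 1 - 1e-6))
        alpha = 0.5 * np.log((1 - err) / err)      # AdaBoost weight
        chosen.append(r_best)
        alphas.append(alpha)
        # Re-weight triples: mistakes get more weight in the next round.
        w *= np.exp(-alpha * preds[r_best])
        w /= w.sum()
    return chosen, alphas
```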
From Classifier to Embedding
• AdaBoost output: H = w1F′1 + w2F′2 + … + wdF′d
• What embedding should we use? What distance measure should we use?
• BoostMap embedding: F(x) = (F1(x), …, Fd(x))
• Distance measure: D((u1, …, ud), (v1, …, vd)) = Σi=1..d wi |ui − vi|
Claim: Let q be closer to a than to b. H misclassifies the triple (q, a, b) if and only if, under the distance measure D, F maps q closer to b than to a.
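A quick numeric sanity check of the claim, with the weighted L1 distance written out; the embedded points and weights below are made up solely to exercise the equality H(q, a, b) = F′(q, a, b).

```python
import numpy as np

def weighted_l1(w, u, v):
    """D(u, v) = sum_i w_i * |u_i - v_i| -- the distance paired with the BoostMap embedding."""
    return np.sum(w * np.abs(u - v))

# Made-up embedded points and weights, just to check the sign agreement:
w = np.array([0.7, 0.2, 1.1])
Fq, Fa, Fb = np.array([0.1, 2.0, 3.0]), np.array([0.0, 1.5, 3.2]), np.array([2.0, 2.0, 0.5])

H = np.sum(w * (np.abs(Fq - Fb) - np.abs(Fq - Fa)))     # classifier output
Fp = weighted_l1(w, Fq, Fb) - weighted_l1(w, Fq, Fa)    # embedding-side quantity
assert np.isclose(H, Fp)   # H misclassifies exactly when D maps q closer to b
```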
Proof
H(q, a, b)
  = Σi=1..d wi F′i(q, a, b)
  = Σi=1..d wi (|Fi(q) − Fi(b)| − |Fi(q) − Fi(a)|)
  = Σi=1..d wi |Fi(q) − Fi(b)| − Σi=1..d wi |Fi(q) − Fi(a)|
  = D(F(q), F(b)) − D(F(q), F(a))
  = F′(q, a, b)
Significance of Proof • AdaBoost optimizes a direct measure of embedding quality. • We have converted a database indexing problem into a machine learning problem.