Efficient classification for metric data
Lee-Ad Gottlieb (Hebrew U.), Aryeh Kontorovich (Ben-Gurion U.), Robert Krauthgamer (Weizmann Institute)
Classification problem
• A fundamental problem in learning: a point space X and a probability distribution P on X × {-1,1}
• The learner observes a sample S of n points (x,y) drawn i.i.d. from P, and wants to predict the labels of other points in X
• It produces a hypothesis h: X → {-1,1} with empirical error (the fraction of sample points it mislabels) and true error P{(x,y) : h(x) ≠ y}
• Goal: the true error should be close to the empirical error, uniformly over h, in probability
Generalization bounds
• How do we upper bound the true error? Use a generalization bound. Roughly speaking (and w.h.p.): true error ≤ empirical error + (complexity of h)/n
• More complex classifier ↔ "easier" to fit to arbitrary data
• VC-dimension: the size of the largest point set that can be shattered by the hypothesis class
A standard form of such a bound is sketched below.
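For orientation, here is a classical VC bound in the realizable case, in the same complexity-over-n shape as the rough statement above. This is a textbook form, not the specific bound used later in this talk: if h is consistent with the sample and is drawn from a hypothesis class of VC-dimension d, then with probability at least 1 − δ,

```latex
\operatorname{err}(h) \;\le\; O\!\left(\frac{d\,\log(n/d) + \log(1/\delta)}{n}\right)
```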
Popular approach for classification
• Assume the points lie in Euclidean space!
• Pros: existence of an inner product; efficient algorithms (SVM); good generalization bounds (max margin)
• Cons: many natural settings are non-Euclidean; Euclidean structure is a strong assumption
• Recent popular focus: metric space data
Metric space
• (X,d) is a metric space if X is a set of points and d(·,·) is a distance function that is nonnegative, symmetric, and satisfies the triangle inequality
• Inner product → norm, and norm → metric, but the reverse implications do not hold
(Figure: map with Haifa, Tel Aviv and Be'er Sheva and pairwise distances 95 km, 113 km and 208 km, illustrating the triangle inequality.)
Classification for metric data?
• Advantage: often much more natural, and a much weaker assumption
• Examples: strings; images (earthmover distance)
• Problem: no vector representation, so no notion of dot product (and no kernel)
• What to do?
• Invent a kernel (e.g. embed into Euclidean space)? Possibly high distortion!
• Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension!
Preliminaries: Lipschitz constant
• The Lipschitz constant L of a function f: X → ℝ measures its smoothness
• It is the smallest value L satisfying |f(x_i) − f(x_j)| ≤ L · d(x_i, x_j) for all points x_i, x_j in X; denoted ‖f‖_Lip
• Suppose hypothesis h: S → {-1,1} is consistent with the sample S
• The Lipschitz constant of h is determined by the closest pair of differently labeled points: equivalently, ‖h‖_Lip ≥ 2/d(S+, S−)
(A small sketch computing this quantity follows below.)
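As a concrete illustration (not code from the paper), a minimal sketch that computes the Lipschitz constant of a labeled sample by brute force over pairs; for a {-1,+1}-labeled consistent sample this is exactly 2/d(S+, S−). The point set and the `lipschitz_constant` helper are hypothetical examples.

```python
import numpy as np

def lipschitz_constant(points, values, metric):
    """Smallest L with |f(x_i) - f(x_j)| <= L * d(x_i, x_j) over all sample pairs."""
    L = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = metric(points[i], points[j])
            if d > 0:
                L = max(L, abs(values[i] - values[j]) / d)
    return L

euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]   # toy sample
y = [-1, -1, +1]
# For labels in {-1,+1} this equals 2 / d(S+, S-) = 2 / 2 = 1.0
print(lipschitz_constant(X, y, euclidean))
```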
Preliminaries: Lipschitz extension
• Lipschitz extension: a classic problem in analysis. Given a function f: S → ℝ for S ⊆ X, extend f to all of X without increasing the Lipschitz constant
• Example: points on the real line, f(1) = 1, f(−1) = −1
(figure credit: A. Oberman)
Classification for metric data
• A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04)
• Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions
• Estimation of h on X: the problem of evaluating h for new points in X reduces to finding a Lipschitz function consistent with h (the Lipschitz extension problem)
• For example, f(x) = min_i [f(x_i) + 2·d(x, x_i)/d(S+, S−)] over all (x_i, y_i) in S (a sketch of this evaluation follows below)
• Evaluation of h reduces to exact nearest neighbor search
• Strong theoretical motivation for the NNS classification heuristic
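A minimal sketch of evaluating the extension above at a new point, following the formula on this slide (an illustrative implementation, not the authors' code; the `lipschitz_extension` name and the toy points are made up here).

```python
import numpy as np

def lipschitz_extension(query, sample_points, labels, metric):
    """f(x) = min_i [ y_i + 2 * d(x, x_i) / d(S+, S-) ]; classify by sign(f(x))."""
    pos = [p for p, y in zip(sample_points, labels) if y > 0]
    neg = [p for p, y in zip(sample_points, labels) if y < 0]
    margin = min(metric(p, q) for p in pos for q in neg)      # d(S+, S-)
    return min(y + 2.0 * metric(query, p) / margin
               for p, y in zip(sample_points, labels))

euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
y = [-1, -1, +1]
print(np.sign(lipschitz_extension((2.8, 0.0), X, y, euclidean)))  # +1: query is near the positive point
```

Note that the minimization over same-labeled points is always achieved by the nearest one, which is why evaluating the extension reduces to nearest neighbor search.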
Two new directions
• The framework of [vLB '04] leaves open two further questions:
• Constructing h: handling noise (bias-variance tradeoff). Which sample points in S should h ignore?
• Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?
Doubling dimension
• Definition: ball B(x,r) = all points within distance r from x
• The doubling constant λ (of a metric M) is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius
• First used by [Assouad '83], algorithmically by [Clarkson '97]
• The doubling dimension is ddim(M) = log₂ λ(M); a metric is doubling if its doubling dimension is constant
• Euclidean: ddim(ℝ^d) = O(d)
• Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points
(In the figure, λ ≥ 7. A rough numerical sketch of the doubling constant follows below.)
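To make the definition concrete, a rough numerical sketch (purely illustrative, with made-up helper names) that estimates the doubling constant of a finite Euclidean point set by greedily covering sampled balls with half-radius balls centered at data points; a greedy cover only upper-bounds the optimal one, so this is a heuristic, not an exact computation.

```python
import numpy as np

def doubling_constant_estimate(points, trials=200, seed=0):
    """Sample balls B(x, r) and greedily cover each with balls of radius r/2
    centered at data points; return the largest number of half-radius balls used."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    worst = 1
    for _ in range(trials):
        i = rng.integers(len(pts))
        r = rng.uniform(0.1, 1.0) * dist[i].max()
        uncovered = set(np.flatnonzero(dist[i] <= r).tolist())   # points of B(x, r)
        count = 0
        while uncovered:                     # greedy cover by half-radius balls
            c = uncovered.pop()
            uncovered -= set(np.flatnonzero(dist[c] <= r / 2.0).tolist())
            count += 1
        worst = max(worst, count)
    return worst

grid = [(x, y) for x in range(8) for y in range(8)]   # a 2-D grid
print(doubling_constant_estimate(grid), "-> ddim estimate is log2 of this")
```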
Applications of doubling dimension
• Major application to databases: recall that exact NNS requires Θ(n) time in an arbitrary metric space, but there exists a linear-size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n
• Database/network structures and tasks analyzed via the doubling dimension:
• Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
• Image recognition (vision) [KG --]
• Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
• Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
• Clustering [Tal '04, ABS '08, FM '10]
• Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
• Further applications: travelling salesperson [Tal '04], embeddings [Ass '84, ABN '08, BRS '07, GK '11], machine learning [BLL '09, KKL '10, KKL --]
• Note: the above algorithms can be extended to nearly-doubling spaces [GK '10]
• Message: this is an active line of research…
Our dual use of doubling dimension
• Interestingly, the doubling dimension contributes in two different areas:
• Statistical (function complexity): we bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h
• Computational: efficient approximate NNS
Statistical contribution
• We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension
• vLB provided similar bounds using covering numbers and Rademacher averages
• Fat-shattering analysis: if L-Lipschitz functions shatter a set, then its inter-point distance is at least 2/L
• Packing property → the set has at most (diam·L)^O(ddim) points
• This is the fat-shattering dimension of the classifier on the space, and is a good measure of its complexity
Statistical contribution
• [BST '99]: For any f that classifies a sample of size n correctly, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log₂(578n) + log(4/δ))
• Likewise, if f is correct on all but k examples, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^{1/2}
• In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam·L)^ddim + 1 (a small numerical sketch follows below)
• Done with the statistical contribution… on to the computational contribution
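A minimal numerical sketch of the second (k-error) bound above, with the fat-shattering dimension plugged in as d ≤ (diam·L)^ddim + 1; the helper name and the sample parameters are made up for illustration.

```python
import math

def generalization_bound(n, k, diam, L, ddim, delta=0.05):
    """Agnostic form of the [BST '99] bound quoted above:
    true error <= k/n + sqrt( (2/n) * (d*ln(34*e*n/d)*log2(578*n) + ln(4/delta)) ),
    with the fat-shattering dimension bounded as d <= (diam*L)^ddim + 1."""
    d = (diam * L) ** ddim + 1
    slack = math.sqrt((2.0 / n) *
                      (d * math.log(34 * math.e * n / d) * math.log2(578 * n)
                       + math.log(4.0 / delta)))
    return k / n + slack

# Smoother classifiers (smaller L) give a smaller complexity term:
for L in (0.5, 1.0, 2.0):
    print(L, round(generalization_bound(n=10_000, k=100, diam=1.0, L=L, ddim=3), 3))
```

This is the knob behind the bias-variance tradeoff discussed below: decreasing L shrinks the complexity term but typically forces more sample errors k.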
Computational contribution
• Evaluation of h for new points in X: the Lipschitz extension function f(x) = min_i [y_i + 2·d(x, x_i)/d(S+, S−)] requires exact nearest neighbor search, which can be expensive!
• New tool: (1+ε)-approximate nearest neighbor search, in 2^O(ddim) log n + ε^−O(ddim) time [KL '04, HM '06, BKL '06, CG '06]
• If we evaluate f(x) using an approximate NNS, we can show that the result agrees with (the sign of) at least one of g(x) = (1+ε)·f(x) + ε and e(x) = (1+ε)·f(x) − ε
• Note that g(x) ≥ f(x) ≥ e(x)
• g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximate evaluation, generalize well
(A numeric sketch of the sandwich follows below.)
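A small numeric check of the sandwich claim, under the ε-reconstruction above (the dropped symbols are read as ε): the approximate nearest neighbor search is simulated by inflating exact distances by a factor of at most (1+ε), and the resulting value always lies between f(x) and g(x). This is an illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def extension_value(d_pos, d_neg, L):
    # f(x) = min( +1 + L*d(x, S+),  -1 + L*d(x, S-) )
    return min(1.0 + L * d_pos, -1.0 + L * d_neg)

S_pos = [(3.0, 0.0), (4.0, 1.0)]
S_neg = [(0.0, 0.0), (1.0, 0.0)]
L = 2.0 / min(euclid(p, q) for p in S_pos for q in S_neg)   # 2 / d(S+, S-)
eps = 0.1

for _ in range(1000):
    x = rng.uniform(-1.0, 5.0, size=2)
    d_pos = min(euclid(x, p) for p in S_pos)
    d_neg = min(euclid(x, q) for q in S_neg)
    f = extension_value(d_pos, d_neg, L)
    # Simulate a (1+eps)-approximate NNS: each distance may be overestimated
    # by a factor of at most (1+eps).
    f_tilde = extension_value(d_pos * rng.uniform(1.0, 1.0 + eps),
                              d_neg * rng.uniform(1.0, 1.0 + eps), L)
    g = (1.0 + eps) * f + eps          # the upper sandwich function from the slide
    assert f - 1e-9 <= f_tilde <= g + 1e-9
print("approximate evaluation stays within [f(x), g(x)]")
```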
Final problem: bias-variance tradeoff
• Which sample points in S should h ignore?
• If f is correct on all but k examples, we have with probability at least 1−δ:
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^{1/2}
where d ≤ (diam·L)^ddim + 1
Structural Risk Minimization
• Algorithm: fix a target Lipschitz constant L (O(n²) possibilities)
• Locate all pairs of points from S+ and S− whose distance is less than 2/L: at least one point of each such pair has to be counted as an error
• Goal: remove as few points as possible
• This is minimum vertex cover: NP-complete in general, but it admits a 2-approximation in O(E) time
• Here it is minimum vertex cover on a bipartite graph: equivalent to maximum matching (König's theorem), which admits an exact solution in O(n^2.376) randomized time [MS '04]
(A small matching-based sketch follows below.)
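A minimal sketch of the exact step via bipartite matching (using Kuhn's augmenting-path algorithm rather than the O(n^2.376) algebraic algorithm cited above); by König's theorem the maximum matching size equals the minimum number of points that must be discarded. The function name and the toy points are made up.

```python
import numpy as np

def min_points_to_remove(pos, neg, metric, L):
    """Minimum number of sample points to discard so that no remaining
    oppositely-labeled pair is closer than 2/L.  The conflicting pairs form
    a bipartite graph; by Konig's theorem, min vertex cover = max matching,
    computed here with Kuhn's augmenting-path algorithm."""
    conflicts = [[j for j, q in enumerate(neg) if metric(p, q) < 2.0 / L]
                 for p in pos]
    match_of_neg = {}                          # neg index -> matched pos index

    def try_augment(i, seen):
        for j in conflicts[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_of_neg or try_augment(match_of_neg[j], seen):
                match_of_neg[j] = i
                return True
        return False

    return sum(try_augment(i, set()) for i in range(len(pos)))

euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
pos = [(0.0, 0.0), (2.0, 0.0)]
neg = [(0.5, 0.0), (5.0, 0.0)]
# Both positive points conflict only with the negative point at (0.5, 0),
# so removing that single point suffices.
print(min_points_to_remove(pos, neg, euclid, L=1.0))  # -> 1
```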
Efficient SRM
• Algorithm: for each of the O(n²) values of L, run the matching algorithm to find the minimum error, then evaluate the generalization bound for this value of L
• O(n^4.376) randomized time in total
• Better algorithm: binary search over the O(n²) values of L; for each value, run the greedy 2-approximation and evaluate the approximate generalization bound
• This yields an approximate minimum error in O(n² log n) time
(A sketch of this selection loop follows below.)
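A sketch of the overall selection loop, reusing generalization_bound from the earlier sketch; for simplicity it scans all candidate values of L rather than binary-searching, and uses the greedy 2-approximation mentioned above. All names and parameters here are illustrative.

```python
import numpy as np

def greedy_conflict_cover(pos, neg, metric, L):
    """Greedy 2-approximation to minimum vertex cover: repeatedly take a
    conflicting pair (distance < 2/L) and discard both of its endpoints."""
    removed_pos, removed_neg = set(), set()
    for i, p in enumerate(pos):
        if i in removed_pos:
            continue
        for j, q in enumerate(neg):
            if j in removed_neg:
                continue
            if metric(p, q) < 2.0 / L:
                removed_pos.add(i)
                removed_neg.add(j)
                break
    return len(removed_pos) + len(removed_neg)

def select_lipschitz_constant(pos, neg, metric, n, diam, ddim, delta=0.05):
    """For each candidate L (one per oppositely-labeled pair), approximate the
    number of errors k and evaluate the generalization bound; return the best.
    (generalization_bound is the helper defined in the earlier sketch.)"""
    candidates = sorted({2.0 / metric(p, q) for p in pos for q in neg})
    best = None
    for L in candidates:
        if (diam * L) ** ddim + 1 > n:   # skip values where the bound is vacuous
            continue
        k = greedy_conflict_cover(pos, neg, metric, L)
        bound = generalization_bound(n=n, k=k, diam=diam, L=L, ddim=ddim, delta=delta)
        if best is None or bound < best[0]:
            best = (bound, L, k)
    return best   # (bound value, chosen L, number of ignored points)
```

The slide's binary search over the sorted candidate list replaces the full scan here, bringing the total cost down to O(n² log n).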
Conclusion
• Results:
• Generalization bounds for Lipschitz classifiers in doubling spaces
• Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
• Efficient structural risk minimization
• Continuing research: continuous labels
• Risk bound via the doubling dimension
• Classifier h determined via an LP
• Faster LP: low-hop low-stretch spanners [GR '08a, GR '08b] → fewer constraints, and each variable appears in a bounded number of constraints
Application: earthmover distance
(Figure: two point sets S and T.)