Efficient classification for metric data
Lee-Ad Gottlieb, Weizmann Institute
Aryeh Kontorovich, Ben Gurion U.
Robert Krauthgamer, Weizmann Institute
Classification problem
• Probabilistic concept learning
• S is a set of n examples (x,y) drawn from X × {−1,+1} according to some unknown probability distribution P.
• The learner produces a hypothesis h: X → {−1,+1}.
• A good hypothesis (classifier) minimizes the generalization error P{(x,y): h(x) ≠ y}.
• A popular solution uses kernels: data are represented as vectors, and kernels take dot-products of vectors.
Finite metric space
• (X,d) is a metric space if
  • X = set of points
  • d = distance function that is nonnegative, symmetric, and satisfies the triangle inequality
• (Figure: the road distances between Tel-Aviv, Jerusalem and Haifa, roughly 62 km, 95 km and 151 km, form a small metric space; a toy check appears below.)
• Classification for metric data?
• Problem: no vector representation → no notion of dot-product → can't use kernels.
• What can be done in this setting?
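A minimal sketch verifying the metric axioms on the three-city example from the slide's figure. Which distance belongs to which pair is an assumption; the figure only lists 95, 62 and 151 km.

```python
# Toy check of the metric axioms on the three-city example (pair assignment is assumed).
from itertools import permutations

points = ["Tel-Aviv", "Jerusalem", "Haifa"]
d = {
    ("Tel-Aviv", "Jerusalem"): 62,
    ("Tel-Aviv", "Haifa"): 95,
    ("Jerusalem", "Haifa"): 151,
}
# Fill in symmetry and zero self-distance.
for (a, b), dist in list(d.items()):
    d[(b, a)] = dist
for p in points:
    d[(p, p)] = 0

# Nonnegativity, symmetry, triangle inequality.
assert all(v >= 0 for v in d.values())
assert all(d[(a, b)] == d[(b, a)] for a in points for b in points)
for a, b, c in permutations(points, 3):
    assert d[(a, c)] <= d[(a, b)] + d[(b, c)], (a, b, c)
print("(X,d) satisfies the metric axioms")
```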
Preliminary definition
• The Lipschitz constant L of a function f: X → R is the smallest value satisfying, for all points xi, xj in X,
  L ≥ |f(xi) − f(xj)| / d(xi, xj).
• Consider a hypothesis consistent with all of S (labels in {−1,+1}).
• Its Lipschitz constant is determined by the closest pair of differently labeled points:
  L ≥ 2 / d(xi, xj) for all xi in S−, xj in S+.
• (A sketch of this computation follows below.)
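A minimal sketch of that computation: the smallest Lipschitz constant of any classifier consistent with S is 2 divided by the closest opposite-label distance. The metric in the toy usage (absolute difference on the real line) is illustrative only.

```python
def consistent_lipschitz_constant(S_plus, S_minus, d):
    """Smallest L such that some f with f = +1 on S_plus and f = -1 on S_minus is L-Lipschitz:
    L = 2 / (minimum distance between differently labeled points)."""
    closest = min(d(x, y) for x in S_plus for y in S_minus)
    return 2.0 / closest

# Toy usage with points on the real line and d(x, y) = |x - y| (illustrative only).
S_plus, S_minus = [0.0, 0.2, 1.0], [2.0, 3.5]
print(consistent_lipschitz_constant(S_plus, S_minus, lambda x, y: abs(x - y)))  # 2 / 1.0 = 2.0
```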
Classification for metric data
• A powerful framework for this problem was introduced by von Luxburg & Bousquet (vLB, JMLR '04).
• The natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
• Given the labeled sample, evaluating the classifier h at new points of X reduces to finding a Lipschitz function consistent with the sample labels:
  • the Lipschitz extension problem, a classic problem in analysis.
• For example: f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S−) ], the minimum taken over all x_i in S.
• Function evaluation reduces to exact nearest neighbor search (assuming zero training error).
• Strong theoretical motivation for the NNS classification heuristic. (A sketch of this evaluation appears below.)
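A minimal sketch of evaluating this extension, assuming zero training error; the data, metric, and point values in the usage are illustrative stand-ins, not from the paper.

```python
def lipschitz_extension_factory(S, d):
    """S: list of (point, label) with labels in {-1, +1}; d: metric.
    Returns f(x) = min_i [ y_i + 2 * d(x, x_i) / margin ], where margin = d(S+, S-)."""
    plus = [x for x, y in S if y == +1]
    minus = [x for x, y in S if y == -1]
    margin = min(d(p, q) for p in plus for q in minus)

    def f(x):
        return min(y + 2.0 * d(x, xi) / margin for xi, y in S)

    return f

# Toy usage on the real line with d(x, y) = |x - y| (illustrative only).
S = [(0.0, -1), (1.0, -1), (3.0, +1), (4.0, +1)]
f = lipschitz_extension_factory(S, lambda a, b: abs(a - b))
for x in (0.5, 1.9, 2.1, 3.5):
    print(x, f(x), "label:", +1 if f(x) >= 0 else -1)
```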
Two new directions
• The framework of vLB leaves open two further questions:
• Efficient evaluation of the classifier h on X
  • In an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?
• Bias-variance tradeoff
  • Which sample points in S should h ignore?
Doubling dimension
• Definition: Ball B(x,r) = all points within distance r from x.
• The doubling constant (of a metric M) is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius.
  • First used by [Ass-83], algorithmically by [Cla-97].
  • (Figure: in the plane, λ ≤ 7 half-radius balls suffice.)
• The doubling dimension is dim(M) = log₂ λ(M) [GKL-03].
• A metric is doubling if its doubling dimension is constant.
• Packing property of doubling spaces:
  • A set with diameter D and minimum inter-point distance a contains at most (D/a)^O(log λ) points. (A sketch estimating λ on a finite sample follows below.)
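A rough sketch of estimating the doubling constant of a finite sample by greedily covering each ball with half-radius balls centered at sample points. This is a heuristic proxy for λ, not the exact constant of the underlying metric; the grid data and Euclidean metric are illustrative assumptions.

```python
import math

def estimate_doubling_constant(points, d, radii):
    """Heuristic estimate of the doubling constant of a finite set: for each center x and
    radius r, greedily pick centers in B(x, r) at pairwise distance > r/2; the half-radius
    balls around them cover B(x, r), and the largest count needed estimates lambda."""
    lam = 1
    for r in radii:
        for x in points:
            ball = [p for p in points if d(x, p) <= r]
            centers = []
            for p in ball:  # greedy cover of the ball by half-radius balls
                if all(d(p, c) > r / 2 for c in centers):
                    centers.append(p)
            lam = max(lam, len(centers))
    return lam

# Toy usage: an 8x8 grid in the plane under the Euclidean metric (illustrative only).
pts = [(i, j) for i in range(8) for j in range(8)]
euclid = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
print(estimate_doubling_constant(pts, euclid, radii=[2.0, 4.0]))
```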
Application I
• We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
• vLB provided similar bounds using covering numbers and Rademacher averages.
• Fat-shattering analysis (spelled out below):
  • An L-Lipschitz function shatters a set → its inter-point distance is at least 2/L.
  • Packing property → the set has at most (DL)^O(log λ) points.
  • So the fat-shattering dimension is low.
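A short derivation spelling out this chain, treating the ±1 labels as margin-1 values; the constants hidden in the exponent are indicative only.

```latex
% If an L-Lipschitz f shatters x_1, ..., x_k, then any two points assigned opposite
% signs satisfy |f(x_i) - f(x_j)| >= 2, hence
\[
d(x_i, x_j) \;\ge\; \frac{|f(x_i) - f(x_j)|}{L} \;\ge\; \frac{2}{L}.
\]
% A shattered set is therefore 2/L-separated; with diameter D and doubling constant
% lambda, the packing property bounds its size:
\[
k \;\le\; \left(\frac{D}{2/L}\right)^{O(\log \lambda)} \;=\; (DL)^{O(\log \lambda)}.
\]
```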
Application I
• Theorem:
  • For any f that classifies a sample of size n correctly, we have with probability at least 1 − δ:
    P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log(578n) + log(4/δ)).
  • Likewise, if f is correct on all but k examples, we have with probability at least 1 − δ:
    P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^(1/2).
  • In both cases, d ≤ ⌈8LD⌉^(log λ + 1).
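A small sketch that plugs numbers into the second bound exactly as stated above (natural log for ln, base-2 log for log₂); the sample values n, k, d, δ in the usage are arbitrary toy numbers.

```python
import math

def generalization_bound(n, k, d, delta):
    """Bound from the slide: k/n + sqrt((2/n) * (d*ln(34en/d)*log2(578n) + ln(4/delta)))."""
    inner = d * math.log(34 * math.e * n / d) * math.log2(578 * n) + math.log(4 / delta)
    return k / n + math.sqrt(2.0 / n * inner)

# Toy usage: n = 1000 samples, k = 20 sample errors, fat-shattering dimension d = 15, delta = 0.05.
print(round(generalization_bound(n=1000, k=20, d=15, delta=0.05), 3))
```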
Application II
• Evaluation of h for new points in X
• Lipschitz extension function: f(x) = min_i [ y_i + 2 d(x, x_i) / d(S+, S−) ]
  • Requires exact nearest neighbor search, which can be expensive!
• New tool: (1+ε)-approximate nearest neighbor search
  • λ^O(1) log n + ε^−O(log λ) query time [KL-04, HM-05, BKL-06, CG-06]
• If we evaluate f(x) using approximate NNS, we can show that the result agrees with (the sign of) at least one of
  • g(x) = (1+ε) f(x) + ε
  • h(x) = (1+ε) f(x) − ε
• Note that g(x) ≥ f(x) ≥ h(x).
• g(x) and h(x) have Lipschitz constant (1+ε)L, so they, and hence the approximately evaluated classifier, generalize well. (A small numerical check of the sandwich appears below.)
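A small numerical check under an assumption made explicit in the comments: the approximate evaluation is simulated by overestimating each distance by a factor of at most (1+ε), one way an approximate-NNS-based evaluation can err. With that assumption the approximate value is sandwiched between f(x) and g(x); the data and metric are illustrative only.

```python
import random

def f_exact(x, S, d, margin):
    return min(y + 2.0 * d(x, xi) / margin for xi, y in S)

def f_approx(x, S, d, margin, eps, rng):
    # Simulated approximate evaluation: each distance is inflated by a factor <= (1+eps),
    # standing in for the error introduced by (1+eps)-approximate nearest neighbor search.
    return min(y + 2.0 * rng.uniform(1.0, 1.0 + eps) * d(x, xi) / margin for xi, y in S)

rng, eps = random.Random(0), 0.1
S = [(0.0, -1), (1.0, -1), (3.0, +1), (4.0, +1)]
d = lambda a, b: abs(a - b)
margin = min(d(p, q) for p, yp in S if yp == +1 for q, yq in S if yq == -1)

for x in [rng.uniform(-1, 5) for _ in range(5)]:
    f, ft = f_exact(x, S, d, margin), f_approx(x, S, d, margin, eps, rng)
    g = (1 + eps) * f + eps          # g(x) >= f(x), the upper function from the slide
    assert f - 1e-12 <= ft <= g + 1e-12
    print(round(x, 2), round(f, 3), round(ft, 3), round(g, 3))
```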
Bias-variance tradeoff
• Which sample points in S should h ignore?
• If f is correct on all but k examples, we have with probability at least 1 − δ:
  P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^(1/2),
  where d ≤ ⌈8LD⌉^(log λ + 1).
Bias-variance tradeoff
• Algorithm (for a single value of L; a sketch follows below):
  • Fix a target Lipschitz constant L, out of O(n²) possibilities.
  • Locate all pairs of points from S+ and S− whose distance is less than 2/L.
  • From each such pair, at least one point has to be taken as an error.
  • Goal: remove as few points as possible.
• This is minimum vertex cover:
  • NP-complete; admits a 2-approximation in O(|E|) time.
• Minimum vertex cover on a bipartite graph:
  • Equivalent to maximum matching (König's theorem).
  • Admits an exact solution in O(n^2.376) randomized time.
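A minimal self-contained sketch for a single value of L. By König's theorem the minimum vertex cover of the bipartite conflict graph has the same size as a maximum matching, so the minimum number of errors k can be computed exactly; the O(n^2.376) algebraic matching algorithm mentioned on the slide is replaced here by simple augmenting paths for clarity.

```python
def min_errors_for_L(S_plus, S_minus, d, L):
    """Minimum number of sample points to discard so that no pair x in S+, y in S-
    has d(x, y) < 2/L.  Equals the maximum matching size in the bipartite conflict
    graph (Konig's theorem); computed here with Kuhn's augmenting-path algorithm."""
    conflicts = [[j for j, q in enumerate(S_minus) if d(p, q) < 2.0 / L]
                 for p in S_plus]
    match = [-1] * len(S_minus)        # match[j] = index in S_plus matched to j, or -1

    def augment(i, seen):
        for j in conflicts[i]:
            if j in seen:
                continue
            seen.add(j)
            if match[j] == -1 or augment(match[j], seen):
                match[j] = i
                return True
        return False

    return sum(augment(i, set()) for i in range(len(S_plus)))

# Toy usage on the real line (illustrative only): with L = 1, pairs closer than 2 conflict.
S_plus, S_minus = [0.0, 5.0, 6.0], [1.5, 9.0]
print(min_errors_for_L(S_plus, S_minus, lambda a, b: abs(a - b), L=1.0))  # one conflict -> 1
```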
Bias-variance tradeoff
• Naive algorithm:
  • For each of the O(n²) candidate values of L:
    • Run the matching algorithm to find the minimum error.
    • Evaluate the generalization bound for this value of L.
  • O(n^4.376) randomized time in total.
• Better algorithm: binary search over the O(n²) candidate values of L. For each value considered, either:
  • Run the matching algorithm: find the minimum error, in O(n^2.376 log n) randomized time overall, and evaluate the generalization bound for this value of L; or
  • Run the greedy 2-approximation: approximate the minimum error, in O(n² log n) time overall, and evaluate an approximate generalization bound for this value of L.
• (A sketch of the exhaustive variant appears below.)
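A sketch of the exhaustive variant, reusing `min_errors_for_L` and `generalization_bound` from the earlier sketches. The candidate values of L are the O(n²) values 2/d(x, y) over opposite-label pairs; the fat-shattering estimate d(L) = ⌈8LD⌉^(log₂ λ + 1) is plugged in as stated on the slide, with λ supplied by the caller. The binary-search speedup is omitted, and with the tiny toy sample every bound is vacuous (capped at 1), so the usage only exercises the control flow.

```python
import math

def select_L_by_bound(S_plus, S_minus, d, lam, delta=0.05):
    """For every candidate L = 2/d(x, y) over opposite-label pairs, compute the minimum
    error k(L) by matching and plug k and the fat-shattering estimate into the bound;
    return the (bound, L, k) triple with the smallest bound."""
    n = len(S_plus) + len(S_minus)
    pts = S_plus + S_minus
    D = max(d(p, q) for p in pts for q in pts)                  # sample diameter
    rows = []
    for L in sorted({2.0 / d(p, q) for p in S_plus for q in S_minus}):
        k = min_errors_for_L(S_plus, S_minus, d, L)             # matching sketch above
        fat = math.ceil(8 * L * D) ** (math.log2(lam) + 1)      # d <= ceil(8LD)^(log2 lam + 1)
        # Cap vacuous bounds at 1 (and avoid evaluating the formula where it is undefined).
        bound = 1.0 if fat >= 34 * math.e * n else min(1.0, generalization_bound(n, k, fat, delta))
        rows.append((bound, L, k))
    return min(rows)

# Toy usage on the real line with lambda = 2 (illustrative values only).
print(select_L_by_bound([0.0, 5.0, 6.0], [1.5, 9.0], lambda a, b: abs(a - b), lam=2))
```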
Conclusion
• Results:
  • Generalization bounds for Lipschitz classifiers in doubling spaces.
  • Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS.
  • Efficient calculation of the bias-variance tradeoff.
• Continuing research:
  • Similar results for continuous labels.